
# Vosk Colab Demo

Vosk is an open-source offline speech recognition toolkit. It supports more than 20 languages and dialects, including English, German, Russian, Chinese, and Czech. Language models range in size from tens of megabytes to several gigabytes; larger models are more accurate. For more information, see https://alphacephei.com/vosk/.



This notebook demonstrates Vosk recognition capabilities.

# Install module and prepare the file

First, install the `vosk` module:

In [1]:
!pip3 install vosk

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting vosk
 Downloading vosk-0.3.44-py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (7.2 MB)
 |████████████████████████████████| 7.2 MB 29.3 MB/s
Collecting srt
 Downloading srt-3.5.2.tar.gz (24 kB)
Building wheels for collected packages: srt
 Building wheel for srt (setup.py) ... done
 Created wheel for srt: filename=srt-3.5.2-py3-none-any.whl size=22487 sha256=1bba28757dd764450db53d963f0db37d1d03fcf8dadf68eaea4c6159e6b529f5
 Stored in directory: /root/.cache/pip/wheels/54/c4/ec/4604122e072aebb16803c8297b7cd3f4c72073a3ee58738015
Successfully built srt
Installing collected packages: srt, vosk
Successfully installed srt-3.5.2 vosk-0.3.44


## Importing the necessary modules

Next, import the modules required by all the examples below:

In [23]:
from vosk import Model, KaldiRecognizer
import wave
import json

## Download example audio file

To try your own audio, replace the example URL in the code below with the URL of your file. The second cell plays the downloaded file so that you can listen to it.

In [21]:
!wget -q -O /content/test.wav https://github.com/alphacep/vosk-api/raw/master/python/example/test.wav


In [22]:
import IPython
IPython.display.Audio("/content/test.wav")

# Recognition examples



By default, Vosk uses `vosk-model-small-en-us-0.15`, which is selected by the `en-us` value of the `lang` option. The other options, `model_path` and `model_name`, let you use a specific model path or model name.

The first time a model is requested, it is downloaded and saved automatically; later requests reuse the already downloaded copy.

Initializing the model by language:


In [7]:
model = Model(lang="en-us")

vosk-model-small-en-us-0.15.zip: 100%|██████████| 39.3M/39.3M [00:03<00:00, 13.0MB/s]


Open the downloaded file in 'read bytes' mode as a wave object:

In [None]:
wf = wave.open('/content/test.wav', 'rb')
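
Before handing a file to the recognizer, it can help to confirm its format. The sketch below reads the basic properties of a WAV file with the standard `wave` module. The requirement of mono, 16-bit, uncompressed PCM is an assumption taken from the Vosk example scripts rather than from this notebook, and the helper names `describe_wav` and `is_mono_pcm16` are hypothetical. The demo writes its own silent file so that it runs without the downloaded example.

```python
import os
import tempfile
import wave


def describe_wav(path):
    """Return basic format facts for a WAV file, read with the standard wave module."""
    with wave.open(path, "rb") as wf:
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate": wf.getframerate(),
            "compression": wf.getcomptype(),
        }


def is_mono_pcm16(info):
    """True for uncompressed, single-channel, 16-bit audio (assumed input format)."""
    return (
        info["channels"] == 1
        and info["sample_width_bytes"] == 2
        and info["compression"] == "NONE"
    )


# Self-contained demo: write one second of 16 kHz silence to a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "silence.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 16000)

info = describe_wav(demo_path)
print(info, is_mono_pcm16(info))
```

In the notebook, the same check could be run as `describe_wav('/content/test.wav')`.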

The `KaldiRecognizer` class provides the methods used in this notebook, such as `SetWords`, `SetPartialWords`, `AcceptWaveform`, and others.

The first parameter of `KaldiRecognizer` is the model object. The second is the sample rate in Hz. It can be passed directly as a number, such as 8000 or 16000, or read from the audio file with the `getframerate` method, as in the following code fragment.

Creating a KaldiRecognizer object with model and sample rate arguments:

In [8]:
rec = KaldiRecognizer(model, wf.getframerate())

The commands above are the same for most of the examples; the following ones differ.

Enable timestamps for recognized words in both results and partial results with the `SetWords` and `SetPartialWords` methods:

In [9]:
rec.SetWords(True)
rec.SetPartialWords(True)
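
With these options enabled, each result carries a list of per-word entries with `word`, `start`, `end`, and `conf` fields. The sketch below shows one way to read them with the standard `json` module. The sample string is copied from the partial output printed by this notebook, and the helper name `word_timings` is hypothetical.

```python
import json

# Sample copied from the partial output printed by this notebook.
sample = """
{
  "partial" : "one zero zero",
  "partial_result" : [{
      "conf" : 1.000000,
      "end" : 1.110000,
      "start" : 0.840000,
      "word" : "one"
    }, {
      "conf" : 1.000000,
      "end" : 1.530000,
      "start" : 1.110000,
      "word" : "zero"
    }, {
      "conf" : 1.000000,
      "end" : 1.890000,
      "start" : 1.530000,
      "word" : "zero"
    }]
}
"""


def word_timings(result_json):
    """Return (word, start, end) tuples from a result or partial-result string."""
    parsed = json.loads(result_json)
    # Results list words under "result", partial results under "partial_result".
    entries = parsed.get("result") or parsed.get("partial_result") or []
    return [(e["word"], e["start"], e["end"]) for e in entries]


for word, start, end in word_timings(sample):
    print(f"{start:5.2f}-{end:5.2f}  {word}")
```

A string without word entries, such as an empty partial result, yields an empty list.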

The `AcceptWaveform` method returns a truthy value when it detects a pause after a speech fragment in the audio. At that point the recognized fragment can be retrieved from the recognizer and printed.

The `KaldiRecognizer` class also provides methods for retrieving recognition results: `Result`, `PartialResult`, and `FinalResult`.


> The `PartialResult` method returns a JSON string whose "partial" key holds the text recognized so far in the current fragment, before a pause has ended it.

> The `Result` method returns a JSON string whose "text" key holds a recognized fragment of the audio that ended with a pause, such as a phrase or a sentence.

> The `FinalResult` method returns a JSON string whose "text" key holds the text recognized from the audio still buffered at the end of the stream.

Run the recognition process:

In [10]:
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())
    else:
        print(rec.PartialResult())

print(rec.FinalResult())

{
 "partial" : ""
}
{
 "partial" : ""
}
{
 "partial" : ""
}
{
 "partial" : ""
}
{
 "partial" : ""
}
{
 "partial" : ""
}
{
 "partial" : ""
}
{
 "partial" : ""
}
{
 "partial" : ""
}
{
 "partial" : ""
}
{
 "partial" : ""
}
{
 "partial" : ""
}
{
 "partial" : "one zero zero",
 "partial_result" : [{
 "conf" : 1.000000,
 "end" : 1.110000,
 "start" : 0.840000,
 "word" : "one"
 }, {
 "conf" : 1.000000,
 "end" : 1.530000,
 "start" : 1.110000,
 "word" : "zero"
 }, {
 "conf" : 1.000000,
 "end" : 1.890000,
 "start" : 1.530000,
 "word" : "zero"
 }]
}
{
 "partial" : "one zero zero",
 "partial_result" : [{
 "conf" : 1.000000,
 "end" : 1.110000,
 "start" : 0.840000,
 "word" : "one"
 }, {
 "conf" : 1.000000,
 "end" : 1.530000,
 "start" : 1.110000,
 "word" : "zero"
 }, {
 "conf" : 1.000000,
 "end" : 1.890000,
 "start" : 1.530000,
 "word" : "zero"
 }]
}
{
 "result" : [{
 "conf" : 1.000000,
 "end" : 1.110000,
 "start" : 0.840000,
 "word" : "one"
 }, {
 "conf" : 1.000000,
 "end" : 1.530000,
 "start" : 1.110

## Recognition with alternatives

Repeat the setup steps described above:

In [32]:
wf = wave.open('/content/test.wav', 'rb')
model = Model(lang="en-us")
rec = KaldiRecognizer(model, wf.getframerate())
rec.SetWords(True)

The `SetMaxAlternatives(n)` method of the `KaldiRecognizer` class makes the recognizer return no more than `n` alternatives for each recognized result. Alternatives may appear, for example, when the audio quality is low.

In [12]:
rec.SetMaxAlternatives(10)

The recognition result is converted from a string to a dictionary using the `json.loads` method, which makes further processing more convenient.
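
Once a result is a dictionary, choosing among alternatives is plain Python. The sketch below parses a result copied from this notebook's output and selects the entry with the largest `confidence` value. Treating a larger value as better is an assumption, and the helper name `best_alternative` is hypothetical. This particular run produced a single alternative, so the selection is trivial here.

```python
import json

# Result copied from this notebook's output with alternatives enabled.
raw = """
{"alternatives": [{
    "confidence": 265.527069,
    "result": [
        {"end": 1.11, "start": 0.84, "word": "one"},
        {"end": 1.53, "start": 1.11, "word": "zero"},
        {"end": 1.92, "start": 1.53, "word": "zero"},
        {"end": 2.31, "start": 1.92, "word": "zero"},
        {"end": 2.61, "start": 2.31, "word": "one"}
    ],
    "text": "one zero zero zero one"
}]}
"""


def best_alternative(parsed):
    """Return the alternative with the largest confidence value, or None."""
    alternatives = parsed.get("alternatives", [])
    if not alternatives:
        return None
    # Assumption: a larger "confidence" value marks a better hypothesis.
    return max(alternatives, key=lambda alt: alt["confidence"])


best = best_alternative(json.loads(raw))
print(best["text"], best["confidence"])
```

Partial results have no `alternatives` key, so the helper returns `None` for them.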

Run the recognition process:

In [13]:
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(json.loads(rec.Result()))
    else:
        print(json.loads(rec.PartialResult()))

print(json.loads(rec.FinalResult()))

{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': 'one'}
{'partial': 'one zero'}
{'partial': 'one zero zero'}
{'partial': 'one zero zero'}
{'partial': 'one zero zero zero'}
{'partial': 'one zero zero zero one'}
{'partial': 'one zero zero zero one'}
{'partial': 'one zero zero zero one'}
{'partial': 'one zero zero zero one'}
{'alternatives': [{'confidence': 265.527069, 'result': [{'end': 1.11, 'start': 0.84, 'word': 'one'}, {'end': 1.53, 'start': 1.11, 'word': 'zero'}, {'end': 1.92, 'start': 1.53, 'word': 'zero'}, {'end': 2.31, 'start': 1.92, 'word': 'zero'}, {'end': 2.61, 'start': 2.31, 'word': 'one'}], 'text': 'one zero zero zero one'}]}
{'partial': ''}
{'partial': ''}
{'partial': 'nah no'}
{'partial': 'nah no'}
{'partial': 'nah no to'}
{'partial': 'nah no to i know'}
{'partial': 'nah no to i know'}
{'partial': 'nah no to i know'}
{'alternatives': [{'confidence': 174.606827, 'result': [{'end': 4.11, 'start': 3.93, 'word': 'nah'}, {'end': 4.29, 

## Grammar recognizer


Now let's demonstrate how an online grammar can improve accuracy.

In [30]:
wf = wave.open('/content/test.wav', "rb")
rec = KaldiRecognizer(model, wf.getframerate(), '["one zero zero zero one", "nine oh two one oh", "zero one eight zero three", "[unk]"]')
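
The third argument above is a JSON array of phrases written out as a string. When the phrase list comes from Python data, one option is to build that string with `json.dumps` instead of writing it by hand. This is a minimal sketch; the variable names `phrases` and `grammar` are hypothetical.

```python
import json

# Phrases taken from the recognizer call above; "[unk]" is kept as in the original.
phrases = [
    "one zero zero zero one",
    "nine oh two one oh",
    "zero one eight zero three",
    "[unk]",
]

# json.dumps produces a JSON array with correct quoting and escaping.
grammar = json.dumps(phrases)
print(grammar)
```

The resulting string can then be passed in place of the literal, as in `KaldiRecognizer(model, wf.getframerate(), grammar)`.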

Using this recognizer, we can get more accurate results, since we have already specified the expected input:

In [31]:
while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())
    else:
        jres = json.loads(rec.PartialResult())
        print(jres)


{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': ''}
{'partial': 'one'}
{'partial': 'one zero'}
{'partial': 'one zero'}
{'partial': 'one zero zero'}
{'partial': 'one zero zero'}
{'partial': 'one zero zero zero'}
{'partial': 'one zero zero zero one'}
{'partial': 'one zero zero zero one'}
{'partial': 'one zero zero zero one'}
{
 "text" : "one zero zero zero one"
}
{'partial': ''}
{'partial': 'one'}
{'partial': 'nine'}
{'partial': 'nine oh two'}
{'partial': 'nine oh two one'}
{'partial': 'nine oh two one oh'}
{'partial': 'nine oh two one oh'}
{'partial': 'nine oh two one oh'}
{
 "text" : "nine oh two one oh"
}
{'partial': 'one'}
{'partial': 'one'}
{'partial': ''}
{'partial': 'zero'}
{'partial': 'zero one'}
{'partial': 'zero one eight'}
{'partial': 'zero one eight zero'}
{'partial': 'zero one eight zero'}
{'partial': 'zero one eight zero three'}
{'partial': 'zero one eight zero three'}
{'partial': 'zero one eight zero three'}
