1. Introduction

Sometimes we need to turn the speech in an audio file into text. To do this, we run the audio through a pre-trained speech recognition model, which outputs the recognized sentences for us to use as needed.

In this tutorial, we’ll explore various tools designed to convert audio files to text. Some perform this conversion directly, while others require an initial MP3 to WAV conversion.

2. Converting MP3 to WAV

Many speech recognition tools expect uncompressed 16-bit mono WAV input. Converting MP3 to WAV doesn’t restore any quality lost to MP3 compression, but it puts the audio into the format these tools can process. Hence, let’s see how to convert MP3 files to WAV first.

First, let’s install ffmpeg via apt-get:

$ sudo apt-get install ffmpeg

Once installed, we convert the MP3 file to the required WAV format:

$ ffmpeg -i sound.mp3 -ar 16000 -ac 1 convertedFile.wav
...
      encoder         : Lavc60.3.100 pcm_s16le
size=    6690kB time=00:03:34.07 bitrate= 256.0kbits/s speed= 406x
video:0kB audio:6690kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.002511%

Let’s break down this conversion:

  • -i specifies the input file
  • -ar 16000 sets the audio sample rate to 16000 Hz
  • -ac 1 sets the audio channels to mono

In summary, this ffmpeg command takes an input MP3 file named sound.mp3 and converts it into convertedFile.wav. During the conversion, it sets the audio sample rate to 16000 Hz and uses a single audio channel, resulting in a mono WAV file.

Again, this is the format that speech recognition tools commonly expect.
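If we prefer to drive the conversion from a script, the same ffmpeg call can be wrapped in Python. The following is only a minimal sketch: the function names ffmpeg_command, convert, and wav_format are our own, and it assumes ffmpeg is available on the PATH. The last helper uses the standard wave module to check that the result really is 16-bit mono at 16000 Hz:

```python
import subprocess
import wave


def ffmpeg_command(src, dst, rate=16000, channels=1):
    """Build the ffmpeg call shown above: resample to `rate` Hz, downmix to `channels`."""
    return ["ffmpeg", "-i", src, "-ar", str(rate), "-ac", str(channels), dst]


def convert(src, dst):
    """Run ffmpeg; check=True raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_command(src, dst), check=True)


def wav_format(path):
    """Return (channels, bits per sample, sample rate) read from the WAV header."""
    with wave.open(path, "rb") as wf:
        return wf.getnchannels(), wf.getsampwidth() * 8, wf.getframerate()
```

For example, after convert("sound.mp3", "convertedFile.wav"), we’d expect wav_format("convertedFile.wav") to report one channel, 16 bits, and 16000 Hz.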

3. Using vosk

vosk is a speech recognition toolkit based on neural networks. In addition, it provides a Python API for integrating speech recognition capabilities into scripts and applications. The key feature of vosk is its ability to transcribe spoken language into written text using pre-trained neural network models.

Notably, vosk doesn’t have the inherent capability to directly convert MP3 format audio to text, so we use our convertedFile.wav file.

3.1. Installing vosk and Models

In this case, we use the ready-made example Python scripts from the vosk-api repository. These scripts call the vosk library, which in turn relies on a pre-trained model to perform the actual recognition.

First, let’s install vosk from the package installer for Python (pip):

$ pip3 install vosk

In particular, this installs the vosk Python package, which includes the vosk library itself but doesn’t include the pre-trained models.

Now, we clone the vosk-api repository via git and navigate to the example directory:

$ git clone https://github.com/alphacep/vosk-api
$ cd vosk-api/python/example

After cloning the repository, we can see all available scripts:

$ ls -l
total 67492
...
-rwxr-xr-x 1 amir amir      529 Nov 25 18:53 test_srt.py
-rwxr-xr-x 1 amir amir      668 Nov 25 18:53 test_text.py
-rwxr-xr-x 1 amir amir     1770 Nov 25 18:53 test_webvtt.py
-rwxr-xr-x 1 amir amir      864 Nov 25 18:53 test_words.py

These scripts include tests and demonstrations of various features and functionalities of the vosk speech recognition library in different scenarios.

Now, let’s download and unzip the small English model:

$ wget https://alphacephei.com/kaldi/models/vosk-model-small-en-us-0.3.zip
$ unzip vosk-model-small-en-us-0.3.zip
$ mv vosk-model-small-en-us-0.3 model

First, we download the model for English speech recognition via wget. Next, we unzip the downloaded file. Finally, we rename the extracted directory to model.

Of course, the choice of model depends on the specific requirements of the application, considering factors like resource constraints and language support. Hence, we can explore and download models for different languages based on our needs.
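When we switch between models often, the download, unzip, and rename steps can also be scripted. The sketch below uses only the Python standard library. The names download_model and unpack_model are hypothetical, and it assumes the archive contains exactly one top-level folder, as the small English model does:

```python
import shutil
import tempfile
import urllib.request
import zipfile
from pathlib import Path


def download_model(url, zip_path):
    """Fetch a model archive; url is whichever model link we chose."""
    urllib.request.urlretrieve(url, zip_path)
    return zip_path


def unpack_model(zip_path, target_dir="model"):
    """Extract the archive and move its single top-level folder to target_dir."""
    target = Path(target_dir)
    if target.exists():
        raise FileExistsError(f"{target} already exists")
    with tempfile.TemporaryDirectory() as tmp:
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(tmp)
        folders = [p for p in Path(tmp).iterdir() if p.is_dir()]
        if len(folders) != 1:
            raise ValueError("expected exactly one top-level folder in the archive")
        # Equivalent to the mv step: the extracted folder becomes target_dir.
        shutil.move(str(folders[0]), str(target))
    return target
```

Refusing to overwrite an existing model directory is a deliberate choice here, so that an earlier model isn’t silently replaced.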

3.2. Converting to Text

Now, we can start the conversion via Python:

$ python3 test_text.py convertedFile.wav
...
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
...
learn english super fast
learn english by focusing on content
not grammar
what do i mean by content
i mean learn english by focusing on meaningful

In this case, we used the test_text.py script to recognize the text in the WAV file. The transcript appears in the last lines of the output.
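To get a feel for what such a script does, here’s a minimal sketch built on the vosk Python API (Model, KaldiRecognizer, AcceptWaveform, Result, and FinalResult). It isn’t the content of test_text.py itself. The function names are our own, and it assumes a mono 16-bit WAV file and a model directory named model:

```python
import json
import wave


def text_from_result(result_json):
    """Pull the transcript out of one JSON result string from the recognizer."""
    return json.loads(result_json).get("text", "")


def transcribe(wav_path, model_dir="model"):
    """Return the recognized utterances of a mono 16-bit WAV file as a list of lines."""
    # Imported here so the helper above stays usable without vosk installed.
    from vosk import KaldiRecognizer, Model

    lines = []
    with wave.open(wav_path, "rb") as wf:
        rec = KaldiRecognizer(Model(model_dir), wf.getframerate())
        while True:
            data = wf.readframes(4000)
            if not data:
                break
            # AcceptWaveform returns true once an utterance is complete.
            if rec.AcceptWaveform(data):
                lines.append(text_from_result(rec.Result()))
        # Flush whatever audio is still buffered in the recognizer.
        lines.append(text_from_result(rec.FinalResult()))
    return [line for line in lines if line]
```

Calling transcribe("convertedFile.wav") from the directory that holds the model folder should then yield one string per recognized utterance.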

Of course, we can also use another text format such as srt:

$ python3 test_srt.py convertedFile.wav
1
00:00:02,220 --> 00:00:04,470
learn english super fast

2
00:00:05,580 --> 00:00:08,100
learn english by focusing on content ...

Here, we used test_srt.py to convert convertedFile.wav to the SRT format, which is commonly used for video subtitles. Each entry pairs a sequence number and a start and end timestamp with the text to show, so the subtitles stay synchronized with the audio.
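The SRT layout seen in this output is simple enough to produce by hand. As an illustration only, the sketch below formats a time in seconds as an SRT timestamp and assembles one subtitle entry. The helper names srt_timestamp and srt_cue are hypothetical and aren’t part of vosk:

```python
def srt_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS,mmm, the timestamp style SRT uses."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def srt_cue(index, start, end, text):
    """Build one SRT entry: sequence number, time range, then the subtitle text."""
    return f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"


print(srt_cue(1, 2.22, 4.47, "learn english super fast"))
```

This prints the same three lines as the first entry in the output above.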

4. Using pocketsphinx

pocketsphinx is a lightweight speech recognition engine designed for resource-constrained environments.

Notably, as pocketsphinx doesn’t directly convert MP3 to text, we use a prior conversion to a WAV format.

First, let’s install pocketsphinx:

$ sudo apt-get install pocketsphinx

After the WAV conversion, we can start the recognition via pocketsphinx_continuous:

$ pocketsphinx_continuous -infile convertedFile.wav > output.txt
INFO: pocketsphinx.c(151): Parsed model-specific feature parameters
...
INFO: ngram_search.c(301): TOTAL bestpath 0.23 CPU 0.001 xRT
INFO: ngram_search.c(304): TOTAL bestpath 0.24 wall 0.001 xRT

In this case, we ran the pocketsphinx speech recognition system in continuous mode, using the -infile option to specify the input audio file. The shell redirection sends the recognized text to output.txt, while the log messages still appear in the terminal. At the end of the log, we see statistics about the recognition process, including CPU time.

Alternatively, we can use the pocketsphinx_batch command for batch processing:

$ pocketsphinx_batch -adcin yes -cepdir wave_directory_files

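Here, -adcin yes tells pocketsphinx_batch that the input files contain audio samples rather than precomputed feature files. Also, -cepdir specifies the directory containing the input WAV files for speech recognition. In practice, the command usually also needs a control file that lists the files to process, passed via -ctl.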
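Another way to process several files is to loop over a directory and call pocketsphinx_continuous once per file. The sketch below takes that route using only the -infile option we’ve already seen. The function names plan_jobs and transcribe_directory are our own, and writing each transcript next to its WAV file is just one possible convention:

```python
import subprocess
from pathlib import Path


def plan_jobs(wav_dir):
    """Pair every .wav file in wav_dir with the .txt file its transcript goes to."""
    return [(wav, wav.with_suffix(".txt")) for wav in sorted(Path(wav_dir).glob("*.wav"))]


def transcribe_directory(wav_dir):
    """Run pocketsphinx_continuous on each WAV file, one transcript file per input."""
    for wav, txt in plan_jobs(wav_dir):
        with open(txt, "w") as out:
            # Only stdout is captured, mirroring the > redirection used earlier.
            subprocess.run(
                ["pocketsphinx_continuous", "-infile", str(wav)],
                stdout=out,
                check=True,
            )
```

This is slower than a true batch run because the models are loaded again for every file, but it keeps the invocation identical to the single-file case.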

5. Using spchcat

spchcat is a command-line utility designed to process audio input from WAV files, a microphone, or system audio, converting identified speech into text. It operates exclusively on the local machine, without relying on web API calls or other network interactions.

Notably, as spchcat doesn’t directly transcribe MP3 to text, we initiate a preliminary conversion to WAV format.

First, let’s download spchcat via wget:

$ wget https://github.com/petewarden/spchcat/releases/download/v0.0.2-alpha/spchcat_0.0-2_amd64.deb

Once downloaded, we install it:

$ sudo dpkg -i spchcat_0.0-2_amd64.deb

Then, we can start the conversion:

$ spchcat convertedFile.wav
TensorFlow: v2.3.0-14-g4bdd3955115
Coqui STT: v1.1.0-0-gf3605e23
...
learn english super fast
learn english by focusing on content
not grammar
what do i mean by content
...

Here, we perform speech recognition on the convertedFile.wav audio file. The first lines of the output report the versions of TensorFlow and Coqui STT that spchcat runs on, and the recognized text follows in the second part of the output.

Of course, the default is English, but we can set another language:

$ spchcat --language=it_IT

In this case, we set the recognition language to Italian instead of the default English.

6. Using whisper

whisper is a versatile speech recognition model trained on a diverse audio dataset. Hence, it’s equipped to handle various speech-processing tasks such as multilingual speech recognition, speech translation, spoken language identification, and voice activity detection.

Notably, whisper can directly convert audio from MP3 to text, so we don’t need an intermediate format conversion step. Internally, it relies on ffmpeg to decode the audio, so ffmpeg still needs to be installed.

First, let’s install whisper:

$ pip install openai-whisper

Then, we can start the conversion:

$ whisper sound.mp3
Detecting language using up to the first 30 seconds. Use `--language` to specify the language
Detected language: English
[00:00.000 --> 00:12.440]  Learn English super fast. Learn English by focusing on content, not grammar.
[00:12.440 --> 00:19.840]  What do I mean by content? I mean learn English by focusing on meaningful communication and
[00:19.840 --> 00:27.680]  meaningful information. Most schools and most students focus on the mechanics of the language.
...

Of course, we can use whisper to convert other languages:

$ whisper italian.mp3 --language Italian
[00:00.000 --> 00:05.880] Ciao a tutti e bentornati, oppure benvenuti sul mio canale.
[00:05.880 --> 00:08.380] Attivate i sottotitoli.
[00:10.220 --> 00:14.620] Oggi facciamo un altro esercizio di conversazione.
[00:14.620 --> 00:21.980] Ci concentriamo su un dialogo estremamente facile, molto semplice,
[00:21.980 --> 00:27.260] con delle frasi che spesso si danno per scontate,

In this case, we used --language Italian to tell whisper that the speech is in Italian, which skips the automatic language detection step.

Moreover, we can even translate the output from Italian into English:

$ whisper italian.mp3 --language Italian --task translate
[00:00.000 --> 00:05.920]  Hello everyone and welcome back, or welcome to my channel!
[00:05.920 --> 00:08.240]  Activate the subtitles!
[00:10.240 --> 00:14.640]  Today we do another conversation exercise.
[00:14.640 --> 00:22.000]  We focus on an extremely easy dialogue, very simple,
[00:22.000 --> 00:27.280]  with phrases that often get me lost,

Here, we used the --task translate option to translate the recognized text into English.
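The same operations are also available from Python through the whisper package’s load_model and transcribe functions. The following is a minimal sketch rather than a reference implementation. The helper names clock, format_segment, and transcribe are our own, the base model size is an arbitrary choice, and the line format only approximates what the command-line tool prints:

```python
def clock(seconds):
    """Format seconds as MM:SS.mmm (this sketch folds hours into the minutes field)."""
    total_ms = round(seconds * 1000)
    minutes, rest = divmod(total_ms, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segment(segment):
    """Render one segment dict (start, end, text) as a timestamped line."""
    return f"[{clock(segment['start'])} --> {clock(segment['end'])}] {segment['text']}"


def transcribe(path, language=None, task="transcribe", model_name="base"):
    """Transcribe (or translate into English) an audio file, one line per segment."""
    # Imported lazily so the formatting helpers work without whisper installed.
    import whisper

    model = whisper.load_model(model_name)
    # language=None lets whisper detect the language itself.
    result = model.transcribe(path, language=language, task=task)
    return [format_segment(segment) for segment in result["segments"]]
```

For instance, transcribe("italian.mp3", language="it", task="translate") corresponds to the last command above.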

7. Conclusion

In this article, we delved into a variety of tools for speech recognition from MP3 files, i.e., converting audio to text.

In general, we looked at vosk, pocketsphinx, spchcat, and whisper. whisper provides a straightforward, all-in-one process that accepts MP3 files directly. The other tools require a prior conversion to WAV, but they can be a better fit for specific needs such as subtitle output, resource-constrained environments, or live microphone input.