Whisper AI

Whisper is an open source speech recognition model released by OpenAI. This is a quick review with some testing on English audio, including a few curve balls.

Introduction

OpenAI, the company behind the text-to-image AI DALL-E, has released an open source neural network called Whisper that is trained to recognize speech. This article tests it on English audio.

It can listen to audio and transcribe it into text, but it is not well suited to real-time speech, so it is not an Alexa or Siri competitor at the moment, unless you want to load it onto a big GPU.

It is available as a Python library, and the source code is available here:
https://github.com/openai/whisper

Commentary

There was some interesting discussion on Y Combinator’s Hacker News when it was announced; you can read it here.

Some of the more interesting points:

  • This will (largely) gut the existing English speech to text transcription market; Whisper is now good enough for many people.
  • “It’s one model and in a non-strategic area where there are existing open source projects” link
    In other words, it was unlikely to find a paying market and there was no downside to releasing it Open Source.
  • “This kind of model is harder to abuse” link
    Unlike text to image, where objectionable material could be created, speech to text is a lower risk, less likely to be used by bad actors.
  • It is released under the MIT license, so it could be included in commercial products.
    MIT licence summary here.

Testing

We are going to run some basic tests using well-known speeches. The expectation is that they were most likely part of the training data set, so they should be no problem for Whisper.

But then we will get tricky and throw in some audio from YouTube, including music and cartoon characters with speech impediments. The goal is to test it with something that might not have been part of the training data set.

Dependencies and Installation

The following are assumptions and dependencies:

  • macOS, 12.6 (Monterey)
  • Python3, 3.10.6
  • Brew
  • ffmpeg
  • youtube-dl

To install Brew please refer to the installation instructions:
https://docs.brew.sh/Installation

To install Python 3 either download the installer from https://www.python.org/ or use Brew:

brew install python3

To confirm it is installed correctly run:

python3 -V
Python 3.10.6

Install ffmpeg and youtube-dl via Brew:

brew install ffmpeg youtube-dl

Install Whisper via pip3:

pip3 install git+https://github.com/openai/whisper.git 

Confirm installation:

whisper --help

If installed, a long list of command line options should appear.
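
Before running the tests, it can help to confirm that every command-line tool from the dependency list is actually on the PATH. The following is a minimal Python sketch, not part of Whisper itself; the helper name missing_tools and the REQUIRED_TOOLS list are made up for this illustration.

```python
import shutil

# Tools this walkthrough assumes are on the PATH (names taken from the list above).
REQUIRED_TOOLS = ["ffmpeg", "youtube-dl", "whisper"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools found.")
```

An empty result means all three commands were found; anything listed needs to be installed (or added to the PATH) before continuing.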

Testing Part 1: JFK

This first part should be a simple example of a successful test. Download the JFK speech that is part of the Whisper test suite:
https://github.com/openai/whisper/raw/main/tests/jfk.flac

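Next, in the directory where you downloaded the JFK file, run the following (the file is named tests_jfk.flac here; adjust the name if your download was saved differently):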

whisper "tests_jfk.flac" --model tiny.en --language English

If it works, you should see it (eventually) output:

[00:00.000 --> 00:08.000]  And so, my fellow Americans ask not what your country can do for you
[00:08.000 --> 00:15.000]  ask what you can do for your country.

If you get a message like the following, it is likely because you are running on a CPU, not a GPU. This is nothing to worry about, but it may run slower.

/usr/local/lib/python3.10/site-packages/whisper/transcribe.py:78: UserWarning: FP16 is not supported on CPU; using FP32 instead
  warnings.warn("FP16 is not supported on CPU; using FP32 instead")

Several models are available, in different sizes and with different accuracy; this article uses tiny.en, base.en, small.en and medium.en. Generally, the larger the model, the more accurate it is. As a model needs to be loaded into memory, the start-up time will be longer for larger models.
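
The same transcription can also be driven from Python, since Whisper is a Python library. The sketch below assumes the whisper package installed earlier; load_model and transcribe are the library's calls, while format_timestamp, format_segment and transcribe_file are helper names invented here. Passing fp16=False requests FP32 up front, which should avoid the CPU warning shown above. The formatter assumes short clips and always prints MM:SS.mmm.

```python
def format_timestamp(seconds):
    """Format seconds as MM:SS.mmm, mirroring the timestamps shown above."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segment(segment):
    """Render one segment dict (keys: start, end, text) as a transcript line."""
    start = format_timestamp(segment["start"])
    end = format_timestamp(segment["end"])
    return f"[{start} --> {end}]  {segment['text'].strip()}"


def transcribe_file(path, model_name="tiny.en"):
    """Transcribe `path` and return the formatted transcript lines."""
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)
    # fp16=False asks for FP32 directly instead of falling back with a warning.
    result = model.transcribe(path, language="en", fp16=False)
    return [format_segment(seg) for seg in result["segments"]]
```

Calling transcribe_file("tests_jfk.flac") should return lines in the same shape as the command-line output above; swapping model_name for "base.en" or "small.en" selects a larger model.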

Testing Part 2: Churchill

Let’s try something longer. Here is a famous Winston Churchill speech from WWII to download:
Winston Churchill, We Shall Never Surrender

Now we will do a quick edit with ffmpeg to grab just the last part of the speech, which is perhaps the best-known part.

ffmpeg -ss 00:11:26.30 -to 00:12:10.00 \
-i 1940-06-04_BBC_Winston_Churchill_We_Shall_Never_Surrender.mp3 \
1940-06-04_BBC_Winston_Churchill_We_Shall_Never_Surrender-edit.mp3
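
As a sanity check on the cut points, the clip length can be worked out from the two timestamps. This is a small Python sketch; the function names are invented for illustration, and build_trim_command simply mirrors the ffmpeg arguments used above.

```python
import subprocess


def parse_timestamp(stamp):
    """Convert an HH:MM:SS.ss timestamp string to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def clip_duration(start, end):
    """Length in seconds of the clip between two HH:MM:SS.ss timestamps."""
    return round(parse_timestamp(end) - parse_timestamp(start), 3)


def build_trim_command(source, target, start, end):
    """Build the ffmpeg argument list used above to cut out a clip."""
    return ["ffmpeg", "-ss", start, "-to", end, "-i", source, target]


def trim(source, target, start, end):
    """Run ffmpeg; raises CalledProcessError if ffmpeg exits with an error."""
    subprocess.run(build_trim_command(source, target, start, end), check=True)
```

For the timestamps above, clip_duration("00:11:26.30", "00:12:10.00") works out to 43.7 seconds, so the transcript for this clip should not run much past the 43 second mark.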

Now run Whisper.

whisper "1940-06-04_BBC_Winston_Churchill_We_Shall_Never_Surrender-edit.mp3" --model tiny.en --language English
[00:00.000 --> 00:07.000]  We shall fight on the beaches, we shall fight on the landing grounds, we shall fight in the fields,
[00:07.000 --> 00:13.000]  and in the streets, we shall fight in the hills, we shall never surrender.
[00:13.000 --> 00:22.000]  And if, because I do not for a moment believe, this island or large part of it was subjugated
[00:22.000 --> 00:32.000]  starving, then our empire beyond the seas, armed and guarded by the British feet, would carry on the struggle
[00:32.000 --> 00:35.000]  until in God's good time.
[00:35.000 --> 00:43.000]  The new world with all its power might step forth to the rescue and the liberation of the old.

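Almost perfect! The main slips are “British feet” for “British fleet” and “subjugated starving” for “subjugated and starving”.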

Performance

A quick note on performance: the faster your CPU, the faster the transcription. If you have a GPU and it can be used, it will dramatically reduce the time needed to convert audio to text.
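
One way to put a number on this is the real-time factor: processing time divided by the length of the audio. The Python sketch below is only an illustration; the helper names are invented here, and no timings were measured for this article.

```python
import time


def timed(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    started = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - started


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time divided by audio length; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds
```

Wrapping a transcription call in timed(...) and dividing the elapsed time by the clip length gives a figure that can be compared across models and machines.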

Testing Part 3: Tom’s Diner

Now for something more interesting. Tom’s Diner by Suzanne Vega has some interesting history: it was used as a reference track during the creation of the mp3 file format.

There is also an acapella version and several other versions & mixes.

So let’s see how Whisper does.

Now, if you happen to have a copy of Tom’s Diner Acapella, go ahead and use that.

If you don’t, let me introduce you to your new best friend, youtube-dl. The following will download the Tom’s Diner video from YouTube and save 20 seconds of it (falling within fair use).

Suzanne Vega, Tom’s Diner, Acapella:
https://www.youtube.com/watch?v=mto47BMT3yA

youtube-dl -f bestaudio \
--extract-audio --audio-format mp3 --audio-quality 0 \
https://www.youtube.com/watch?v=mto47BMT3yA \
--no-check-certificate --output "toms-diner-acapella.mp3" \
--external-downloader ffmpeg \
--external-downloader-args "-ss 00:00:02.00 -to 00:00:22.00"
whisper "toms-diner-acapella.mp3" --model tiny.en --language English
[00:00.000 --> 00:06.000]  I am sitting in the morning at the diner on the corner
[00:06.000 --> 00:11.000]  I am waiting at the counter for the man to report the coffee
[00:11.000 --> 00:13.000]  And he fills it only halfway
[00:13.000 --> 00:18.000]  And before I even argue he is looking out the window
[00:18.000 --> 00:45.000]  It's somebody coming in

Cool. It hears “report the coffee” for “pour the coffee” and messes up the end, but the latter is most likely because Whisper uses context and there is limited context at the end of the section (and the lyrics are a little awkward).
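
One detail in the output above is that the final segment is stamped as ending at 00:45, although the clip was cut to 20 seconds. If the timestamps are going to be reused (for subtitles, say), a small Python sketch like the following could parse the printed lines and cap the end times. The names are invented here, and the pattern assumes the MM:SS.mmm format shown in this article.

```python
import re

# Matches lines such as "[00:18.000 --> 00:45.000]  It's somebody coming in".
LINE_PATTERN = re.compile(
    r"\[(\d+):(\d+\.\d+) --> (\d+):(\d+\.\d+)\]\s*(.*)"
)


def parse_line(line):
    """Return (start_seconds, end_seconds, text) for one transcript line."""
    match = LINE_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"not a transcript line: {line!r}")
    start_min, start_sec, end_min, end_sec, text = match.groups()
    start = int(start_min) * 60 + float(start_sec)
    end = int(end_min) * 60 + float(end_sec)
    return start, end, text


def clamp_to_clip(segments, clip_seconds):
    """Cap each segment's end time at the known clip length."""
    return [(start, min(end, clip_seconds), text) for start, end, text in segments]
```

Parsing the last line above and clamping it to the 20 second clip length turns the 18 to 45 second span into 18 to 20 seconds.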

But wait! We are using the tiny.en model; let’s try a bigger model to see if that is more accurate:

whisper "toms-diner-acapella.mp3" --model base.en --language English
[00:00.000 --> 00:06.000]  I am sitting in the morning at the diner on the corner
[00:06.000 --> 00:11.000]  I am waiting at the counter for the man to pour the coffee
[00:11.000 --> 00:15.000]  And he fills it only halfway and before I even argue
[00:15.000 --> 00:27.000]  He is looking out the window at somebody coming in

Nailed it!

Okay, so the base.en model is able to understand the acapella perfectly. Now let’s use the ‘song’ version; it has extra noise, which should be more challenging.

Suzanne Vega, Tom’s Diner:
https://www.youtube.com/watch?v=-26hsZqwveA

youtube-dl -f bestaudio \
--extract-audio --audio-format mp3 --audio-quality 0 \
https://www.youtube.com/watch?v=-26hsZqwveA \
--no-check-certificate --output "toms-diner-song.mp3" \
--external-downloader ffmpeg \
--external-downloader-args "-ss 00:00:34.00 -to 00:00:54.00"

Let’s try the base.en model:

whisper "toms-diner-song.mp3" --model base.en --language English
[00:00.000 --> 00:05.000]  I am sitting in the morning at the diner on the corner
[00:05.000 --> 00:10.000]  I am waiting at the counter for the man to pour the coffee
[00:10.000 --> 00:15.000]  And he fills it only halfway and before I even argue
[00:15.000 --> 00:41.000]  Is looking out the window at somebody coming in

Almost perfect; the last line should be ‘He is…’, not just ‘Is…’

Let’s test the next size model, small.en:

whisper "toms-diner-song.mp3" --model small.en --language English
[00:00.000 --> 00:05.000]  I am sitting in the morning at the diner on the corner
[00:05.000 --> 00:10.000]  I am waiting at the counter for the man to pour the coffee
[00:10.000 --> 00:12.000]  And he feels it only halfway
[00:12.000 --> 00:15.000]  And before I even argue
[00:15.000 --> 00:31.000]  He is looking out the window at somebody coming in...

The small.en model gets the last line right, although it now hears ‘feels’ where the lyric is ‘fills’.
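
Comparing transcripts by eye gets tedious, so a word error rate (word-level edit distance divided by the number of reference words) is one common way to score them. The following is a minimal Python sketch with invented helper names; it lowercases the text and ignores punctuation, which is a simplification, and it takes the lyric lines quoted in this article as the reference.

```python
import re


def words(text):
    """Lowercase the text and split it into words, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = words(reference), words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row 0 of the edit-distance table: turning an empty reference into hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)
```

Scoring the base.en line “Is looking out the window at somebody coming in” against “He is looking out the window at somebody coming in” counts one missing word out of ten, and the small.en “feels” for “fills” counts one wrong word out of six.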

Testing Part 4: Bugs and Daffy

Now for something that should push the AI a bit harder: Bugs Bunny.

Warner Bros. Classic Cartoon Characters: Bugs Bunny
https://www.youtube.com/watch?v=14KTu4i27j8

youtube-dl -f bestaudio \
--extract-audio --audio-format mp3 --audio-quality 0 \
https://www.youtube.com/watch?v=14KTu4i27j8 \
--no-check-certificate --output "bugs-bunny.mp3" \
--external-downloader ffmpeg \
--external-downloader-args "-ss 00:00:03.00 -to 00:00:30.00"

Let’s use the small.en model right away, as it should have the best chance of success.

whisper "bugs-bunny.mp3" --model small.en --language English
[00:00.000 --> 00:04.000]  I knew I should have taken that left coin of Albuquerque.
[00:05.000 --> 00:09.000]  Oh, well, I'll just ask this gent in the fancy knickerbockers.
[00:13.000 --> 00:20.000]  Pardon me, sir, but could you direct me to the shortest route to the Coachella Valley and the Big Carrot Festival, Darien?
[00:20.000 --> 00:26.000]  Me? What's up, dog?

“that left coin of Albuquerque”, “What’s up, dog?”, well, no, not really.

Bump it to the medium.en model:

whisper "bugs-bunny.mp3" --model medium.en --language English
[00:00.000 --> 00:04.800]  I knew I should have taken that left toilet albacookie.
[00:04.800 --> 00:08.640]  Oh, well, I'll just ask this genton to fancy knickerbockers.
[00:08.640 --> 00:10.440]  Eh, I beg your pardon...
[00:12.440 --> 00:15.140]  Eh, eh, pardon me, sir, but could you direct me
[00:15.140 --> 00:17.440]  to the shortest route to the Coachella Valley
[00:17.440 --> 00:20.140]  of the Big Carrot Festival, there in...
[00:20.140 --> 00:27.140]  Eh, what's up, Doc?

Much better. It still produces “that left toilet albacookie”, but medium.en seems to manage Bugs’ accent and characteristic speech style a bit better.

Next Daffy, which should be an interesting test.

Looney Tunes | Duck Amuck | Classic Cartoon | WB Kids
https://www.youtube.com/watch?v=6XvXsuSJ-1A

youtube-dl -f bestaudio \
--extract-audio --audio-format mp3 --audio-quality 0 \
https://www.youtube.com/watch?v=6XvXsuSJ-1A \
--no-check-certificate --output "daffy-duck.mp3" \
--external-downloader ffmpeg \
--external-downloader-args "-ss 00:00:00.00 -to 00:00:23.00"
whisper "daffy-duck.mp3" --model small.en --language English
[00:00.000 --> 00:04.000]  Stand back musketeers! They shall sample my blade!
[00:04.000 --> 00:05.000]  Touché!
[00:09.000 --> 00:10.000]  Musketeers?
[00:10.000 --> 00:13.000]  Hmm? My guard? My blade?
[00:15.000 --> 00:19.000]  Hey! Psst! Whoever's in charge here!
[00:19.000 --> 00:30.000]  A scenery! Where's the scenery?

It looks like Whisper is a fan of Daffy Duck; the small.en model had no problem with his characteristic lisssssp.

Testing Part 5: Cypress Hill

Admittedly this is a little out there, but let’s see how Whisper manages the now classic Insane in the Brain by Cypress Hill.

Cypress Hill, Insane in the Brain
https://www.youtube.com/watch?v=XOcyKObnwg4

youtube-dl -f bestaudio \
--extract-audio --audio-format mp3 --audio-quality 0 \
https://www.youtube.com/watch?v=XOcyKObnwg4 \
--no-check-certificate --output "cypress-hill.mp3" \
--external-downloader ffmpeg \
--external-downloader-args "-ss 00:00:23.00 -to 00:00:43.00"

Let’s go right to the medium.en model:

whisper "cypress-hill.mp3" --model medium.en --language English
[00:00.000 --> 00:05.000]  Feel the one on the flam Boy in temper, just toss that ham in the frying pan
[00:05.000 --> 00:10.000]  Like spam, get done when I come and slam Damn, I feel like the son of Sam
[00:10.000 --> 00:15.000]  Don't make me wreck, hectic, next to the chair Got me going like General Electric
[00:15.000 --> 00:31.000]  And the lights are blinking I'm thinking it's all over when I go out drinking

Eh, close enough?

https://genius.com/Cypress-hill-insane-in-the-brain-lyrics

Conclusion

I hope this has been interesting and shows some of the functionality and limitations of Whisper.

It is an interesting tool, and I would expect to see it included in, or leveraged by, other projects in the future.
