OpenAI, the company behind the image-generation and meme-spawning program DALL-E and the powerful text-autocomplete engine GPT-3, has launched Whisper, a new open-source neural network meant to transcribe audio into written text. The company says it can recognize speech in languages including Spanish, Italian, and Japanese, and that it can also translate them into English.

When I heard the news, I thought I'd be able to write my own app to securely transcribe audio on my own computer. There are some interviews where I'd feel more comfortable if the audio file never touched the internet.

I already have a lot of developer tools on my computer, so installing Whisper was as easy as running a single terminal command, and within 15 minutes I was using it to make a transcript of my test audio. For someone who doesn't already have any of those tools, setup would probably take more than an hour. There's already an effort underway, though, to make the process much simpler and more user-friendly.
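For reference, this is roughly what that setup looks like, a minimal sketch assuming you already have Python and ffmpeg installed (the package name and flags come from OpenAI's Whisper repository; the audio file name is just a placeholder):

```shell
# Install Whisper from PyPI (needs Python and ffmpeg on your system already).
pip install -U openai-whisper

# Transcribe an audio file; Whisper writes the transcript as a .txt file
# (plus .srt and other formats) next to the audio.
# "interview.mp3" is a placeholder file name.
whisper interview.mp3 --model small --language en
```

Larger models (`medium`, `large`) are more accurate but slower, which matters for the local-processing speeds discussed below.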

Command-line apps obviously aren’t for everyone, but for something that’s doing a relatively complex job, Whisper’s very easy to use.

OpenAI clearly saw this use case as a possibility, but the company seems to be mainly targeting researchers and developers with this release. The team hopes that Whisper's high accuracy and ease of use will let developers add voice interfaces to a much wider set of applications. Notably, OpenAI has limited access to its most popular machine-learning projects like DALL-E and GPT-3, saying it wants to learn more about real-world use first.

[Image: a text file with the transcribed lyrics for Yung Gravy’s song “Betty (Get Money).” The transcription contains many inaccuracies.]
The text files Whisper produces aren’t exactly the easiest to read if you’re using them to write an article, either.

Installing Whisper won't be easy for most people, though. A journalist and a developer are working together to turn it into a free, secure, and easy-to-use transcription app for journalists. The journalist told me he decided the program should exist after he ran some of his own interviews through Whisper and found the results to be the best transcriptions he'd ever gotten.

I'd say Whisper's output was roughly comparable to what Trint and Otter.ai produced for the same file. I wouldn't just copy and paste quotes from any of these services into an article without double-checking the audio. But Whisper's version would do the job for me: it would let me search through the transcript to find the sections I need, then double-check them manually. Stage Whisper should perform about the same, since it will use the same model, just with a GUI on top.

Stage Whisper could be obsolete within a few years, thanks to built-in transcription tech from Apple and Google, a version of which is already starting to roll out to some phones. But journalists can't wait that long; they need good auto-transcription apps now. The developer wants to have a bare-bones version of the app ready within two weeks.

Regardless of how easy it is to use, Whisper probably won't make cloud-based transcription services completely obsolete. One of the main features of traditional transcription services is labeling who said what, and Stage Whisper probably won't support that.

The cloud is, after all, just a computer owned by someone else.

The benefits of local processing come with drawbacks, too. A professional transcription service runs on computers far more powerful than your laptop. For example, it took Whisper 52 minutes to transcribe the full 24-minute interview I fed it, even after I made sure it was using the Apple Silicon version of Python. One of the cloud services, by comparison, spat out a transcript in less than eight minutes.
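For a rough sense of that speed gap, here's a back-of-the-envelope comparison using the timings from my test above (the eight-minute figure is the cloud service's turnaround):

```python
# Back-of-the-envelope transcription speed comparison.
audio_minutes = 24   # length of the interview
local_minutes = 52   # Whisper running locally on my laptop
cloud_minutes = 8    # the cloud service's turnaround

# Real-time factor: minutes of processing per minute of audio.
# Above 1.0 means slower than real time.
local_rtf = local_minutes / audio_minutes   # ≈ 2.17
cloud_rtf = cloud_minutes / audio_minutes   # ≈ 0.33

print(f"local: {local_rtf:.2f}x real time, cloud: {cloud_rtf:.2f}x real time")
# → local: 2.17x real time, cloud: 0.33x real time
```

In other words, my laptop needed more than twice the audio's length to transcribe it, while the cloud service finished in about a third of it.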

Price is one of the biggest advantages of OpenAI's tech. Cloud-based subscription services will almost certainly cost you money if you're using them professionally, and upcoming plan changes will make them less useful for people who transcribe frequently. Stage Whisper, meanwhile, will be free and will run on your own computer.

I'm very excited to see what researchers end up doing with Whisper, and what they'll learn by digging into a machine-learning model trained on 680,000 hours of multilingual audio. But it's also exciting because it has a practical use today.