The explosion in audio and video content and interfaces over the last few years has been plain to see, but the ways of dealing with all that media behind the scenes haven't quite caught up. AssemblyAI, powered by $28M in new funding, is aiming to become the go-to solution for analyzing speech, offering ultra-simple API access for transcribing, summarizing, and otherwise figuring out what's going on in thousands of audio streams at a time.

In a short time, multimedia has become the standard for so many things: phone calls and meetings became video calls, social media posts became 10-second clips, and chatbots learned to speak and understand speech. People need to be able to work with the data those applications produce in order to run them well or build something new on top of them.

Audio isn't easy to work with as data. How do you search an audio stream? You could scrub through it, but more likely you'll want to work with the text first. Yet transcription services are rarely easy to integrate into your own app or enterprise process.

If you want to do content moderation, search, or summarization on audio data, you have to convert it into a more flexible format, one you can build features and business processes on top of. People need a lot of help building those features, but they don't want to glue a bunch of providers together to get them.

AssemblyAI offers a number of different APIs you can call to do things like identify the speakers in a conversation or flag prohibited content.
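As an illustration, submitting a clip with those features enabled could look like the sketch below. The endpoint and parameter names (`audio_url`, `speaker_labels`, `content_safety`) follow AssemblyAI's public v2 API as I understand it, but treat this as an assumption-laden sketch rather than a definitive integration.

```python
import json
import urllib.request

# Assumed v2 transcript endpoint; check AssemblyAI's docs before relying on it.
API_ENDPOINT = "https://api.assemblyai.com/v2/transcript"

def build_transcript_request(audio_url, speakers=False, moderation=False):
    """Build the JSON payload for a transcription job.

    `speaker_labels` and `content_safety` are the (assumed) feature flags
    for speaker identification and prohibited-content detection.
    """
    payload = {"audio_url": audio_url}
    if speakers:
        payload["speaker_labels"] = True
    if moderation:
        payload["content_safety"] = True
    return payload

def submit_transcript(api_key, audio_url, **features):
    """POST the job to the API and return the parsed JSON response."""
    body = json.dumps(build_transcript_request(audio_url, **features)).encode()
    req = urllib.request.Request(
        API_ENDPOINT,
        data=body,
        headers={"authorization": api_key, "content-type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The appeal described in the article is exactly this shape: one POST with a couple of boolean flags, rather than wiring separate transcription, diarization, and moderation providers together.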

Image: examples of code calling AssemblyAI's API.

Code it, call it done.

I was skeptical that a single small company could produce working tools for so many tasks, considering how complex those tasks are once you get into them. Founder and CEO Dylan Fox said the tech has come a long way in a short time.

"Over the last few years, there's been a rapid increase in accuracy in these models," Fox said. "We're pushing the state of the art because we're one of the few companies doing large-scale deep learning research. We're going to spend over a million dollars on compute and GPUs in the next few months."

It can be hard to grasp because it isn't as easy to demonstrate, but the advances in language models have been just as remarkable as those in computer vision and image generation. Fox pointed out that understanding and generating the written word is a different research domain from analyzing conversational, casual speech; the same advances in machine learning techniques have contributed to both, but they're apples and oranges.

The result is that it's now possible to run moderation or summarization on an audio clip, whether a few seconds or an hour long, simply by calling the API. If you expect a hundred thousand clips to be uploaded every hour, what's your first-pass process, and how long will it take to build?
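For throughput like that, a first pass can be little more than fanning submissions out over a thread pool, since each API call is I/O-bound. In this sketch, `submit_clip` is a hypothetical stand-in for whatever per-clip call your provider exposes; the names and return shape are illustrative, not from AssemblyAI's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

def submit_clip(clip_url):
    # Hypothetical stand-in: in a real pipeline this would POST the clip
    # to the speech-analysis API and return the job it created.
    return {"clip": clip_url, "status": "queued"}

def first_pass(clip_urls, workers=32):
    """Submit every clip concurrently.

    Because each submission just waits on the network, a modest thread
    pool keeps many requests in flight without extra infrastructure.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(submit_clip, clip_urls))
```

The point of the comparison in the article is that this loop is roughly the whole build, versus standing up your own models and serving stack.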

Fox likens it to adding a payment process: you could build one from scratch, or you could add Stripe in 15 minutes, and he hopes companies facing audio analysis will reach for AssemblyAI the same way. That positioning clearly separates it from the more complex, multi-service packages that define the audio analysis products of big providers like Microsoft and Amazon.

The Fox in question. Image credit: Jens Panduro.

The company has tripled revenue in the last year and now processes a million audio streams a day. There is a huge need and a huge market, and the spend from customers is there.

The $28M Series A was led by Accel, with participation from Y Combinator, John and Patrick Collison, Nat Friedman, and Daniel Gross. Over the next few months, the company is putting a million dollars into NVIDIA A100 GPUs to power its computation-intensive research and training processes. Better to rip that Band-Aid off early, the thinking goes, than stay stuck paying for cloud compute.

I suggested they might have a hard time recruiting, since they're competing for talent with the likes of Facebook and Google. Fox countered that the culture at those companies can be slow.

He said good researchers and engineers have a real desire to work on the bleeding edge.
