Highlights:

  • Deepgram’s voice agents offer human-like responses using speech recognition and voice synthesis.
  • Deepgram’s new Voice Agent API combines speech recognition, a large language model, and voice synthesis behind a single interface, making it easy for users to interact with AI using voice.

Deepgram Inc., the creator of a speech recognition engine delivered through APIs, recently unveiled a significant enhancement to its platform: an API for voice conversations between AI agents and humans.

Deepgram’s voice agent system leverages speech recognition and voice synthesis AI models to deliver human-like responsiveness. With this release, the company combines those models, along with a large language model, into a single API.

Users need only set up a prompt and specify their desired task; the system takes care of everything else. Previously, developers using Deepgram had to manually integrate the different components themselves, such as a large language model provider, the company’s speech-to-text recognition model, and a speech synthesis model.
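As a rough illustration of what that consolidation looks like in practice, the minimal sketch below opens a single agent session and sends one configuration message covering the prompt, the listening (speech-to-text) model, the reasoning (LLM) model, and the speaking voice. The endpoint URL, message fields, and model names are assumptions for illustration, not Deepgram’s documented schema.

```python
# Minimal sketch of a single-API voice agent session. The endpoint URL,
# message fields and model names below are illustrative assumptions, not
# Deepgram's documented Voice Agent schema; authentication is omitted.
import asyncio
import json

import websockets  # pip install websockets

AGENT_URL = "wss://agent.example.com/v1/agent"  # placeholder endpoint


async def run_agent_session() -> None:
    async with websockets.connect(AGENT_URL) as ws:
        # One settings message stands in for the separate speech-to-text,
        # LLM and text-to-speech wiring a developer previously assembled.
        await ws.send(json.dumps({
            "type": "Settings",                                       # assumed
            "agent": {
                "prompt": "You are a support agent. Be concise.",
                "listen": {"model": "speech-to-text-model"},          # assumed
                "think": {"provider": "open_ai", "model": "gpt-4o"},  # assumed
                "speak": {"voice": "voice-1"},                        # assumed
            },
        }))

        # From here the client streams microphone audio up and plays the
        # agent's synthesized replies back over the same socket.
        async for message in ws:
            if isinstance(message, bytes):
                pass  # binary frames: agent audio to play to the user
            else:
                print("event:", json.loads(message).get("type"))


if __name__ == "__main__":
    asyncio.run(run_agent_session())
```

The point of the sketch is the shape of the integration: one connection and one configuration payload, rather than three separately managed services stitched together in application code.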

“We have a big shift that’s happening in the world right now. AI went mainstream over the last two years and voice AI has gone mainstream over the last two to six months. There’s a fundamental shift around the nature of how work is going to be done,” Scott Stephenson, Co-founder and Chief Executive of Deepgram, said in an interview.

Deepgram’s system enables users to engage with AI-generated speech as if they were conversing with another human. It responds quickly and appropriately, waiting for the right moments to interject without disrupting the flow of conversation. The AI can be interrupted like a real person and maintains context throughout the dialogue, facilitating seamless interactions.
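On the client side, that kind of interruptibility usually comes down to discarding any agent audio that is queued but not yet played the moment the user starts speaking again. The sketch below is a minimal illustration of that idea; the "UserStartedSpeaking" event name and message handling are assumptions, not Deepgram’s documented protocol.

```python
# Illustrative barge-in handling: flush queued agent audio when the user
# interrupts. The "UserStartedSpeaking" event name is an assumption, not a
# documented Deepgram message type.
import asyncio
import json


async def handle_events(ws, playback_queue: asyncio.Queue) -> None:
    async for message in ws:
        if isinstance(message, bytes):
            await playback_queue.put(message)  # agent speech waiting to play
            continue
        event = json.loads(message)
        if event.get("type") == "UserStartedSpeaking":
            # The user interrupted: drop any audio that has not played yet,
            # so the agent stops talking instead of finishing its sentence.
            while not playback_queue.empty():
                playback_queue.get_nowait()
```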

Stephenson noted that voice interactivity is applicable wherever there is a device with a microphone and speaker, including websites, phones, mobile devices, AI pendants, and even drive-throughs. A prime example is the call center, where AI agents can answer calls immediately, minimizing wait times and promptly answering questions or resolving simple issues for customers.

“If you can service a customer’s need without having them to talk to a live agent that can save costs and that leads to a very satisfied customer. If they can call in and they’re instantly connected with an AI agent and that agent can immediately ask questions, get information and get the conversation going, essentially filling out CRM information so that when a live agent is available now, they’re contextualized. Now they can complete their job in one minute,” said Stephenson.

Developers can select any large language model (LLM) they wish to integrate with the API, including those from OpenAI, Anthropic PBC, and Meta Platforms Inc. This flexibility allows them to customize the underlying AI experience to their preferences. Additionally, Deepgram offers 12 different voice synthesis options for customers to choose from.
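To make that flexibility concrete, swapping the underlying LLM or voice could look something like the configuration below, where only the reasoning block changes while the speech recognition and synthesis settings stay put. The field names, provider identifiers, model names, and voice names are assumptions for illustration, not Deepgram’s documented options.

```python
# Hypothetical agent configurations showing provider and voice swaps; the
# field names and identifiers are illustrative, not Deepgram's documented
# schema.
BASE_AGENT = {
    "prompt": "You are a friendly drive-through order taker.",
    "listen": {"model": "speech-to-text-model"},
    "speak": {"voice": "voice-3"},  # one of the offered voices (assumed name)
}

# Swapping the "think" block switches the underlying LLM provider while the
# speech recognition and synthesis settings stay the same.
openai_agent = {**BASE_AGENT, "think": {"provider": "open_ai", "model": "gpt-4o"}}
anthropic_agent = {**BASE_AGENT, "think": {"provider": "anthropic", "model": "claude-3-5-sonnet"}}
meta_agent = {**BASE_AGENT, "think": {"provider": "meta", "model": "llama-3.1-70b"}}
```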

“As we watch our children use their smartphones, it’s obvious that voice-to-voice will become a standard method of human and machine interactions. Deepgram’s Voice Agent API addresses this market opportunity and makes customer service — already a top use case for gen AI — easier by converting text conversations to speech. Deepgram also broadens the market opportunity by integrating with a wide array of large language models,” said Kevin Petrie, Vice President of Research at BARC US.

This year has seen the launch of multiple LLMs capable of natural voice conversation. Notable examples include OpenAI’s GPT-4o, Google LLC’s Gemini Live, and Tenyx Voice from Tenyx Inc.

Stephenson explained that Deepgram isn’t limited to voice-to-voice interactions; it can also integrate text-to-voice, letting users maintain their privacy. For instance, someone on a crowded train might prefer texting and listening to the response through a headset. He noted that while not everyone will want to hold one-sided conversations with their devices, some people may enjoy lengthy discussions with AI models.

“The initial phase will be adding the voice option to text boxes. Once people realize you can have a human-like interruptible talking experience with a voice agent, we think that people will use it a lot,” Stephenson said.