While the tech giants are wrestling with their speech AI offerings, Deepgram seems to be delivering useful products for developers across a variety of applications. Following my last post about the company comes news of its speech-to-speech technology, which works without intermediate text.
Deepgram announced a significant technical achievement in speech-to-speech (STS) technology for enterprise use cases. The company has developed an STS model that operates without relying on text conversion at any stage, a pivotal step toward contextualized end-to-end speech AI systems. This milestone should enable fully natural, responsive voice interactions that preserve nuance, intonation, and emotional tone throughout real-time communication. When fully operationalized, the architecture will be delivered to customers as a simple upgrade from Deepgram's existing architecture. The company positions this technology, alongside its full-featured voice AI platform, as a way for businesses to deliver scalable voice AI solutions that evolve with the market.
Existing speech-to-speech (STS) systems are built on architectures that process speech through sequential stages: speech-to-text, text-to-text, and text-to-speech. These architectures have become the standard for production deployments because of their modularity and maturity, but eliminating text as an intermediary offers opportunities to reduce latency and better preserve emotional and contextual nuance.
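To make the cascade concrete, here is a minimal sketch of the conventional three-stage pipeline. The component classes are hypothetical stand-ins, not Deepgram's or any vendor's API; the point is where text appears as an intermediary and how each stage's latency accumulates.

```python
# Minimal sketch of a cascaded STS pipeline (hypothetical stand-in classes).
import time

class SpeechToText:
    def transcribe(self, audio: bytes) -> str:
        # A real system runs an ASR model here; prosody and intonation
        # in `audio` are discarded once the transcript is produced.
        return "what's the status of my order"

class TextLLM:
    def respond(self, text: str) -> str:
        # The LLM sees only the transcript, so emotional tone is lost.
        return "Your order shipped this morning."

class TextToSpeech:
    def synthesize(self, text: str) -> bytes:
        # Prosody must be reinvented from text alone.
        return b"\x00\x01"  # placeholder waveform bytes

def cascaded_sts(audio: bytes) -> bytes:
    """Speech -> text -> text -> speech: three sequential stages."""
    stt, llm, tts = SpeechToText(), TextLLM(), TextToSpeech()
    t0 = time.perf_counter()
    transcript = stt.transcribe(audio)        # stage 1: lossy text bottleneck
    reply_text = llm.respond(transcript)      # stage 2: text-to-text
    reply_audio = tts.synthesize(reply_text)  # stage 3: text back to speech
    print(f"end-to-end latency: {time.perf_counter() - t0:.4f}s")
    return reply_audio

if __name__ == "__main__":
    cascaded_sts(b"...input waveform...")
```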
Meanwhile, multimodal LLMs such as Gemini, GPT-4o, and Llama have evolved beyond text-only capabilities to accept additional inputs such as images, video, and audio. Despite these advancements, they still struggle to capture the fluidity and nuance of human-like conversation: they rely on a turn-based framework in which audio input is tokenized and processed in the textual domain, restricting real-time interactivity and expressiveness.
To advance the frontier of speech AI, Deepgram is setting the stage for end-to-end STS models, which offer a more direct approach by converting speech to speech without relying on text. Recent research on speech-to-speech models, such as Hertz and Moshi, has highlighted the significant challenges in developing models that are robust and reliable enough for enterprise use cases. These difficulties stem from the inherent complexities of modeling conversational speech and the substantial computational resources required. Overcoming these hurdles demands innovations in data collection, model architecture, and training methodologies.
Deepgram is transforming speech-to-speech modeling with a new architecture that fuses the latent spaces of specialized components, eliminating the need for text conversion between them. By embedding speech directly into a latent space, Deepgram preserves important characteristics such as intonation, pacing, and situational and emotional context throughout the entire processing pipeline. What sets Deepgram apart is its approach to fusing the hidden states (the internal representations that capture meaning, context, and structure) of each individual function: Speech-to-Text (STT), Large Language Model (LLM), and Text-to-Speech (TTS). This fusion is the first step toward training a single, controllable, truly end-to-end speech model, enabling seamless processing while retaining the strengths of each best-in-class component. The breakthrough has significant implications for enterprise applications, facilitating more natural conversations while maintaining the control and reliability businesses require.
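Deepgram has not published the details of this architecture, so the following is only an illustrative sketch of the general idea of latent-space fusion: learned adapter projections replace the text hand-off between an STT encoder, an LLM core, and a TTS decoder, so continuous hidden states (which can carry prosody) flow end to end. All module names, dimensions, and layer choices below are assumptions.

```python
# Illustrative latent-fusion sketch (assumed design, not Deepgram's actual model).
import torch
import torch.nn as nn

class FusedSTS(nn.Module):
    def __init__(self, stt_dim=512, llm_dim=1024, tts_dim=512):
        super().__init__()
        # Hypothetical components, represented abstractly as sequence models.
        self.stt_encoder = nn.GRU(80, stt_dim, batch_first=True)   # mel -> STT latents
        self.stt_to_llm = nn.Linear(stt_dim, llm_dim)              # fusion adapter 1
        self.llm_core = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)
        self.llm_to_tts = nn.Linear(llm_dim, tts_dim)              # fusion adapter 2
        self.tts_decoder = nn.GRU(tts_dim, tts_dim, batch_first=True)
        self.vocoder_head = nn.Linear(tts_dim, 80)                 # latents -> mel frames

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # No text tokens anywhere: every hand-off is a continuous latent.
        stt_hidden, _ = self.stt_encoder(mel)    # speech-understanding latents
        llm_input = self.stt_to_llm(stt_hidden)  # project into the LLM latent space
        llm_hidden = self.llm_core(llm_input)    # "reasoning" over fused latents
        tts_input = self.llm_to_tts(llm_hidden)  # project into the TTS latent space
        tts_hidden, _ = self.tts_decoder(tts_input)
        return self.vocoder_head(tts_hidden)     # output mel, ready for a vocoder

if __name__ == "__main__":
    model = FusedSTS()
    mel_in = torch.randn(1, 200, 80)   # (batch, frames, mel bins)
    print(model(mel_in).shape)         # torch.Size([1, 200, 80])
```

The design choice worth noting is that the adapters (`stt_to_llm`, `llm_to_tts`) are the only new trainable glue; in principle each surrounding component can remain a best-in-class model, which matches the "retaining the strengths of each component" framing above.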
One requirement of enterprise speech-to-speech modeling is the ability to understand and troubleshoot each step of the process. This is particularly challenging when no text conversion occurs between steps, since there is no straightforward way to verify either the accuracy of the initial perception or the alignment of the spoken output with the intended response. Deepgram recognized this need and addressed it by designing the new architecture for debuggability throughout the entire process.
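One plausible way to achieve that kind of debuggability (again a sketch under assumptions, not Deepgram's disclosed design) is to attach auxiliary "probe" heads that decode intermediate latents back into human-readable tokens for inspection, while the main speech-to-speech path never depends on them.

```python
# Hypothetical debug probes for a text-free pipeline (assumed design).
import torch
import torch.nn as nn

VOCAB = ["<pad>", "hello", "order", "shipped", "status", "thanks"]

class TextProbe(nn.Module):
    """Auxiliary head: decodes a latent sequence to tokens, for debugging only."""
    def __init__(self, latent_dim: int, vocab_size: int = len(VOCAB)):
        super().__init__()
        self.classifier = nn.Linear(latent_dim, vocab_size)

    @torch.no_grad()
    def peek(self, latents: torch.Tensor) -> list[str]:
        # Greedy per-frame decode; crude, but enough to check what the
        # model "heard" or "intends to say" at this hand-off point.
        token_ids = self.classifier(latents).argmax(dim=-1)
        return [VOCAB[i] for i in token_ids[0].tolist()]

# Usage: probe the latents at each hand-off of a fused model (e.g. the
# FusedSTS sketch above) to verify perception and intent independently.
stt_probe = TextProbe(latent_dim=512)
latents = torch.randn(1, 5, 512)  # stand-in for STT hidden states
print("perceived (debug view):", stt_probe.peek(latents))
```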