STS vs. STT + LLM + TTS: The Future of AI Voice Agents

STS vs. STT + LLM + TTS: The Future of AI Voice Agents

STS vs STT + LLM + TTS: the future of AI voice agents
STS vs STT + LLM + TTS: the future of AI voice agents

AI voice agents are evolving very fast. Most platforms still use a classic architecture based on STT → LLM → TTS, but the new speech-to-speech models are completely changing the way real-time conversations work.

At Diga, we have rebuilt our entire voice infrastructure to transition to a speech-to-speech architecture. In this article, we explain how AI voice agents work, the differences between STT/TTS and speech-to-speech, the impact on latency and naturalness, and why we believe this will be the standard for agencies and voice automations.

Old pipeline and new pipeline

Before diving into the challenges and why we made this decision, let's look at how our system worked before and what has changed.

STT, LLM, and TTS

Most AI voice agents today are made up of three models in a chain:

  • Speech to Text (STT) in voice AI systems receives the user's audio and converts it into text.

  • Large Language Model (LLM), the brain of AI conversational agents, receives the text, decides how to respond, and what actions to execute.

  • Text to Speech (TTS) in AI call automation takes the response from the LLM and converts it into audio.

All these components together form part of a system that allows building AI voice agents capable of holding conversations in real time in a way similar to how a person would.

Additionally, between the main models, there are other smaller pieces that complement the system: RAG (to answer based on documentation), date detectors, guardrails (to prevent the agent from saying something it shouldn't), among others. All these intermediate pieces will be important when we talk about the shift to speech to speech, because many of them stop working just as we know them.

STS

Now that we have seen the "traditional" way voice agents work, let's look at what Speech to Speech (STS) models consist of for AI voice agents.

Unlike what happened with what we explained earlier, in this case, there is a single piece, a single model that is responsible for handling the conversation. Speech to speech models are capable of directly receiving the person's audio and generating an audio response, without going through different stages.

It is like taking a language model and instead of teaching it to understand text and return text, teaching it to understand audio and generate audio. And not just audio: we can also add images and other inputs. That is where the word multimodality comes from, moving from one modality (text) to several (text, audio, image...).

Advantages of Speech-to-Speech models in AI voice agents

I mentioned earlier that the traditional pipeline allows having conversations in a way "similar" to how a human would. Similar because in each transition between models, information is lost.

When STT converts audio to text, it keeps only the words. But when we speak, tone matters as much as what we say. An "okay" said with enthusiasm is not the same as an "okay" said with resignation. The text is the same, but the meaning is completely different. All that information is lost before reaching the LLM, which receives impoverished text.

With TTS, something similar happens but in reverse. The LLM generates a text response and passes it to the voice model to convert it to audio. But the TTS knows nothing about the conversation: it doesn't know if the user is frustrated, if they are asking something urgent, or if they are joking. It just receives text and reads it with a fixed voice and style. As humans, we adapt our tone according to the situation; the TTS cannot do that.

In every step, something is lost. There are advancements (some LLMs already generate emotion tags that the TTS can interpret), but as long as the process is converting speech to text and then text to speech, we are always going to lose information along the way.

Speech to speech models eliminate those leaps. A single model receives the audio with everything: words, tone, pauses, background noise. And it generates an audio response with all that context. It listens, thinks, and responds just like humans do.

Furthermore, there is a second advantage: latency. In the traditional pipeline, each model needs its own processing time, and the three run in series. With a speech to speech model, a single inference replaces three, which significantly reduces the time between when the user finishes speaking and the agent responds. In voice agents, latency is one of the key factors of that "uncanny valley" we are constantly trying to overcome. An extra 500-600 ms can cut call success rates by more than half.

What problems arise when migrating an AI voice agent to Speech-to-Speech

Switching to speech to speech is not just about replacing one model with another. There are things we gain, but also things we lose.

We lose modularity

In the traditional pipeline, if we didn't like a model, we swapped it for another. We could choose more premium voices for one use case, faster models for another, try different STT or TTS providers depending on our needs. With speech to speech, everything depends on a single model, and therefore we lose flexibility.

Intermediate pieces stop working as-is

In the previous section, I mentioned that there are other smaller pieces between the main models. Many of them depended on having a text transcription available before reaching the LLM. With speech to speech, that transcription no longer exists at the same point in the process, forcing us to rethink how each one works.

RAG, for example, previously used the user's transcription to search for relevant info and pass it to the LLM (here we discuss in detail how our RAG worked in voice agents). Without prior transcription, we had to change the approach: now we use a function-based RAG, where the model itself decides when it needs to look up information once it receives the user's question.

Something similar happens with guardrailing. Guardrailing models are designed to analyze text and detect forbidden topics or sensitive information. Applying this in real-time over audio remains an open challenge.

And there is more: evaluations, redactng personal data, date detection, fallbacks when a provider is saturated...

But the most interesting case is the turn detector. This component is what tells the agent when to speak. In the traditional pipeline, a mini LLM analyzes the transcription and context to determine if the user has finished speaking. Without a transcription, we need models that work directly on the audio.

And interestingly, this brings us closer to how humans function. We do not decide when to speak based solely on what the other person says, but on how they say it: tone, intonation, pauses. An audio-based turn detector can catch those cues in a way a text-based one never could. In the next post, we will look at how we resolved this.

Speech to speech brings real challenges. It changes what we were used to doing and requires technology that is still in its early stages.

Why we are migrating to speech-to-speech in AI voice agents now

Speech-to-speech models for voice agents had already appeared before. OpenAI's have been available for months. But Google's model, Gemini 3.1 Flash Live, is the first where three things come together that previously did not match: naturalness, capability, and price.

In naturalness, the model sounds good. It is not robotic, although it still lacks the voice customization offered by providers like ElevenLabs. In capability, before making the switch, we evaluated the model in depth using this eval. The result: at its minimum thinking level, it already outperforms the GPT Realtime model, and at the medium level, it matches GPT Realtime 2. And in price, it is between 7 and 8 times cheaper.

This intersection of factors is what made us take the leap. There are still parts that are not fully mature and that require work in the field of voice models. But that is why we exist as a platform: so you can focus on building the best agents while we make sure the technology underneath is always up to date.

The future of voice agents

There is still a way to go before conversations with an agent feel like talking to a person. There are things we continue to improve. For example, when you talk to someone, the person listening does not stay completely silent until you finish. They nod, say "yes, yes," interrupt when they have something relevant to add, and know when it's their turn. A natural conversation is not a succession of turns. It is a fluid exchange where both parties participate at the same time.

This is what full-duplex models are about: they are capable of listening and speaking at the same time. A few weeks ago, Thinking Machines (the lab of OpenAI's former CTO) launched its Interaction Model, which addresses exactly this. Nvidia has PersonaPlex, and before that was Moshi by Kyutai. These models sound very natural, but their intelligence level is not yet sufficient to execute tasks and run in production.

These are many changes and it was not an easy decision. It implies rethinking our architecture and working with technology for which frameworks are not yet prepared. But we are convinced that this is the path for the next generation of AI voice agent platforms for agencies, and we prefer to lead these changes rather than follow them.

Subscribe to Diga's newsletter

Receive our newsletter with real insights, practical strategies, and updates about voice agents.

Subscribe to Diga's newsletter

Receive our newsletter with real insights, practical strategies, and updates about voice agents.

Subscribe to Diga's newsletter

Receive our newsletter with real insights, practical strategies, and updates about voice agents.