RAG in voice agents: discard it or implement it

For a long time, I was a big hater of RAG in voice agents. Systems that need a near real-time response and RAG rarely get along. That's why, when we considered building a RAG system into our platform some time ago, I clearly said no. In the end we did implement it, and in this article I want to talk about RAG systems, latency limitations in voice AI, and the approach we finally took to build a RAG system that is actually useful.

What is RAG

First of all, RAG (Retrieval-Augmented Generation) is a system that consists of retrieving information from external sources (PDFs, websites, and generally anything that contains text) and passing it as context to an LLM so that it generates answers about a specific domain on which it has not been trained. Instead of depending only on what the model already knows, we give it the information it needs at any given moment. If you want to dive deeper, I recommend this video where they explain it in detail.

My first contact with RAG was two years ago, in a DeepLearning.ai course that taught how to set it up with LlamaIndex. Since then, the world of RAG has changed a lot. New techniques have come out that make it work much better. In fact, if you go on LinkedIn, you'll find a new one every day ;). However, when it comes to voice AI, all these techniques fall apart due to several problems.

The Problems of RAG in Voice AI

When we talk through a chat, everything is simpler. The newer RAG techniques work there because we can chain extra steps and tricks and make the user wait while a response is prepared. The user doesn't mind waiting another 2-3 seconds if that gets them the correct answer.

However, with voice, things change. In real life, when you talk to a person, you expect them to respond instantly. It would be very weird if in every interaction they stood there thinking for a few seconds before answering you, and in voice AI that is exactly what we have to try to mimic. Every 100ms counts (literally) and we can't afford to spend a while looking up information.

The problem is that RAG systems in 2025 are converging towards agentic architectures with self-reflection. Agentic RAG, GraphRAG and similar systems work very well, but their latency is too high for voice. That leaves us with the more traditional RAG approach: based on the user's question, we get some text fragments and pass them to the LLM.

But there is another problem that is not usually mentioned: transcriptions. Today, voice AI systems go through a three-step flow (we'll talk about multimodal models another day):

STT (speech to text) → LLM (generates response) → TTS (text to speech)

During the STT step, what the user said is often not captured correctly. For example, "my shoulder hurts" is not the same as "my shoulder herds". An LLM would probably understand from context that the user meant "hurts", but when we use that transcription as is to search in the RAG, "my shoulder herds" returns completely different results. The LLM knows how to interpret transcription errors; vector search does not.


What Approach We Have Followed

When we first discussed whether to implement RAG in Diga, I was against it. Today I have a somewhat different view. After analyzing the problem, we have reached a solution that addresses some of these problems, even if not all of them.

The Processing

We often talk about RAG as something that only happens when the user asks a question, but one of the most important parts happens before that: how we process the documents.

Every document is its own world: some have tables, others have separate sections that should go together, others are websites with a completely different format... We can't just cut the text into blocks of 500 words and expect it to work. It's like cutting a book in half on each page: you would constantly lose the thread. Our system understands the structure of the document (titles, sections, tables) and cuts respecting it, so that each fragment makes sense on its own.
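To make the idea concrete, here is a minimal sketch of structure-aware chunking. It is an illustration under assumptions, not the implementation described in this post: it assumes the document has already been converted to Markdown-style text in which headings start with "#", and the names `Chunk`, `chunk_by_headings` and `max_words` are hypothetical. It splits at headings first and only splits by size inside a section, always at paragraph boundaries.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    heading_path: str  # e.g. "Pricing > Partner program"
    text: str


def chunk_by_headings(markdown_text: str, max_words: int = 500) -> list[Chunk]:
    """Split at headings first; inside a section, split only at paragraph breaks."""
    chunks: list[Chunk] = []
    path: list[str] = []        # stack of headings leading to the current section
    paragraphs: list[str] = []  # paragraphs collected for the current section

    def flush() -> None:
        # Pack whole paragraphs into chunks of roughly max_words words.
        # A single paragraph longer than max_words is kept whole.
        current: list[str] = []
        count = 0
        for para in paragraphs:
            words = len(para.split())
            if current and count + words > max_words:
                chunks.append(Chunk(" > ".join(path), "\n\n".join(current)))
                current, count = [], 0
            current.append(para)
            count += words
        if current:
            chunks.append(Chunk(" > ".join(path), "\n\n".join(current)))
        paragraphs.clear()

    for block in markdown_text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            flush()  # close the previous section before the heading path changes
            first_line, _, rest = block.partition("\n")
            level = len(first_line) - len(first_line.lstrip("#"))
            del path[level - 1:]
            path.append(first_line.lstrip("#").strip())
            if rest.strip():
                paragraphs.append(rest.strip())
        else:
            paragraphs.append(block)
    flush()
    return chunks
```

Keeping the heading path next to each chunk means a fragment such as "Partners get a discount." still carries the information that it belongs to "Pricing > Partner program", which is what lets it make sense on its own.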

With websites, things get more complicated. Nowadays, many pages render content with JavaScript, so a simple "download the HTML" doesn't work. We need a real browser that renders the page, scrolls, and even clicks on the accordions to extract all the content.

The Rewriter

As I mentioned before, the text we get from the STT often cannot be used directly for searching. Sometimes the user says "hurts" and the STT understands "herds", other times they say "tell me more about that" and we need to know what "that" is.

This is where the rewriter comes in. It is a small model that takes the last messages of the conversation and generates an improved query. If they have been talking about a "partner program" (by the way, we still have spots left in ours, message us at contact@diga.io if you're interested!!) and the user asks "and what are the prices for that?", the rewriter generates something like "prices for the partner program". A contextualized query that also avoids problems like the "herds" one, since it knows what is being talked about.

Moreover, if the user simply says "hello" or "thank you", the rewriter detects it and saves us an unnecessary search. All of this runs on a very small model, so it adds no more than 150-200ms of latency.
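A rough sketch of what such a rewriter can look like, under stated assumptions: the prompt wording, the `SKIP` sentinel, the `last_n` window and the `complete` callable (a stand-in for whatever small model is used) are all hypothetical and not taken from our actual setup.

```python
from typing import Callable, Optional

SKIP = "SKIP"  # hypothetical sentinel the model returns when no search is needed

PROMPT = (
    "Rewrite the user's last message as a standalone search query.\n"
    "Resolve references such as 'that' using the conversation and fix likely "
    "speech-to-text errors.\n"
    "If the message is only a greeting or small talk, answer exactly SKIP.\n\n"
    "Conversation:\n{history}\n\nSearch query:"
)


def rewrite_query(
    messages: list[dict],
    complete: Callable[[str], str],  # assumed: takes a prompt, returns the model's text
    last_n: int = 4,
) -> Optional[str]:
    """Return a contextualized query, or None when the search can be skipped."""
    history = "\n".join(f"{m['role']}: {m['content']}" for m in messages[-last_n:])
    answer = complete(PROMPT.format(history=history)).strip()
    if not answer or answer.upper() == SKIP:
        return None
    return answer
```

Passing the model in as a plain callable keeps the sketch independent of any provider, and returning `None` gives the caller a cheap way to skip retrieval entirely for "hello" or "thank you".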

The Search

The search combines two approaches that run in parallel: a semantic search (which understands the meaning) and a lexical search (which looks for exact word matches). It's like having two people searching: one who understands what the question is about and another who looks for the exact words. Between the two, you cover many more cases.
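One way to sketch this in code is shown here. The way the two result lists are merged is not specified above, so this example swaps in reciprocal rank fusion (RRF) as an assumption; the `Search` type, `hybrid_search`, and the two search callables are hypothetical stand-ins for a vector index and a keyword index.

```python
import asyncio
from typing import Awaitable, Callable

# Assumed interface: (query, limit) -> chunk ids ordered from best to worst.
Search = Callable[[str, int], Awaitable[list[str]]]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each list contributes 1 / (k + rank) per chunk id."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


async def hybrid_search(
    query: str, semantic: Search, lexical: Search, top_k: int = 10
) -> list[str]:
    # Run both searches concurrently so the slower one sets the latency,
    # instead of paying for both one after the other.
    sem, lex = await asyncio.gather(semantic(query, top_k), lexical(query, top_k))
    return reciprocal_rank_fusion([sem, lex])[:top_k]
```

RRF only looks at positions, not raw scores, which avoids having to put vector similarities and keyword scores on a common scale. A chunk found by both searches naturally rises to the top.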

From there we get 10 candidates which then go through a reranker, a second model that scores each fragment in relation to the query, and we keep the 5 most relevant ones.
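The reranking step can be sketched as follows. The `score` callable stands in for the reranker model, and `overlap_score` is a toy scorer included only so the example runs on its own; both names are hypothetical and a real reranker would be a learned model.

```python
from typing import Callable


def rerank(
    query: str,
    candidates: list[str],
    score: Callable[[str, str], float],  # assumed: (query, fragment) -> relevance
    keep: int = 5,
) -> list[str]:
    """Score every candidate against the query and keep the best `keep`."""
    scored = [(score(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]


def overlap_score(query: str, text: str) -> float:
    """Toy scorer for the example only: fraction of query words found in the text."""
    query_words = set(query.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & set(text.lower().split())) / len(query_words)
```

With 10 candidates in and 5 out, the reranker only has to score a handful of fragments per turn, which keeps this extra model call compatible with a voice latency budget.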

The Limits of RAG (and Ours)

Our system is not infallible; there are still cases where it does not behave as we expect. But we believe it significantly improves upon what is done today, with one basic condition: that the conversation still feels natural.

That being said, there is a question I ask myself very often: are we sure this should be a RAG?

Sometimes we try to force things into RAG that make no sense there. One case that comes to mind is looking up product references. Imagine a file with thousands of lines like these:

Reference    Description                 Price
B12719HLS    Precision screw | 18.5mm    €0.12
A83421TRX    Lock washer | 12.0mm        €0.08

We upload it and expect the RAG to return the correct result instantly. But think about it: if you asked a person for this, the first thing they would do is ask you to repeat each letter and then look it up in a database. If a person looks it up in a database, why not ask our agent to do the same through a function call?
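As a minimal sketch of that alternative: the in-memory `CATALOG` below stands in for a real database, and the tool name, the schema layout and `normalize_reference` are assumptions for illustration (the exact tool format depends on the LLM provider).

```python
import re

# Stand-in for a real database table, using the two rows from the example above.
CATALOG = {
    "B12719HLS": {"description": "Precision screw | 18.5mm", "price_eur": 0.12},
    "A83421TRX": {"description": "Lock washer | 12.0mm", "price_eur": 0.08},
}

# JSON-schema style tool description; hypothetical, adapt to your provider's format.
LOOKUP_TOOL = {
    "name": "lookup_product",
    "description": "Look up a product by its exact reference code.",
    "parameters": {
        "type": "object",
        "properties": {"reference": {"type": "string"}},
        "required": ["reference"],
    },
}


def normalize_reference(spoken: str) -> str:
    """Drop spaces and punctuation and uppercase, e.g. 'b 12719 h-l-s' -> 'B12719HLS'."""
    return re.sub(r"[^A-Za-z0-9]", "", spoken).upper()


def lookup_product(reference: str) -> dict:
    """Exact-match lookup the agent can call once the user has spelled the code."""
    ref = normalize_reference(reference)
    product = CATALOG.get(ref)
    if product is None:
        # Let the agent ask the user to repeat the code instead of guessing.
        return {"found": False, "reference": ref}
    return {"found": True, "reference": ref, **product}
```

An exact lookup either finds the reference or clearly fails, so the agent can ask the user to spell the code again. A similarity search, by contrast, would happily return the nearest wrong product.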

At the other extreme, sometimes RAG is overkill for what you need. If you want the agent to know a restaurant's menu or some basic ground rules of behavior (guardrails), it often makes more sense to put that directly in the model's prompt.

In general, I think we have to recognize that there are places where RAG is not the best option. Maybe the answer is not "RAG yes or RAG no", but rather knowing how to distinguish what kind of search each question needs. In fact, that is a job we must take on ourselves: our goal is for the platform to automatically classify the different contents and decide where each one should go, so that you only have to worry about uploading your information. Until then, you, as integrators, are the ones who must decide where each piece of information should go.

So even though at the beginning I said no to implementing RAG in voice AI, I have ended up writing a post about how we have implemented it. Things change when you sit down to solve problems instead of dismissing them.

This is the first entry of what we hope will be many, where we will share what we run into day-to-day and some of the debates we have at Diga.

Subscribe to Diga's newsletter

Receive our newsletter with real insights, practical strategies, and updates about voice agents.
