The ELI5 Guide to Retrieval Augmented Generation

Discover the inner workings of Retrieval Augmented Generation (RAG) and how it enhances language model responses by dynamically sourcing information from external databases.

Blessin Varkey
December 1, 2023
November 16, 2023


Large language models (LLMs) like GPT, Llama, and others have revolutionized how we interact with technology, providing sophisticated answers to our questions. Companies worldwide are integrating these advanced models into their workflows to enhance operations.

But these models aren't perfect. They can sometimes give wrong answers, miss crucial details, or lose the context.

That's where Retrieval Augmented Generation, or RAG, becomes essential.

RAG is a technique that enriches LLMs with more accurate and context-aware information.

In this guide, we'll explore how RAG enhances LLMs and why it's important for providing reliable responses when using LLMs in business or specialized areas.


  • What is Retrieval Augmented Generation?
  • How RAG Works
  • Advantages and Challenges of RAG
  • RAG in Real-World Use

What is Retrieval Augmented Generation?

Retrieval Augmented Generation, or RAG, is an enhancement to the way large language models process and generate text. First, let's talk about the foundation of these language models: the Transformer architecture introduced in 2017 by Vaswani and colleagues at Google.

Transformers use a 'self-attention' mechanism that captures context by weighing the relationships between all words in a sequence at once, rather than reading one word at a time.

Take the word "crane" in different contexts:

  • The crane lifted heavy cargo.
  • The crane spread its wings.

The Transformer distinguishes between "crane" as a lifting machine and "crane" as a bird in different sentences. Earlier models could not make this distinction well, as they read words in order and failed to see the full context.
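To make the idea of relating every word to every other word more concrete, here is a minimal sketch of self-attention in plain Python. It rests on simplifying assumptions: the token vectors are made up for illustration, and the learned query, key, and value projections that a real Transformer applies are left out, so each token vector is used directly.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Simplified self-attention: no learned query/key/value projections."""
    dim = len(vectors[0])
    weights = []
    for query in vectors:
        # Score this token against every token in the sequence, itself included.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights.append(softmax(scores))
    # Each output vector is a weighted mix of all token vectors.
    outputs = [
        [sum(w * vec[d] for w, vec in zip(row, vectors)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights

tokens = ["the", "crane", "spread", "its", "wings"]
# Made-up 3-dimensional vectors, for illustration only.
vectors = [
    [0.1, 0.0, 0.2],
    [0.9, 0.8, 0.1],
    [0.2, 0.9, 0.3],
    [0.1, 0.1, 0.1],
    [0.8, 0.9, 0.2],
]
outputs, weights = self_attention(vectors)
for token, weight in zip(tokens, weights[1]):
    print(f"crane -> {token}: {weight:.2f}")
```

Each row of `weights` says how strongly one token draws on every other token, so the representation of "crane" ends up shaped by the words around it.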

Today, Large Language Models (LLMs) built on this architecture are used in many fields. They help manage medical records, assist in drug discovery, detect financial fraud, and analyze sentiment in financial news. Their adaptability and performance are valuable across various industries.

But LLMs have limitations. They are pre-trained on a fixed dataset and cannot update their knowledge afterward, which can lead to outdated or incorrect responses and sometimes even fabricated information, often called “hallucinations.”

This is where RAG comes in.

It combines the language understanding of LLMs with an external information retrieval system.

This means the model can access the most current information, like referencing the latest documents or data to inform its responses.

Imagine a student taking a test with the ability to look up answers in a textbook or online, rather than relying solely on memory.

RAG operates similarly, resulting in more accurate, up-to-date, and relevant outputs. This technology reduces errors and improves overall performance, making LLMs even more effective.

How does Retrieval-Augmented Generation work?

Retrieval-Augmented Generation (RAG) is an approach that improves knowledge-intensive natural language processing tasks by combining two distinct models: a retriever and a generator.

The 2020 paper by Lewis et al., titled "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks," popularized this concept.

It built on an earlier paper by Guu et al., which introduced the concept of integrating knowledge retrieval during a model's pre-training stage.

Source: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Let's break down how these two models work together:

Retriever vs Generator

The Retriever Model

This part of RAG is designed to pinpoint relevant information within a vast dataset. Using advanced techniques known as dense retrieval, the retriever creates numerical representations—called embeddings—of both queries and documents. It places similar queries and documents near each other in a high-dimensional space.

When a query comes in, the model uses semantic search methods, like cosine similarity, to identify and deliver the most contextually fitting documents. The strength of this model lies in its precision; it excels at quickly finding the exact information required from a large pool of data.

For illustration, consider an image that shows the word embeddings for terms like "king," "queen," "man," and "woman" in a three-dimensional space, demonstrating how semantically related terms cluster together.

The image shows the word embeddings for “king”, “queen”, “man”, and “woman” in a 3-D space. Source.

The Generator Model

After the retriever finds the relevant data, the generator takes over. This component crafts coherent and contextually aligned responses. Typically built on transformer architectures, the generator uses the provided context to create responses that are not only grammatically correct but also grounded in the retrieved facts.

The generator's forte is in generating completely new content, which is particularly useful for creative tasks or in developing conversational agents like chatbots.

The retriever model serves as the RAG system's semantic search engine, sourcing documents that semantically align with the query.

This synergy between the retriever and generator makes RAG particularly powerful for producing quality responses informed by large amounts of data.

Now, before we move on to explaining the RAG architecture, let’s take a quick look at semantic search.

Semantic Search

When managing a website or e-commerce site with a vast array of content or products, standard keyword searches may fall short.

They rely on matching specific words in a query, often leading to results that miss the context or intent behind the search. Semantic search improves upon this by grasping the query's meaning, fetching content that is relevant in meaning, not just in word match.

Take this scenario:

You type "Entry-level positions in the renewable energy sector" into a search bar. A basic keyword search might display pages containing "entry-level," "positions," "renewable," "energy," and "sector." But this doesn't mean you'll find job listings. 

Instead, you might see educational articles or industry news.

With semantic search, however, the system understands you're seeking entry-level job openings in the renewable energy field.

It then presents specific job listings such as "Junior Solar Panel Installer" or "Wind Turbine Technician Trainee," directly answering your search intent. Semantic search connects the dots between words and meanings, providing you with results that matter.
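The limitation of keyword matching can be seen in a small sketch. The scoring function below is a hypothetical, deliberately naive keyword matcher written for illustration, and the news headline is an invented example. The two job titles come from the scenario above.

```python
import re

def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def keyword_score(query: str, document: str) -> int:
    """Count how many distinct query words also appear in the document."""
    return len(tokenize(query) & tokenize(document))

query = "Entry-level positions in the renewable energy sector"
documents = [
    "News: the renewable energy sector keeps growing",  # invented article title
    "Junior Solar Panel Installer",                     # job listing from the text
    "Wind Turbine Technician Trainee",                  # job listing from the text
]

ranked = sorted(documents, key=lambda d: keyword_score(query, d), reverse=True)
for doc in ranked:
    print(keyword_score(query, doc), doc)
# 4 News: the renewable energy sector keeps growing
# 0 Junior Solar Panel Installer
# 0 Wind Turbine Technician Trainee
```

The job listings share no words with the query, so a pure keyword match ranks them last even though they are what the user wants. A semantic search system compares meanings instead, which is what lets it surface them.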

Cosine Similarity

Cosine similarity is a metric that evaluates how similar two documents are, regardless of their size.

This method calculates the cosine of the angle between two non-zero vectors in a multi-dimensional space—these vectors represent the text content.

It's a tool that proves highly effective in semantic searches, which focus on finding material that shares meaning with the query, not just identical keywords.
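As a minimal sketch, cosine similarity is the dot product of two vectors divided by the product of their lengths. The three vectors below are made-up stand-ins for document embeddings, chosen to show that the measure depends on direction rather than magnitude.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]  # same direction, ten times the magnitude
other_doc = [0.0, 0.0, 3.0]   # points in an unrelated direction

print(cosine_similarity(short_doc, long_doc))   # very close to 1.0
print(cosine_similarity(short_doc, other_doc))  # 0.0
```

The first pair scores as nearly identical despite the difference in magnitude, which is why the metric works regardless of document size.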

Visual representation of cosine similarity. The image is a 3D plot containing four documents: one about McDonald's, one about Popeyes, one that mixes both topics, and one about the Mona Lisa. Cosine similarity measures how related the documents are based on the angle between their vectors: the smaller the angle, the more similar the documents. The document about the Mona Lisa is plotted far from the other three, showing that it covers a different topic altogether. Source

Retrieval Augmented Generation Architecture Explained

Setting up a RAG system involves configuring two main components: the retriever and the generator. They work in sequence: the retriever identifies relevant documents for a query, and the generator uses them to craft a precise answer.

Document Database Preparation: Initially, a vector database is established to house articles. Long articles are divided into manageable sections because language models have processing limits. These sections are converted into vectors, or numerical representations, and stored for fast retrieval.
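The splitting step might look like the sketch below. The function name, the word-based size limit, and the overlap between neighboring sections are all assumptions made for illustration. Real systems usually count model tokens rather than words and pick sizes to suit their embedding model.

```python
def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based sections that overlap slightly.

    The overlap (an assumption of this sketch) keeps a sentence that
    straddles a boundary visible in both neighboring sections.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this section already reaches the end of the text
    return chunks

# A stand-in "article" of 120 placeholder words.
article = " ".join(f"word{i}" for i in range(120))
chunks = chunk_text(article, max_words=50, overlap=10)
print(len(chunks))  # 3
```

Each section would then be passed to an embedding model, and the resulting vector stored alongside the section text.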

Generalized representation of a simple RAG system. Source: Author.
Step 1: A user submits a query or a prompt, which is the input for the system.
Step 2: The query is transformed into a numerical vector using an embedding model. This vector is a condensed numerical representation of the query.
Step 3: Using this vector, the system runs a retrieval algorithm against a vector database (a type of database optimized for high-speed vector calculations and retrieval). The vector database quickly scans through extensive datasets to locate documents or text snippets that match the query's vector.
Step 4: The documents or snippets identified by the retrieval algorithm are those most relevant to the query, containing information that potentially answers the user's question.
Step 5: These retrieved documents are encoded into vectors using the same embedding model. The result is a collection of vectors that capture the context of the retrieved information.
Step 6: The system combines the vectors of the original query and the retrieved documents.
Step 7: With the context established, a foundation model, such as GPT-3 or GPT-4, synthesizes a coherent and contextually informed response to the query.
Step 8: Finally, the system delivers the generated answer to the user. The goal is to provide a response that is as accurate and relevant as possible.

Query Processing: A user's question is transformed into a vector, enabling the RAG system to grasp the meaning and search for corresponding content in the document database.

Relevant Information Retrieval: The retriever searches the database with the query's vector to find closely related document sections. This is achieved by calculating similarity based on the "distance" between the question vector and the vectors of documents in the database.
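A rough sketch of this lookup follows. The in-memory dictionary stands in for a vector database, the three-dimensional vectors are invented, and Euclidean distance is just one possible choice of "distance". Production systems use embeddings with hundreds of dimensions and approximate nearest-neighbor indexes.

```python
import heapq
import math

# Hypothetical embeddings; a real system would obtain these from an
# embedding model and keep them in a vector database.
index = {
    "RAG combines a retriever with a generator.": [0.9, 0.1, 0.0],
    "Cosine similarity compares vector directions.": [0.2, 0.9, 0.1],
    "The crane spread its wings.": [0.0, 0.1, 0.9],
}

def retrieve(query_vector, index, k=2):
    """Return the k stored sections whose vectors are closest to the query."""
    return heapq.nsmallest(
        k, index, key=lambda text: math.dist(query_vector, index[text])
    )

query_vector = [0.8, 0.2, 0.1]  # stand-in for the embedded user question
for section in retrieve(query_vector, index):
    print(section)
# RAG combines a retriever with a generator.
# Cosine similarity compares vector directions.
```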

Answer Generation: The generator receives the query alongside the most relevant sections from the documents. Leveraging this information, it generates a coherent and contextually appropriate response. Effective prompt engineering is essential to guide the language model toward more accurate outcomes.
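One way to hand the retrieved sections to the generator is to place them in the prompt ahead of the question. The template wording and the function name below are illustrative assumptions, and the call to the language model itself is left out because it depends on the provider.

```python
def build_prompt(question: str, sections: list[str]) -> str:
    """Assemble a prompt that asks the model to answer from the given context."""
    context = "\n".join(
        f"[{i}] {section}" for i, section in enumerate(sections, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_prompt(
    "What are the two parts of a RAG system?",
    ["RAG combines a retriever with a generator."],
)
print(prompt)
```

Telling the model to admit when the context lacks the answer is a common prompt engineering choice aimed at reducing guessed responses.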

For those aiming to develop a RAG-based application, pre-built language models from platforms such as Hugging Face can be used. These platforms offer the necessary tools, including vector database options.

Improving system precision is feasible with well-crafted prompts.

Retrieval-Augmented Generation: Pros and Cons

Understanding the benefits and challenges of Retrieval-Augmented Generation (RAG) is critical for those using or developing large language models (LLMs).

Advantages of RAG

  • Cost-effective Training: Unlike intensive fine-tuning processes, RAG requires less computational power and data. You just need to index documents into the knowledge base.
  • Access to Various Knowledge Sources: RAG merges the knowledge stored in the model's own parameters with knowledge from external databases. This results in more accurate answers and fewer fabricated ones, especially in tasks like question-answering.
  • Enhanced Scalability: RAG is adept at handling large datasets and intricate inquiries thanks to vector databases. It surpasses conventional LLMs, which are constrained by their context window size, by retrieving information from a broader range.

Challenges Faced by RAG

  • Risk of Hallucinations: Even RAG can make mistakes. If the database lacks certain information, the model might guess the response, which can lead to inaccuracies.
  • Managing Scalability: While RAG handles large databases well, increasing the database size can complicate quick and efficient data retrieval.
  • Potential Biases in Data: Biases in the retrieval database can taint the responses, raising ethical concerns. Fortunately, startups like Lakera are developing tools to spot and lessen biases in AI systems. By using these tools, developers and researchers can improve their models for fairer outputs.

All in all, RAG's methodology presents both notable advantages and distinct challenges. 

While it economizes on resources and enhances performance, it must also contend with potential inaccuracies and the complexity that comes with scale.

As the AI field advances, tools to address RAG's challenges will likely improve, making it an even more reliable approach to augmenting LLMs.

Real-World Uses of Retrieval Augmented Generation Technology

Retrieval Augmented Generation (RAG) technology enhances various industries by improving how information is located, processed, and utilized. Here are some practical applications across sectors:

Enhanced Search Outcomes

RAG technology enriches search results by pairing with external databases. This process is especially valuable in healthcare for examining Electronic Medical Records (EMRs) or finding clinical trials. RAG pulls up-to-date, detailed information that is critical for patient care.

Interactive Data Conversations

Users can interact with databases using natural language thanks to RAG. This "Talk to your data" approach simplifies complex data interactions, making it user-friendly for non-technical stakeholders to query databases directly.

Advanced Customer Support Chatbots 

RAG-equipped chatbots elevate the support experience across various industries. These chatbots tap into extensive databases to give precise responses to customer inquiries. This is indispensable in IT for coding-related issues or in manufacturing for pinpointing production errors.

Summarization for Efficiency 

Summarizing large volumes of data becomes streamlined with RAG, making the information more digestible. In education, this could enhance activities like grading essays or creating condensed study materials.

Data-Driven Decision Making 

RAG aids decision-making by identifying patterns and insights within large datasets. In finance and the legal field, RAG helps draft contracts and condense regulatory documents. Access to current information is essential for accurate decisions in these sectors.

By integrating retrieval-augmented generation with these functions, professionals can leverage accurate, up-to-date information to deliver better outcomes in their fields.

Understanding Retrieval Augmented Generation: TL;DR

Retrieval Augmented Generation (RAG) enhances the output of foundational language models. It does this by adding an external retrieval system. This system helps to create responses that are more accurate and suited to the context.

Foundation models have a wide range of knowledge, but they learn from data that doesn’t change. Because of this, the model might generate outdated or incorrect information.

RAG offers a solution. It includes a retrieval component that uses dynamic, external data sources. This means it can offer more relevant information in response to a query.

How does RAG work?

  • The retriever: It searches a large data set to find relevant info for the query.
  • The generator: It then takes this info and crafts the final response.

The retriever isn't just about matching keywords. It employs semantic search techniques, like cosine similarity. This finds documents that share the same ideas as the query, beyond just similar words.

To do this, the retriever takes the text of both the query and potential source documents and turns them into embeddings, which are numerical representations. These let us compare how similar they are in a concept space, even if the words used are different.

Then, the generator, which is a language model, uses the context from the retriever. It adds that to its existing knowledge to put together coherent and accurate answers.

RAG is useful in areas like law, retail, healthcare, and finance. These are areas where having the latest and most precise information is critical.

RAG has several benefits. It can reduce the need for extensive training, pull from a variety of knowledge bases, and it's scalable.

However, it can also have issues. These include generating convincing but incorrect information (hallucinations), scaling complexities, and biases from the data it pulls from.
