RAG:
RAG stands for Retrieval Augmented Generation. It is the process of optimizing the
output of a large language model so that it references an external knowledge base
outside of its training data sources before generating a response. Large Language
Models (LLMs) are trained on vast volumes of data and use billions of parameters to
generate original output for tasks like answering questions, translating languages,
and completing sentences. RAG extends the already powerful capabilities of LLMs to
specific domains or an organization's internal knowledge base, all without the need
to retrain the model. It is a cost-effective approach to improving LLM output so that
it remains relevant, accurate, and useful in various contexts.
In other words, RAG is an AI framework that combines the strengths of traditional
information retrieval systems (such as search engines and databases) with the
capabilities of generative large language models (LLMs).
Steps involved in a RAG system:
1. Data preparation:
a. Raw data sources:
This is the initial stage where raw, unstructured data is collected from
diverse sources such as PDF documents, web pages, internal
databases, etc. These sources contain valuable domain-specific
knowledge that is not necessarily part of a pre-trained language
model’s internal parameters. This raw information is essential because
a RAG system depends on external knowledge to provide accurate and
grounded responses.
However, in its current state, this raw data is not directly usable by AI
models due to inconsistencies in formatting, structure, and content
types.
Note: In generative AI, grounding refers to the process of connecting a
large language model's (LLM) output to verifiable sources of
information, ensuring that the AI's responses are accurate, reliable, and
grounded in reality, rather than relying solely on its internal knowledge.
This is crucial for reducing "hallucinations", i.e., instances where the AI
makes up information.
b. Information Extraction:
Once the raw data is gathered, the next task is to extract useful
information from it. This involves using tools like OCR (Optical
Character Recognition) to digitize scanned documents, PDF parsers to
read and convert PDFs, web crawlers to scrape HTML content, and
other extraction tools for CSVs, images, or audio. The goal is to
standardize the content into a structured or semi-structured plain text
format. For instance, metadata, headings, paragraphs, tables, and
images from a PDF report might be cleaned and rearranged into plain
text. This step ensures the information is clean, readable, and ready for
processing. Without this, the content might contain noise, repeated
patterns, or irrelevant characters that could reduce the quality of
embeddings and retrievals.
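As a concrete illustration, below is a minimal sketch of the extraction step for a
single PDF using the pypdf library (one possible parser among many; the file name
is hypothetical):

from pypdf import PdfReader

def extract_pdf_text(path: str) -> str:
    # Read every page of the PDF and return its text as one cleaned string.
    reader = PdfReader(path)
    pages = [page.extract_text() or "" for page in reader.pages]
    # Collapse repeated whitespace so downstream steps see clean text.
    return " ".join(" ".join(pages).split())

# Hypothetical usage:
# text = extract_pdf_text("company_policies.pdf")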
c. Chunking:
Chunking refers to separating text into manageable units.
After the data has been extracted and cleaned, it is split into smaller,
manageable units called chunks. This process is important because
modern AI models, including those used for embeddings and
generation (like BERT, OpenAI Embeddings, etc.), have token limits.
Chunking ensures that the data is broken down into semantically
meaningful segments such as paragraphs, sentences, or even
sections, depending on the context and the desired granularity
(the level of detail at which data is stored and analyzed). These
chunks act as the atomic units for
storage and retrieval later in the pipeline. Importantly, good chunking
also considers the context window to maintain continuity across
segments. For example, blindly cutting every 100 words might break
sentences, so smarter approaches use semantic or sentence-aware
chunking strategies.
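One possible sentence-aware chunking strategy is sketched below; the chunk size
and overlap values are illustrative assumptions, not recommendations:

import re

def chunk_text(text: str, max_chars: int = 500, overlap_sentences: int = 1) -> list[str]:
    # Split on sentence boundaries, then pack whole sentences into chunks,
    # carrying the last sentence(s) of each chunk into the next one so that
    # context is not lost at chunk boundaries.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], []
    for sentence in sentences:
        if current and sum(len(s) for s in current) + len(sentence) > max_chars:
            chunks.append(" ".join(current))
            current = current[-overlap_sentences:]
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks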
d. Embedding:
Each chunk is then passed through an embedding model, which
transforms it into a high-dimensional vector representation. These
embeddings capture the semantic meaning of the text rather than just
its surface form. That means two chunks with different words but
similar meanings will have embeddings close to each other in the
vector space. For example, “How to reset my password?” and “Steps to
change my login credentials” would produce similar vectors.
These vectors are then stored in a vector database (e.g., FAISS,
Pinecone, Chroma), which allows efficient similarity search. This step is
crucial because it enables fast and accurate retrieval based on
meaning rather than keyword matching.
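The following sketch shows how chunks might be embedded and indexed, assuming
the sentence-transformers library and FAISS (the model name and sample chunks
are illustrative assumptions):

import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# Model name is an assumed example; any sentence-embedding model works similarly.
model = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "How to reset my password?",
    "Steps to change my login credentials",
    "Electronics may be returned within 15 days of purchase with the original receipt.",
]

# Encode chunks into vectors; normalizing them lets the inner product
# behave like cosine similarity.
embeddings = np.asarray(
    model.encode(chunks, normalize_embeddings=True), dtype="float32"
)

index = faiss.IndexFlatIP(embeddings.shape[1])  # exact inner-product index
index.add(embeddings)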
2. RAG Workflow:
a. Query:
This begins the retrieval phase. A user sends a natural language
question or request called the “query.” For example, someone might
ask, “What is the return policy on electronics?” This query, though it
looks simple, can vary significantly in wording and requires contextual
understanding to be answered properly.
The raw query text is not useful on its own. It must be transformed to
match the vector space of the preprocessed knowledge base (the
vector database we created during data preparation). That leads us to
the next step.
b. Embedding the query:
Just like the data chunks processed earlier, the user’s query is
passed through the same embedding model. This generates a query
vector that lives in the same high-dimensional semantic space as the
stored data. This transformation ensures that
instead of relying on exact keywords, the system can retrieve results
that are semantically aligned with the user’s intent. This vector is now
ready to be used in a similarity search within the vector database to
find the most relevant data.
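Continuing the earlier sketch, embedding the query is simply a reuse of the same
model:

query = "What is the return policy on electronics?"
query_vector = np.asarray(
    model.encode([query], normalize_embeddings=True), dtype="float32"
)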
c. Vector Search:
The query vector is compared against the stored vectors in the vector
database using similarity metrics like cosine similarity or inner product.
The database returns the top n most relevant chunks that are closest to
the query vector. These retrieved pieces of information (e.g., top 5 or
top 10 chunks) are considered the most contextually relevant and are
assumed to contain the answer or background knowledge needed to
fulfill the user’s request. This is the “retrieval” part of Retrieval
Augmented Generation. These results are then sent forward for
synthesis.
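Continuing the same sketch, the similarity search against the FAISS index might look
like this (the value of top_k is an illustrative choice):

top_k = 3  # illustrative value; real systems often retrieve 5-10 chunks
scores, indices = index.search(query_vector, top_k)
retrieved_chunks = [chunks[i] for i in indices[0]]
# With the toy data above, the returns-policy chunk should rank highest.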
d. Augmentation:
Now, the retrieved relevant chunks are combined with the original user
query to form an augmented prompt. This prompt typically includes
both the user’s original question and the retrieved context (in a format
like: “Context: [retrieved text] \n\n Question: [user query]”). This
augmented input is what is fed into the Large Language Model (LLM).
Because the LLM now has access to real-world, specific information
(that may not have been part of its training set), it can generate more
accurate, up-to-date, and grounded answers. This step is critical in
preventing hallucinations: the model no longer needs to guess or make
up answers, because the relevant context is provided directly.
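A minimal sketch of assembling the augmented prompt in the format described
above:

context = "\n".join(retrieved_chunks)
augmented_prompt = f"Context: {context}\n\nQuestion: {query}"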
e. Generate Response:
Finally, the LLM processes the combined context and query and
generates a natural language response. Since it is equipped with
relevant, up-to-date knowledge from the retrieval step, the output tends
to be far more accurate and specific than responses generated by a
traditional, standalone LLM. For instance, the system can respond
with: “According to the company’s electronics return policy, items must
be returned within 15 days of purchase with the original receipt.” This
response is tailored, context-aware, and reliable because it is grounded
in actual documentation retrieved earlier.
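To round out the sketch, the augmented prompt can be sent to an LLM. The example
below assumes the OpenAI Python client and an example model name; any
chat-capable LLM could stand in here:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed example model name
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": augmented_prompt},
    ],
)
print(response.choices[0].message.content)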