RAG System
Quick Answer: A RAG (Retrieval-Augmented Generation) system combines large language models with external knowledge retrieval to generate accurate, domain-specific responses grounded in private data.
What Is a RAG System?
RAG (Retrieval-Augmented Generation) is an architecture pattern that connects large language models to your private data without retraining the model.
Here’s the problem RAG solves: LLMs are smart, but they only know what they were trained on. Ask ChatGPT about your company’s internal policies, customer support history, or proprietary research, and it draws a blank. You can’t fine-tune a model every time your data changes. That’s expensive, slow, and breaks the moment you add new information.
RAG fixes this. Instead of stuffing everything into the model’s training data, you store knowledge in a searchable database. When someone asks a question, the system retrieves relevant information first, then feeds it to the LLM along with the query. The model generates an answer grounded in your actual data, not hallucinated from thin air.
Why RAG matters for enterprise: Most Fortune 500 companies won’t send proprietary data to external APIs for fine-tuning. RAG lets you keep data in your infrastructure while still getting LLM-powered insights. Your documents stay in your vector database. Your embeddings run in your environment. Zero data leakage.
The pattern is simple: Retrieve the right context, augment the prompt with that context, generate a response. That’s it. But the implementation details separate working pilots from production systems that actually save money.
How RAG Works: The 3-Step Process
Every RAG system follows the same basic flow, whether you’re building customer support agents or internal knowledge retrieval.
Step 1: Retrieve
When a user asks a question, you don’t send it straight to the LLM. You search your knowledge base first.
The process:
- Convert the user’s question into a vector embedding (a numerical representation)
- Search your vector database for similar embeddings (semantic search)
- Rank results by relevance score
- Pull the top N most relevant chunks (usually 3-10)
Example: User asks “What’s our return policy for damaged items?” The system searches embeddings of your knowledge base and retrieves the 5 most relevant policy documents about returns, damage claims, and refund procedures.
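The retrieve step above can be sketched in a few lines of plain Python. This is a toy linear scan with 3-dimensional vectors standing in for real embeddings; a production vector database does the same math over 1,000+ dimensions with an approximate index (HNSW) instead of a full scan:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of magnitudes: 1.0 = same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_top_k(query_embedding, indexed_chunks, k=3):
    # indexed_chunks: list of (chunk_text, embedding) pairs.
    scored = [(cosine_similarity(query_embedding, emb), text)
              for text, emb in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Same idea, different scale: the vector database exists so this ranking stays fast past millions of chunks.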
What makes retrieval hard: Not all chunks are created equal. Split a 500-word document into 100-word chunks and you can sever critical context across chunk boundaries. Retrieval quality depends on how you chunk, what metadata you attach, and whether you rerank results before sending them to the LLM.
Step 2: Augment
Take the retrieved documents and stuff them into the LLM’s context window along with the original question.
The prompt structure:
```
Context: [Retrieved documents 1-5]

Question: What's our return policy for damaged items?

Instructions: Answer based only on the provided context. If the answer isn't in the context, say so.
```
You’re not changing the model. You’re changing the input. The LLM sees both the question and the supporting evidence at the same time.
Why this works: LLMs are excellent at reasoning over provided information. They struggle when they have to recall training data from months ago. RAG gives them fresh, specific context for every query.
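The augment step is just string assembly. A minimal sketch (the `[1]`, `[2]` chunk-numbering format is an arbitrary choice for citation, not a standard):

```python
def build_prompt(question, retrieved_chunks):
    # Number each chunk so the model can cite sources as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Instructions: Answer based only on the provided context. "
        "If the answer isn't in the context, say so."
    )
```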
Step 3: Generate
The LLM reads the retrieved context and generates a response. Because the answer is grounded in real documents, hallucination rates drop dramatically.
What you get:
- Answers based on your actual data, not the model’s training set
- Source citations showing which documents informed the response
- The ability to update knowledge without retraining (just update the vector database)
Production detail: Most enterprise RAG systems include a confidence score with each response. If retrieval scores are low, the system says “I don’t have enough information” instead of making something up. That’s the difference between a demo and a system you can trust with customer-facing queries.
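The refusal logic described above reduces to a threshold check before generation. A sketch, assuming your retriever exposes similarity scores (the 0.75 cutoff is illustrative; tune it against your own score distribution):

```python
REFUSAL = "I don't have enough information to answer that."

def answer_or_refuse(retrieval_scores, generate_answer, threshold=0.75):
    # If even the best-matching chunk scored poorly, refuse rather than
    # let the LLM improvise from weak context.
    if not retrieval_scores or max(retrieval_scores) < threshold:
        return REFUSAL
    return generate_answer()
```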
Why AI Teams Need RAG: The Business Case
If you’re building AI agents that interact with company-specific knowledge, you need RAG. Here’s why.
Data Sovereignty
Fine-tuning sends your data to third-party APIs. RAG keeps it in your infrastructure.
For Fortune 500 companies with compliance requirements, this isn’t optional. You can’t send customer PII, financial records, or proprietary research to OpenAI’s servers for model training. But you can store embeddings in your own vector database and query them locally.
The deployment pattern: Run embeddings on-premise or in your VPC. Store vectors in Qdrant, Weaviate, or Postgres with pgvector. Query LLMs via API, but never send raw documents through the wire. Only send the question and retrieved chunks (which you control).
Speed to Production
Fine-tuning a model takes weeks. Training from scratch takes months. RAG deployments can go live in days.
Real-world timeline: Working RAG pilot in one week or less for most use cases. Chunk your documents, generate embeddings, set up vector search, wire it to an LLM. Done. Production hardening (reranking, metadata filtering, hybrid search) takes another 2-4 weeks depending on data complexity.
Compare that to fine-tuning, where you need labeled data, multiple training runs, evaluation datasets, and model deployment infrastructure. By the time a fine-tuned model is ready, your data has changed and you’re starting over.
Cost Efficiency
RAG is cheaper than fine-tuning at enterprise scale.
The math: Fine-tuning costs $10K-50K+ per model depending on data size and iteration cycles. RAG uses pre-trained models with retrieval overhead. Embedding generation is a one-time cost. Vector search is fast and cheap. LLM inference costs are the same whether you use RAG or fine-tuning.
Where RAG wins: You update knowledge by adding documents to your database, not by retraining models. That’s a database insert, not a GPU cluster.
Measurable ROI
RAG ties directly to hero metrics that move P&L.
Use cases with clear ROI:
- Customer support: 60-80% reduction in response time, 40% reduction in support tickets escalated to humans
- Internal knowledge retrieval: 35% reduction in time spent searching documentation
- Research analysis: 70% faster document review for due diligence, compliance, legal discovery
- Data processing: 50% reduction in manual data entry and classification tasks
These aren’t vanity metrics. They’re hours saved and costs reduced. That’s the AI with ROI promise.
RAG Architecture Patterns: Basic to Advanced
Not all RAG systems are built the same. Here’s how the architecture evolves from prototype to production.
Basic RAG: The Starting Point
Components:
- Document chunker (split text into 500-1000 word segments)
- Embedding model (text-embedding-ada-002 or open-source alternatives)
- Vector database (Qdrant, Weaviate, Pinecone, pgvector)
- LLM (GPT-4, Claude, or open-source models)
Flow:
- Chunk documents and generate embeddings
- Store embeddings in vector database with metadata
- On query: embed question, retrieve top-k chunks, send to LLM
- Return LLM response with source citations
When this works: Small knowledge bases (under 10K documents), low query volume, internal tools where occasional wrong answers aren’t critical.
When this breaks: Large document sets where simple semantic search returns irrelevant chunks. High-stakes use cases where accuracy matters. Complex queries that require reasoning across multiple documents.
Advanced RAG: Production-Grade
Additional components:
- Hybrid search (combine vector similarity with keyword matching)
- Reranking models (Cohere Rerank, cross-encoders)
- Metadata filtering (date ranges, document types, access controls)
- Query transformation (rewrite vague questions into specific searches)
- Answer validation (check if retrieved context actually supports the answer)
Flow:
- Transform user query for better retrieval
- Run hybrid search (vector + keyword)
- Filter by metadata (user permissions, date relevance)
- Rerank top 50 results to find best 5
- Validate retrieved chunks contain answer-relevant information
- Generate response with confidence scoring
- Log query, retrieval scores, and sources for observability
When you need this: Customer-facing applications, compliance-heavy industries, high-volume query loads, multi-tenant systems with access controls.
The difference: Basic RAG works 70% of the time. Advanced RAG works 95% of the time. That 25-point gap is the difference between a demo and a production system.
Hybrid RAG: Combining Approaches
Sometimes retrieval alone isn’t enough. Hybrid systems combine RAG with other techniques.
Common hybrid patterns:
- RAG + Fine-tuning: Fine-tune for domain-specific language, use RAG for up-to-date facts
- RAG + Prompt Engineering: Structured prompts guide LLM reasoning over retrieved content
- RAG + Knowledge Graphs: Retrieve entities and relationships, not just text chunks
- RAG + SQL: Query structured databases for precise data, use RAG for unstructured knowledge
Example deployment: A financial services agent uses RAG for policy documents, SQL queries for account data, and fine-tuning for industry-specific terminology. Each pattern handles what it does best.
Production reality: Most enterprise agents aren’t pure RAG. They’re orchestration layers that route queries to the right retrieval mechanism based on question type.
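That orchestration layer can start as crude keyword dispatch. A toy sketch for the financial-services example above (the keywords are illustrative; real routers use an LLM classifier or intent model):

```python
def route_query(query):
    # Route to SQL for precise account data, RAG for unstructured knowledge.
    q = query.lower()
    if any(term in q for term in ("balance", "transaction", "account number")):
        return "sql"
    if any(term in q for term in ("policy", "procedure", "terms")):
        return "rag"
    return "rag"  # default: unstructured knowledge retrieval
```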
How to Implement RAG: Step-by-Step
Here’s the methodology used in fast deployments. No theory, just what works.
Step 1: Pick Your Embedding Model
Options:
- OpenAI text-embedding-ada-002: Best accuracy, $0.0001 per 1K tokens, closed-source
- sentence-transformers/all-MiniLM-L6-v2: Open-source, fast, runs locally
- Cohere Embed: Multilingual support, strong performance
- voyage-ai: Optimized for retrieval tasks
Decision factors: Cost, latency, language support, data sovereignty requirements.
For most enterprise deployments: Start with OpenAI embeddings if you can send data to APIs. Switch to open-source models (all-MiniLM, BGE) if you need on-premise deployment.
Step 2: Chunk Your Documents
Chunking strategies:
- Fixed-size chunks: Split every 500 words (simple, fast, loses context at boundaries)
- Semantic chunking: Split by paragraphs or sections (preserves meaning, variable size)
- Sliding window: Overlapping chunks to preserve context across splits
- Recursive chunking: Split by headers, then paragraphs, then sentences until target size
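The sliding-window strategy from the list above fits in a few lines of plain Python (word-based for clarity; production splitters usually work on characters or tokens):

```python
def sliding_window_chunks(words, chunk_size=500, overlap=100):
    # Each chunk repeats the last `overlap` words of the previous one,
    # so context survives the split boundary.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```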
Production pattern:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Target chunk size
    chunk_overlap=200,    # Overlap to preserve context
    separators=["\n\n", "\n", ". ", " "]  # Split hierarchy
)
chunks = splitter.split_documents(documents)
```
Metadata to attach:
- Document title and source URL
- Creation/update timestamps
- Author or department
- Document type (policy, FAQ, technical doc)
- Access control tags
Why metadata matters: You can filter retrieval by date (“show me policies updated in 2024”) or by permission level (“only show documents this user can access”).
Step 3: Generate and Store Embeddings
Code example with LangChain:
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vectorstore = Qdrant.from_documents(
    documents=chunks,
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name="company_knowledge"
)
```
What’s happening:
- Each chunk gets converted to a 1536-dimension vector
- Vectors get stored in Qdrant with metadata
- Qdrant builds an HNSW index for fast similarity search
Production detail: Batch your embedding calls (1000 chunks at a time) to avoid rate limits. Monitor embedding costs (large document sets can run $500+ in embedding fees).
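The batching pattern is simple. A sketch where `embed_batch` is a stand-in for whatever batch call your embedding client exposes:

```python
def batched(items, batch_size=1000):
    # Yield fixed-size batches so each API call stays under rate limits.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_all(chunks, embed_batch, batch_size=1000):
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

Add retry-with-backoff around the `embed_batch` call for production runs; transient rate-limit errors are the norm, not the exception.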
Step 4: Build the Retrieval Pipeline
Basic retrieval:
```python
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}  # Retrieve top 5 chunks
)
docs = retriever.get_relevant_documents("What's our return policy?")
```
Advanced retrieval with reranking:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

compressor = CohereRerank(model="rerank-english-v2.0", top_n=5)
retriever = ContextualCompressionRetriever(
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
    base_compressor=compressor
)
docs = retriever.get_relevant_documents("What's our return policy?")
```
The difference: Basic retrieval pulls the top 5 semantically similar chunks. Reranking pulls 20 candidates, then uses a cross-encoder to find the 5 most relevant to the actual question. Accuracy jumps 15-25% with reranking.
Step 5: Connect Retrieval to LLM
RAG chain with LangChain:
```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": custom_prompt_template}
)

result = qa_chain({"query": "What's our return policy for damaged items?"})
print(result["result"])            # LLM-generated answer
print(result["source_documents"])  # Retrieved chunks used
```
Custom prompt template:
```python
from langchain.prompts import PromptTemplate

template = """You are a helpful assistant answering questions about company policies.
Use the following context to answer the question. If the answer isn't in the context, say "I don't have that information in the provided documents."

Context:
{context}

Question: {question}

Answer:"""

custom_prompt_template = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)
```
Why the custom prompt: Prevents hallucination by explicitly telling the model to only use provided context. Production systems add confidence scoring and source citation requirements.
Step 6: Add Observability and Iteration
What to log:
- User query
- Retrieval scores for top chunks
- LLM response
- Source documents used
- Response time (retrieval + LLM generation)
- User feedback (thumbs up/down)
Why this matters: You can’t improve what you don’t measure. Low retrieval scores mean your chunking strategy is wrong. High retrieval scores but wrong answers mean your prompt needs work. User feedback tells you which queries need better sources.
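A minimal log record covering the fields above, written as one JSON object per query via a pluggable `log_fn` (a stand-in for stdout, a file, or your observability stack):

```python
import json
import time

def log_rag_query(query, retrieval_scores, answer, sources,
                  started_at, log_fn=print):
    record = {
        "query": query,
        "retrieval_scores": retrieval_scores,
        "answer": answer,
        "sources": sources,
        "latency_ms": round((time.time() - started_at) * 1000, 1),
    }
    log_fn(json.dumps(record))  # one JSON line per query, easy to grep and aggregate
    return record
```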
Iteration loop:
- Deploy RAG system
- Monitor failed queries (low retrieval scores or negative feedback)
- Identify patterns (missing documents, bad chunking, unclear queries)
- Add missing knowledge or adjust chunking strategy
- Measure improvement, repeat
Fast deployments iterate weekly. Slow deployments wait months between improvements. Speed compounds.
RAG vs. Alternatives: When to Use What
RAG isn’t always the right choice. Here’s when to use it and when to pick something else.
| Approach | Best For | Time to Deploy | Cost | Data Freshness |
|---|---|---|---|---|
| RAG | Domain-specific knowledge, frequently updated data | 1 week pilot, 2-4 weeks production | Low (embeddings + vector DB) | Real-time (update DB anytime) |
| Fine-tuning | Domain-specific language, style consistency | 3-6 weeks | Medium-High ($10K-50K per iteration) | Static (retrain to update) |
| Prompt Engineering | Task formatting, output structure | Hours to days | None (base model only) | Real-time (change prompt anytime) |
| Context Injection | Small, static knowledge (under 10K tokens) | Minutes | None | Real-time (update prompt) |
RAG vs. Fine-Tuning
Use RAG when:
- Knowledge changes frequently (product docs, policies, support tickets)
- You need to cite sources for answers
- Data privacy requires on-premise deployment
- You have large knowledge bases (100K+ documents)
Use Fine-tuning when:
- You need domain-specific language (medical, legal, technical jargon)
- Knowledge is stable and doesn’t change often
- You need consistent output formatting
- Retrieval overhead is too slow for your use case
Real-world pattern: Combine them. Fine-tune for industry language, use RAG for up-to-date facts.
Example: A legal AI agent is fine-tuned on legal writing style but uses RAG to retrieve case law and statutes. The model sounds like a lawyer, but the facts come from your legal database.
RAG vs. Prompt Engineering
Use RAG when:
- Information doesn’t fit in the context window
- Knowledge is too large to include in every prompt
- Facts need to be verifiable with source citations
Use Prompt Engineering when:
- You’re formatting output (JSON, structured data)
- You’re guiding reasoning (chain-of-thought, few-shot examples)
- Information is small and static
Real-world pattern: Use both. Prompt engineering structures the output, RAG provides the facts.
Example: A customer support agent retrieves product documentation (RAG), then uses a structured prompt to format the response as a step-by-step guide.
RAG vs. Context Injection
Context injection: Stuffing all your knowledge into the system prompt.
When it works: Small knowledge bases under 10K tokens, static information that rarely changes.
When it breaks: Context windows fill up fast. GPT-4's 128K-token context holds roughly 250 pages of typical text. If your knowledge base is 500 pages, context injection won't work. Plus, you pay for every token in the context on every query. RAG only pays for the retrieved chunks.
The decision: If your knowledge fits comfortably in the context window and never changes, skip RAG and just include it in the prompt. If it’s large or dynamic, RAG is cheaper and more maintainable.
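The cost difference is straightforward arithmetic. A sketch with illustrative numbers only (the $0.01-per-1K-token price is a placeholder, not any provider's actual rate):

```python
def monthly_input_cost(tokens_per_query, queries_per_day, price_per_1k_tokens=0.01):
    # Input-token spend for 30 days of traffic.
    return tokens_per_query / 1000 * price_per_1k_tokens * queries_per_day * 30

# Context injection: ship the full 100K-token knowledge base on every query.
stuff_everything = monthly_input_cost(100_000, queries_per_day=1_000)

# RAG: ship only ~2K tokens of retrieved chunks per query.
rag = monthly_input_cost(2_000, queries_per_day=1_000)
```

At these assumed numbers, context injection costs 50x more per month than retrieval; the ratio holds regardless of the actual per-token price.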
Real-World RAG Examples: What Actually Works
Here’s what production RAG systems look like across industries. These aren’t hypotheticals. These are deployment patterns that drive measurable ROI.
Customer Support: 60% Faster Response Times
The use case: A Fortune 500 retailer with 10K+ support articles, product documentation, and policy guides. Agents spend 40% of their time searching for answers.
RAG deployment:
- Embedded all support docs, FAQs, and product manuals
- Built hybrid search (vector + keyword) to handle specific product SKUs
- Added metadata filtering by product category and date
- Deployed as Slack bot for internal agents and web widget for customers
Results:
- 60% reduction in average response time (from 8 minutes to 3 minutes)
- 40% of queries resolved without human escalation
- $47K monthly savings from reduced support tickets
What made it work: Metadata filtering. Agents could search by product line (“show me return policies for electronics”) or by recency (“policies updated in last 30 days”). Generic semantic search wasn’t enough.
Legal Document Review: 70% Faster Due Diligence
The use case: Law firm conducting due diligence on thousands of contracts. Manual review takes weeks per engagement.
RAG deployment:
- Chunked contracts by clause type (payment terms, liability, termination)
- Used LegalBERT embeddings (fine-tuned for legal language)
- Built clause extraction and comparison workflows
- Added LLM reasoning over retrieved clauses for risk assessment
Results:
- 70% reduction in contract review time
- Automatic flagging of non-standard clauses
- Consistent risk scoring across all contracts
What made it work: Semantic chunking by clause. Fixed-size chunks cut critical legal language mid-sentence. Clause-level chunking preserved meaning and made retrieval far more accurate.
Medical Research: 80% Reduction in Literature Review Time
The use case: Pharmaceutical company researching drug interactions. Researchers manually review hundreds of papers per project.
RAG deployment:
- Embedded PubMed abstracts, clinical trial results, and internal research
- Used biomedical embeddings (BioBERT)
- Added metadata filtering by publication date, study type, sample size
- Deployed as research assistant with source citations for every claim
Results:
- 80% reduction in literature review time
- Automatic identification of conflicting study results
- Source citations ensure all claims are traceable to published research
What made it work: Domain-specific embeddings. Generic OpenAI embeddings don’t understand medical terminology. BioBERT embeddings capture relationships between diseases, drugs, and treatments that general models miss.
Financial Analysis: 50% Faster Earnings Report Processing
The use case: Investment firm analyzes earnings reports, SEC filings, and analyst calls for thousands of companies.
RAG deployment:
- Embedded quarterly earnings transcripts and 10-K/10-Q filings
- Built time-series retrieval (compare current quarter to historical performance)
- Added structured data extraction (revenue, EPS, guidance) alongside unstructured retrieval
- Deployed as internal analyst tool with automatic report generation
Results:
- 50% reduction in time spent processing earnings reports
- Automatic alerts when metrics deviate from historical trends
- Faster identification of investment opportunities
What made it work: Hybrid retrieval. Structured financial data (revenue numbers) came from SQL queries. Unstructured insights (management commentary) came from RAG. Combining both gave analysts the full picture.
Internal Knowledge Management: 35% Reduction in Search Time
The use case: Enterprise with 50K+ internal documents spread across SharePoint, Confluence, Google Drive, and email.
RAG deployment:
- Unified search across all knowledge sources
- Embedded wikis, meeting notes, design docs, onboarding materials
- Added access control filtering (only show documents user has permission to see)
- Deployed as enterprise search + Slack/Teams integration
Results:
- 35% reduction in time employees spend searching for information
- 50% reduction in duplicate documentation (RAG surfaces existing docs)
- Onboarding time cut from 4 weeks to 2 weeks
What made it work: Access control. A unified search is useless if it surfaces documents users can’t access. Metadata tagging with permission levels ensured retrieval respected existing access policies.
Deploy RAG in Under a Week with TMA
Most teams spend 3-6 months on RAG deployments. Fast deployments can be done in one week or less for most use cases.
Here’s the methodology.
Day 1: Define the Hero Metric
What are we measuring? Time saved, cost reduced, tickets deflected, documents processed?
If you can’t tie RAG to dollars saved or earned, don’t build it. Pick a use case with clear ROI. Customer support, document processing, internal knowledge retrieval. All have measurable outcomes.
Day 2-3: Data Preparation
Collect your documents, clean the data, chunk them, generate embeddings. This is 60% of the work.
Most deployments fail here because they underestimate data quality issues. PDFs with broken formatting, scanned documents without OCR, inconsistent metadata. Clean your data first or retrieval will be garbage.
Day 4: Build the Retrieval Pipeline
Set up your vector database, wire it to an embedding model, test retrieval quality. Query your knowledge base with real questions. Check if the top 5 results actually contain the answer.
If retrieval is wrong, adjust chunking or try hybrid search. Don’t move to LLM integration until retrieval works.
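Retrieval quality is easy to score before any LLM is involved. A sketch of a hit-rate check over a hand-written eval set (`retrieve` is a stand-in for your retriever; a "hit" means some top-k chunk contains the expected answer text):

```python
def retrieval_hit_rate(eval_set, retrieve, k=5):
    # eval_set: list of (question, expected_answer_substring) pairs.
    hits = 0
    for question, expected in eval_set:
        top_chunks = retrieve(question)[:k]
        if any(expected.lower() in chunk.lower() for chunk in top_chunks):
            hits += 1
    return hits / len(eval_set)
```

If the hit rate is low on questions you know the knowledge base covers, fix chunking or search before touching the LLM.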
Day 5: Connect to LLM and Test
Wire your retriever to the LLM, build the prompt template, run test queries. Check for hallucinations. Verify source citations.
Test edge cases. What happens when the answer isn’t in the knowledge base? Does it say “I don’t know” or make something up?
Day 6-7: Deploy Pilot
Ship it. Internal Slack bot, API endpoint for your customer support tool, web interface for employees. Start small. Measure results. Iterate.
Fast pilots beat slow perfection. Get feedback from real users, fix the obvious problems, deploy version 2.
Production hardening (weeks 2-4):
- Add reranking for better retrieval accuracy
- Implement hybrid search (vector + keyword)
- Add observability and logging
- Scale infrastructure for production load
- Integrate with existing auth and access controls
TMA’s differentiator: We’ve done this before. The methodology is proven. While others are scheduling discovery meetings, we’re processing real queries with real data. That’s the speed advantage.
What Goes Wrong with RAG: TMA Honesty
Here’s what breaks in production and how to fix it.
Chunking Mistakes: Context Loss
The problem: Split a 10-page document every 500 words and you’ll cut critical context in half. A sentence about “the policy applies to all full-time employees” ends up in one chunk while the actual policy details land in the next chunk.
What breaks: Retrieval pulls the policy details but misses the scope limitation. The LLM answers as if the policy applies to everyone.
The fix: Semantic chunking. Split by headers, paragraphs, or natural section breaks. Add overlapping chunks so context bleeds across boundaries. Test retrieval with queries that require multi-paragraph reasoning.
Production pattern: Use RecursiveCharacterTextSplitter with overlap, or build custom chunking logic that respects document structure.
Metadata Failures: Wrong Documents Retrieved
The problem: You retrieve a policy from 2019 when the user needs the updated 2024 version. Or you surface a document the user doesn’t have permission to access.
What breaks: Answers are technically correct but outdated or unauthorized. Compliance nightmare.
The fix: Attach metadata to every chunk (date, version, access level, document type). Filter retrieval by metadata before sending results to the LLM.
Production pattern:
```python
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 10,
        "filter": {
            "date": {"$gte": "2024-01-01"},
            "access_level": {"$in": user.permissions}
        }
    }
)
```
Reranking Gaps: Semantic Search Isn’t Enough
The problem: Vector similarity retrieves documents with similar words but different meanings. Query for “Python package management” and get results about shipping packages.
What breaks: Top 5 results look relevant to the embedding model but don’t answer the question.
The fix: Add a reranking step. Pull 20-50 candidates with vector search, then use a cross-encoder (Cohere Rerank, cross-encoder/ms-marco-MiniLM) to rerank by actual relevance to the query.
Production pattern: Retrieval recall goes from 70% to 90%+ with reranking. It’s not optional for high-stakes use cases.
Hallucination Despite RAG
The problem: Even with retrieved context, the LLM makes up details that aren’t in the source documents.
What breaks: User trust. One hallucinated fact undermines credibility of the entire system.
The fix: Prompt engineering and validation. Explicitly tell the model to only use provided context. Add a validation step that checks if the LLM’s answer is supported by the retrieved chunks. If validation fails, return “I don’t have enough information” instead of the generated answer.
Production pattern:
```python
prompt = """Use ONLY the following context to answer the question.

If the answer isn't in the context, respond with:
"I don't have that information in the provided documents."

Do not make up information. Do not use your training data.
Only use the context below.

Context: {context}

Question: {question}

Answer:"""
```
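The validation step can start as a crude lexical-overlap check before graduating to an NLI model or a second LLM call for entailment. A sketch (the 0.5 threshold and the length-3 word filter are illustrative knobs):

```python
def answer_is_grounded(answer, retrieved_chunks, min_overlap=0.5):
    # Fraction of the answer's content words that appear in the context.
    # Crude, but it catches answers invented from whole cloth.
    content_words = {w.strip(".,") for w in answer.lower().split() if len(w) > 3}
    if not content_words:
        return True
    context = " ".join(retrieved_chunks).lower()
    supported = sum(1 for w in content_words if w in context)
    return supported / len(content_words) >= min_overlap
```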
Query Transformation Gaps
The problem: Users ask vague questions like “What’s the return thing?” The embedding model doesn’t know what “return thing” means. Retrieval fails.
What breaks: Bad questions get bad answers, even if the knowledge exists in your database.
The fix: Query transformation. Rewrite vague queries into specific searches before retrieval.
Production pattern: Use an LLM to expand “What’s the return thing?” into “What is the return policy for damaged or defective items?” Then embed the expanded query.
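A sketch of that expansion step, where `llm_rewrite` stands in for a call to your LLM client (the 6-word specificity heuristic is illustrative; it just avoids paying rewrite latency on queries that are already precise):

```python
REWRITE_PROMPT = (
    "Rewrite this user question as a specific, self-contained "
    "search query, keeping its original intent:\n\n{question}"
)

def transform_query(raw_query, llm_rewrite, min_words=6):
    # Short, vague queries get expanded; already-specific queries
    # skip the extra LLM call.
    if len(raw_query.split()) >= min_words:
        return raw_query
    return llm_rewrite(REWRITE_PROMPT.format(question=raw_query))
```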
Scale and Cost Surprises
The problem: Embedding 1M documents costs $100 in API fees. Retrieval adds 100ms latency per query. LLM inference costs stack up at high volume.
What breaks: Proof of concept works great at 100 queries/day. Production deployment at 10K queries/day costs $5K/month and has 2-second response times.
The fix: Batch embedding generation. Use open-source embedding models for cost-sensitive deployments. Cache frequent queries. Optimize LLM usage (smaller models for simple queries, GPT-4 only when needed).
Production pattern: Monitor costs from day one. Most RAG deployments don’t fail technically, they fail economically because no one tracked spend.
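Caching frequent queries is a one-function change. A sketch using `functools.lru_cache` keyed on the normalized query text (exact-match caching only; semantic caching over embeddings is a further step):

```python
from functools import lru_cache

def make_cached_answerer(answer_fn, maxsize=1024):
    @lru_cache(maxsize=maxsize)
    def cached(normalized_query):
        return answer_fn(normalized_query)

    def answer(query):
        # Normalize so "Return Policy?" and "return policy?" share a cache slot.
        return cached(" ".join(query.lower().split()))

    return answer
```

Invalidate the cache when the knowledge base changes, or cached answers go stale the same way fine-tuned weights do.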
Master RAG with Agent Guild
Want to build production RAG systems that actually ship? Join the Agent Guild.
What you get:
- Real-world RAG deployment case studies
- Production-grade architecture patterns
- Access to engineers who’ve shipped 100+ RAG agents
- Weekly build sessions and code reviews
- Shared cost, shared upside on joint ventures
The model: You bring domain expertise and distribution. We bring the AI engineering muscle. We co-build RAG systems for your industry, share costs, share profits. You’re not hiring a dev agency. You’re partnering with builders who’ve done this before.
Who this is for:
- Domain experts with distribution (compliance, legal, medical, finance)
- A-player engineers who want to ship agents full-time
- Founders ready to build AI products, not AI demos
This isn’t a course. It’s a community of builders who ship production agents measured by ROI, not vibes.
RAG Implementation Code: LangChain and LlamaIndex
Here’s working code you can deploy today.
LangChain RAG Pipeline
Full implementation:
```python
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# 1. Load documents
loader = DirectoryLoader(
    './data',
    glob="**/*.txt",
    loader_cls=TextLoader
)
documents = loader.load()

# 2. Chunk documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = text_splitter.split_documents(documents)

# 3. Generate embeddings and store in vector database
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vectorstore = Qdrant.from_documents(
    documents=chunks,
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name="knowledge_base"
)

# 4. Set up retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# 5. Create custom prompt
template = """You are a helpful assistant. Answer the question using only the context provided.

Context: {context}

Question: {question}

If the answer isn't in the context, say: "I don't have that information."

Answer:"""
prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

# 6. Build RAG chain
llm = ChatOpenAI(model="gpt-4", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

# 7. Query the system
result = qa_chain({"query": "What is the return policy?"})
print(f"Answer: {result['result']}")
print(f"Sources: {[doc.metadata for doc in result['source_documents']]}")
```
LlamaIndex RAG Pipeline
Alternative implementation:
```python
import qdrant_client
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    ServiceContext,
    StorageContext,
)
from llama_index.embeddings import OpenAIEmbedding
from llama_index.llms import OpenAI
from llama_index.vector_stores import QdrantVectorStore

# 1. Load documents
documents = SimpleDirectoryReader('./data').load_data()

# 2. Set up Qdrant vector store
client = qdrant_client.QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(
    client=client,
    collection_name="knowledge_base"
)

# 3. Configure service context
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
llm = OpenAI(model="gpt-4", temperature=0)
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    chunk_size=1000,
    chunk_overlap=200
)

# 4. Build index
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context,
    storage_context=storage_context
)

# 5. Create query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact"
)

# 6. Query the system
response = query_engine.query("What is the return policy?")
print(f"Answer: {response}")
print(f"Sources: {response.source_nodes}")
```
Production Enhancements
Add reranking with Cohere:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Wrap your base retriever with reranking: fetch 20 candidates fast,
# then let the reranker pick the best 5
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
compressor = CohereRerank(model="rerank-english-v2.0", top_n=5)
retriever = ContextualCompressionRetriever(
    base_retriever=base_retriever,
    base_compressor=compressor,
)
```
Add metadata filtering:
```python
# Filter by date and document type
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {
            "date": {"$gte": "2024-01-01"},
            "type": "policy_document",
        },
    }
)
```
Add hybrid search (vector + keyword):
```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Combine vector search with keyword search
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
keyword_retriever = BM25Retriever.from_documents(chunks, k=5)
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, keyword_retriever],
    weights=[0.7, 0.3],  # 70% vector, 30% keyword
)
```
Frequently Asked Questions
What is a RAG system?
A RAG (Retrieval-Augmented Generation) system combines large language models with external knowledge retrieval. Instead of relying only on training data, RAG systems search a knowledge base for relevant information, then use that context to generate accurate, grounded responses.
How does RAG differ from fine-tuning?
RAG retrieves information at query time from a knowledge base. Fine-tuning retrains the model on new data. RAG is faster (days vs. weeks), cheaper (no GPU training costs), and handles dynamic data better (update the database, not the model). Fine-tuning is better for domain-specific language and style consistency.
When should I use RAG instead of prompt engineering?
Use RAG when your knowledge is too large to fit in the context window or changes frequently. Use prompt engineering when you’re formatting output or guiding reasoning. Most production systems use both: RAG provides the facts, prompt engineering structures the response.
What vector databases work best for RAG?
Qdrant, Weaviate, Pinecone, and pgvector (Postgres extension) are the most common. Qdrant and Weaviate offer the best performance for production deployments. Pinecone is fully managed but cloud-only. Pgvector works if you want vector search in your existing Postgres database.
How long does it take to deploy a RAG system?
Working pilots can be done in one week or less for most use cases. Production hardening (reranking, hybrid search, observability) takes 2-4 weeks depending on data complexity. Full deployments with integrations, access controls, and scaling take 4-8 weeks.
What's the cost of running a RAG system?
Embedding generation: roughly $0.0001 per 1K tokens, paid once per document and re-run only when content changes. Vector database hosting: $50-500/month depending on scale. LLM inference: $0.01-0.03 per query (GPT-4). Total monthly cost for 10K queries/month: $200-500. Far cheaper than fine-tuning at $10K-50K per iteration.
Can RAG work with proprietary data?
Yes. That’s the primary use case. RAG keeps data in your infrastructure. You control the vector database, embeddings run on-premise or in your VPC, and only the query and retrieved chunks go to the LLM API (or you can run open-source LLMs locally).
How do I prevent hallucinations in RAG systems?
Prompt engineering: Explicitly tell the model to only use provided context. Retrieval quality: If retrieval is bad, the LLM has nothing accurate to work with. Validation: Check if the generated answer is supported by the retrieved documents. Confidence scoring: Return “I don’t have enough information” when retrieval scores are low.
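The confidence-scoring idea can be sketched in a few lines. This is a minimal illustration, assuming the retriever returns `(chunk_text, similarity_score)` pairs; the function names and the 0.75 cutoff are assumptions to tune, not part of any library.

```python
# Refuse to answer when retrieval is weak instead of hallucinating.
FALLBACK = "I don't have enough information to answer that."

def answer_with_confidence(hits, generate, min_score=0.75):
    """hits: list of (chunk_text, similarity_score) from the retriever.
    generate: callable that turns a context string into an answer."""
    if not hits or max(score for _, score in hits) < min_score:
        return FALLBACK  # best match is below threshold: say "I don't know"
    context = "\n\n".join(text for text, _ in hits)
    return generate(context)
```

In production, `generate` would be your LLM call with the grounded prompt; the threshold should be calibrated against logged retrieval scores for known-good and known-bad queries.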
What chunk size should I use for RAG?
500-1000 words is the standard starting point. Smaller chunks (200-300 words) work for FAQ-style content. Larger chunks (1500-2000 words) work for long-form documents where context matters. Use overlap (200 words) to preserve context across chunk boundaries. Test retrieval quality and adjust.
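A word-based sliding-window chunker with overlap can be sketched as follows; the sizes are the illustrative starting points from above, not fixed constants.

```python
def chunk_words(text, chunk_size=800, overlap=200):
    """Split text into word-based chunks of ~chunk_size words, with
    `overlap` words repeated across boundaries to preserve context."""
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covered the tail
    return chunks
```

Production splitters (e.g. recursive character splitters) also respect sentence and paragraph boundaries, which this word-count sketch ignores.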
How do I handle multi-turn conversations in RAG?
Track conversation history and include previous Q&A pairs in the context. Rewrite the current question to be standalone (expand pronouns, add context from previous turns). Example: “What about pricing?” becomes “What is the pricing for the product mentioned earlier?” before retrieval.
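The rewrite step can be sketched like this. The prompt wording, function names, and the `llm` callable are assumptions; in practice `llm` would call your chat model, and the stub below only demonstrates the wiring.

```python
# Rewrite the latest question into a standalone query before retrieval.
REWRITE_PROMPT = (
    "Given the conversation so far, rewrite the final question so it "
    "stands alone (expand pronouns, add missing context).\n\n"
    "History:\n{history}\n\nQuestion: {question}\n\nStandalone question:"
)

def make_standalone(history, question, llm):
    if not history:  # first turn needs no rewriting
        return question
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    return llm(REWRITE_PROMPT.format(history=transcript, question=question))

# Wiring demo with a stub in place of a real model call:
history = [("user", "Tell me about the Pro plan."),
           ("assistant", "The Pro plan costs $49/month.")]
rewritten = make_standalone(
    history, "What about pricing?",
    llm=lambda prompt: "What is the pricing for the Pro plan?",
)
```

The rewritten question, not the raw one, is what you embed and send to the vector database.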
What's the difference between RAG and semantic search?
Semantic search retrieves relevant documents. RAG retrieves documents AND generates a synthesized answer. Semantic search returns “here are 5 related articles.” RAG returns “based on these articles, here’s the answer to your question.”
Can I use RAG with open-source models?
Yes. Use open-source embedding models (sentence-transformers, BGE) and open-source LLMs (Llama, Mistral, Falcon). This eliminates API dependencies and keeps everything on-premise. Trade-off: Open-source models are less accurate than GPT-4, but good enough for many use cases.
How do I evaluate RAG quality?
Retrieval quality: Measure precision (are the top results relevant?) and recall (do the results contain the answer?). Generation quality: Human evaluation, automated scoring (BLEU, ROUGE), user feedback (thumbs up/down). Production metric: Does the answer solve the user’s problem?
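Precision and recall at k are simple to compute once you have labeled relevant chunks per test query; this sketch assumes document IDs as labels.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved chunks that are relevant."""
    top = retrieved_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / max(len(top), 1)

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)
```

Run these over a held-out set of question/relevant-chunk pairs before and after changes like reranking or a new embedding model, so retrieval regressions show up as numbers rather than anecdotes.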
What's hybrid search in RAG?
Hybrid search combines vector similarity (semantic search) with keyword matching (BM25). Vector search handles “What’s the return policy?” Keyword search handles “Show me documents mentioning SKU-12345.” Combining both improves accuracy by 15-25% over vector search alone.
How does reranking improve RAG?
Reranking pulls 20-50 candidates with fast vector search, then uses a slower but more accurate cross-encoder to find the best 5. Think of it as a two-stage filter: fast-and-loose retrieval, then careful selection. Boosts accuracy without sacrificing speed.
Can RAG cite sources?
Yes. Most RAG implementations return the source documents used to generate the answer. You can display document titles, URLs, page numbers, or timestamps. Critical for enterprise use cases where answers need to be verifiable.
What metadata should I attach to chunks?
Document title and source URL, creation and update timestamps, author or department, document type (policy, FAQ, technical doc), access control tags (who can see this), version numbers. Metadata enables filtering (“show me policies updated in 2024”) and access control.
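Concretely, a chunk's payload might look like the following. The field names are conventions, not a schema any library requires, and the URL is a placeholder.

```python
# Illustrative metadata payload for one chunk.
chunk = {
    "text": "Refunds are issued within 14 days of purchase.",
    "metadata": {
        "title": "Refund Policy",
        "source_url": "https://example.com/policies/refunds",  # placeholder
        "created_at": "2024-01-15",
        "updated_at": "2024-06-02",
        "department": "Support",
        "doc_type": "policy_document",
        "access_tags": ["support", "sales"],  # who may retrieve this chunk
        "version": 3,
    },
}
```

Every field here is filterable at query time, which is what makes "policies updated in 2024, visible to Support" a one-line retriever configuration.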
How do I scale RAG to millions of documents?
Use approximate nearest neighbor search (HNSW, IVF) instead of exact search. Shard your vector database across multiple nodes. Cache frequent queries. Use smaller, faster embedding models. Monitor costs and latency as you scale.
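The query-caching lever is the cheapest to add. A minimal sketch, assuming exact-match caching after normalization; `answer_query` is a stand-in for the full retrieve-and-generate pipeline, and the cache size is illustrative.

```python
from functools import lru_cache

pipeline_calls = []  # instrumentation so the demo shows the cache working

def answer_query(question):
    pipeline_calls.append(question)  # expensive RAG pipeline runs here
    return f"answer to: {question}"

def normalize(query):
    """Collapse case and whitespace so near-identical phrasings share an entry."""
    return " ".join(query.lower().split())

@lru_cache(maxsize=4096)
def _cached(norm_query):
    return answer_query(norm_query)

def ask(query):
    return _cached(normalize(query))

ask("What is the return policy?")
ask("what is the  RETURN policy?")  # cache hit: pipeline is not re-run
```

Semantic caches (matching on embedding similarity rather than exact strings) catch more repeats but add their own lookup latency.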
What's the difference between RAG and knowledge graphs?
RAG retrieves unstructured text chunks. Knowledge graphs retrieve structured entities and relationships. Example: RAG retrieves paragraphs about “Apple Inc.” A knowledge graph retrieves (Apple, founded_by, Steve Jobs) and (Apple, headquarters, Cupertino). Some systems combine both.
Can I update RAG knowledge in real-time?
Yes. Add new documents to your vector database and they’re immediately searchable. No model retraining needed. This is RAG’s biggest advantage over fine-tuning.
How do I handle access controls in RAG?
Attach access control metadata to each chunk (department, role, user ID). Filter retrieval by the current user’s permissions. Example: Only retrieve documents tagged “Engineering” for engineers, “Sales” for sales team. Never return chunks the user isn’t authorized to see.
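A post-retrieval filter along these lines illustrates the rule; the tuple shape and names are assumptions. In production, push the filter into the vector-database query itself so unauthorized chunks are never retrieved at all.

```python
def visible_to(user_tags, hits, k=5):
    """hits: list of (text, score, access_tags). Keep only chunks the
    user holds at least one access tag for, then take the top k."""
    allowed = [h for h in hits if set(h[2]) & set(user_tags)]
    return allowed[:k]

hits = [
    ("Engineering runbook", 0.91, {"engineering"}),
    ("Sales playbook", 0.88, {"sales"}),
]
```

Filtering server-side also keeps `k` meaningful: with post-filtering, a user with narrow permissions can end up with fewer than `k` visible results.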
What's the best embedding model for RAG?
OpenAI's text-embedding-ada-002 is a strong general-purpose default for English. Cohere Embed supports 100+ languages. Open-source models (all-MiniLM, BGE) work for on-premise deployments. Domain-specific models (BioBERT for medical, LegalBERT for law) often outperform general models in specialized fields.
Can RAG work for customer-facing applications?
Yes, if retrieval quality is high and you add validation to prevent hallucinations. Customer-facing RAG needs reranking, hybrid search, confidence scoring, and extensive testing. Internal tools can tolerate occasional wrong answers. Customer-facing tools can’t.
How do I debug poor RAG retrieval?
Log retrieval scores for every query. Low scores mean poor semantic match. Check if the answer exists in your knowledge base. Review chunk boundaries (are you splitting context?). Try hybrid search or query transformation. Test different embedding models.
What's query transformation in RAG?
Rewriting vague queries into specific searches before retrieval. Example: “What’s the return thing?” becomes “What is the return policy for damaged items?” Improves retrieval accuracy for poorly-phrased questions.
Can I combine RAG with SQL databases?
Yes. Use RAG for unstructured knowledge (documents, FAQs) and SQL for structured data (customer records, transactions). Route queries to the right system based on question type. Example: “What’s our revenue?” hits SQL. “What’s our return policy?” hits RAG.
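A naive router can be a keyword heuristic like the sketch below; the keyword list is illustrative, and production routers often use an LLM classifier instead.

```python
# Route structured/analytics questions to SQL, everything else to RAG.
SQL_HINTS = ("revenue", "how many", "total", "average", "count")

def route(question):
    q = question.lower()
    return "sql" if any(hint in q for hint in SQL_HINTS) else "rag"
```

Whichever router you use, log its decisions: misrouted queries ("What's our return policy?" hitting SQL) are a common silent failure mode.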
How do I monitor RAG performance in production?
Log every query, retrieval scores, source documents, LLM response, and user feedback. Track response time (retrieval + generation), success rate (user thumbs up/down), and cost per query. Monitor failed queries (low retrieval scores, negative feedback) and iterate weekly.
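A per-query log record might look like this sketch; the field names are conventions rather than a required schema, and `hits` is assumed to be `(chunk_text, score)` pairs.

```python
import json
import time

def log_query(query, hits, answer, latency_s, feedback=None):
    """Serialize one query's full trace as a JSON line for later analysis."""
    record = {
        "ts": time.time(),
        "query": query,
        "top_score": max((score for _, score in hits), default=0.0),
        "num_sources": len(hits),
        "answer": answer,
        "latency_s": round(latency_s, 3),
        "feedback": feedback,  # thumbs up/down, attached later if available
    }
    return json.dumps(record)

line = log_query("What is the return policy?",
                 [("Refunds within 14 days of purchase.", 0.87)],
                 "Refunds are issued within 14 days.", 1.25)
```

JSON lines like these feed directly into the weekly iteration loop: sort by `top_score` ascending or filter `feedback == "down"` to find the queries to fix first.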
What's the ROI of deploying RAG?
Depends on the use case. Customer support: 60-80% faster response times, 40% fewer escalations. Document review: 70% time savings. Knowledge management: 35% reduction in search time. Calculate hours saved × hourly cost to get monthly ROI.
Can RAG replace Google search for internal documents?
Yes. That’s one of the most common enterprise use cases. RAG provides conversational answers, not just document links. Employees ask questions in natural language and get direct answers with source citations.
How do I transition from RAG pilot to production?
Add reranking and hybrid search for better accuracy. Implement access controls and metadata filtering. Set up observability and logging. Load test at production scale. Integrate with existing auth systems. Monitor costs and optimize. Fast pilots ship in one week. Production hardening takes 2-4 weeks.
What happens when RAG can't find the answer?
Return “I don’t have that information in the provided documents” instead of hallucinating. Log the failed query so you can identify missing knowledge and add it to your database. Production systems should say “I don’t know” rather than make up an answer.