RAG System
Quick Answer: A RAG (Retrieval-Augmented Generation) system combines large language models with external knowledge retrieval to generate accurate, domain-specific responses grounded in private data.
What Is a RAG System?
RAG (Retrieval-Augmented Generation) is an architecture pattern that connects large language models to your private data without retraining the model.
Here’s the problem RAG solves: LLMs are smart, but they only know what they were trained on. Ask ChatGPT about your company’s internal policies, customer support history, or proprietary research, and it draws a blank. You can’t fine-tune a model every time your data changes. That’s expensive, slow, and breaks the moment you add new information.
RAG fixes this. Instead of stuffing everything into the model’s training data, you store knowledge in a searchable database. When someone asks a question, the system retrieves relevant information first, then feeds it to the LLM along with the query. The model generates an answer grounded in your actual data, not hallucinated from thin air.
Why RAG matters for enterprise: Most Fortune 500 companies won’t send proprietary data to external APIs for fine-tuning. RAG lets you keep data in your infrastructure while still getting LLM-powered insights. Your documents stay in your vector database. Your embeddings run in your environment. Zero data leakage.
The pattern is simple: Retrieve the right context, augment the prompt with that context, generate a response. That’s it. But the implementation details separate working pilots from production systems that actually save money.
How RAG Works: The 3-Step Process
Every RAG system follows the same basic flow, whether you’re building customer support agents or internal knowledge retrieval.
Step 1: Retrieve
When a user asks a question, you don’t send it straight to the LLM. You search your knowledge base first.
The process:
- Convert the user’s question into a vector embedding (a numerical representation)
- Search your vector database for similar embeddings (semantic search)
- Rank results by relevance score
- Pull the top N most relevant chunks (usually 3-10)
Example: User asks “What’s our return policy for damaged items?” The system searches embeddings of your knowledge base and retrieves the 5 most relevant policy documents about returns, damage claims, and refund procedures.
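The retrieve step above can be sketched in a few lines of plain Python. This is a toy linear scan with 3-dimensional vectors standing in for real embeddings; a production vector database does the same math over 1,000+ dimensions with an approximate index (HNSW) instead of a full scan:

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of magnitudes: 1.0 = same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve_top_k(query_embedding, indexed_chunks, k=3):
    # indexed_chunks: list of (chunk_text, embedding) pairs.
    scored = [(cosine_similarity(query_embedding, emb), text)
              for text, emb in indexed_chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

Same idea, different scale: the vector database exists so this ranking stays fast past millions of chunks.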
What makes retrieval hard: Not all chunks are created equal. Split a 500-word document into 100-word chunks and you can sever critical context across chunk boundaries. Retrieval quality depends on how you chunk, what metadata you attach, and whether you rerank results before sending them to the LLM.
Step 2: Augment
Take the retrieved documents and stuff them into the LLM’s context window along with the original question.
The prompt structure:
```
Context: [Retrieved documents 1-5]

Question: What's our return policy for damaged items?

Instructions: Answer based only on the provided context. If the answer isn't in the context, say so.
```
You’re not changing the model. You’re changing the input. The LLM sees both the question and the supporting evidence at the same time.
Why this works: LLMs are excellent at reasoning over provided information. They struggle when they have to recall training data from months ago. RAG gives them fresh, specific context for every query.
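The augment step is just string assembly. A minimal sketch (the `[1]`, `[2]` chunk-numbering format is an arbitrary choice for citation, not a standard):

```python
def build_prompt(question, retrieved_chunks):
    # Number each chunk so the model can cite sources as [1], [2], ...
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Instructions: Answer based only on the provided context. "
        "If the answer isn't in the context, say so."
    )
```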
Step 3: Generate
The LLM reads the retrieved context and generates a response. Because the answer is grounded in real documents, hallucination rates drop dramatically.
What you get:
- Answers based on your actual data, not the model’s training set
- Source citations showing which documents informed the response
- The ability to update knowledge without retraining (just update the vector database)
Production detail: Most enterprise RAG systems include a confidence score with each response. If retrieval scores are low, the system says “I don’t have enough information” instead of making something up. That’s the difference between a demo and a system you can trust with customer-facing queries.
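The refusal logic described above reduces to a threshold check before generation. A sketch, assuming your retriever exposes similarity scores (the 0.75 cutoff is illustrative; tune it against your own score distribution):

```python
REFUSAL = "I don't have enough information to answer that."

def answer_or_refuse(retrieval_scores, generate_answer, threshold=0.75):
    # If even the best-matching chunk scored poorly, refuse rather than
    # let the LLM improvise from weak context.
    if not retrieval_scores or max(retrieval_scores) < threshold:
        return REFUSAL
    return generate_answer()
```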
Why AI Teams Need RAG: The Business Case
If you’re building AI agents that interact with company-specific knowledge, you need RAG. Here’s why.
Data Sovereignty
Fine-tuning sends your data to third-party APIs. RAG keeps it in your infrastructure.
For Fortune 500 companies with compliance requirements, this isn’t optional. You can’t send customer PII, financial records, or proprietary research to OpenAI’s servers for model training. But you can store embeddings in your own vector database and query them locally.
The deployment pattern: Run embeddings on-premise or in your VPC. Store vectors in Qdrant, Weaviate, or Postgres with pgvector. Query LLMs via API, but never send raw documents through the wire. Only send the question and retrieved chunks (which you control).
Speed to Production
Fine-tuning a model takes weeks. Training from scratch takes months. RAG deployments can go live in days.
Real-world timeline: Working RAG pilot in one week or less for most use cases. Chunk your documents, generate embeddings, set up vector search, wire it to an LLM. Done. Production hardening (reranking, metadata filtering, hybrid search) takes another 2-4 weeks depending on data complexity.
Compare that to fine-tuning, where you need labeled data, multiple training runs, evaluation datasets, and model deployment infrastructure. By the time a fine-tuned model is ready, your data has changed and you’re starting over.
Cost Efficiency
RAG is cheaper than fine-tuning at enterprise scale.
The math: Fine-tuning costs $10K-50K+ per model depending on data size and iteration cycles. RAG uses pre-trained models with retrieval overhead. Embedding generation is a one-time cost. Vector search is fast and cheap. LLM inference costs are the same whether you use RAG or fine-tuning.
Where RAG wins: You update knowledge by adding documents to your database, not by retraining models. That’s a database insert, not a GPU cluster.
Measurable ROI
RAG ties directly to hero metrics that move P&L.
Use cases with clear ROI:
- Customer support: 60-80% reduction in response time, 40% reduction in support tickets escalated to humans
- Internal knowledge retrieval: 35% reduction in time spent searching documentation
- Research analysis: 70% faster document review for due diligence, compliance, legal discovery
- Data processing: 50% reduction in manual data entry and classification tasks
These aren’t vanity metrics. They’re hours saved and costs reduced. That’s the AI with ROI promise.
RAG Architecture Patterns: Basic to Advanced
Not all RAG systems are built the same. Here’s how the architecture evolves from prototype to production.
Basic RAG: The Starting Point
Components:
- Document chunker (split text into 500-1000 word segments)
- Embedding model (text-embedding-ada-002 or open-source alternatives)
- Vector database (Qdrant, Weaviate, Pinecone, pgvector)
- LLM (GPT-4, Claude, or open-source models)
Flow:
- Chunk documents and generate embeddings
- Store embeddings in vector database with metadata
- On query: embed question, retrieve top-k chunks, send to LLM
- Return LLM response with source citations
When this works: Small knowledge bases (under 10K documents), low query volume, internal tools where occasional wrong answers aren’t critical.
When this breaks: Large document sets where simple semantic search returns irrelevant chunks. High-stakes use cases where accuracy matters. Complex queries that require reasoning across multiple documents.
Advanced RAG: Production-Grade
Additional components:
- Hybrid search (combine vector similarity with keyword matching)
- Reranking models (Cohere Rerank, cross-encoders)
- Metadata filtering (date ranges, document types, access controls)
- Query transformation (rewrite vague questions into specific searches)
- Answer validation (check if retrieved context actually supports the answer)
Flow:
- Transform user query for better retrieval
- Run hybrid search (vector + keyword)
- Filter by metadata (user permissions, date relevance)
- Rerank top 50 results to find best 5
- Validate retrieved chunks contain answer-relevant information
- Generate response with confidence scoring
- Log query, retrieval scores, and sources for observability
When you need this: Customer-facing applications, compliance-heavy industries, high-volume query loads, multi-tenant systems with access controls.
The difference: Basic RAG works 70% of the time. Advanced RAG works 95% of the time. That 25-point gap is the difference between a demo and a production system.
Hybrid RAG: Combining Approaches
Sometimes retrieval alone isn’t enough. Hybrid systems combine RAG with other techniques.
Common hybrid patterns:
- RAG + Fine-tuning: Fine-tune for domain-specific language, use RAG for up-to-date facts
- RAG + Prompt Engineering: Structured prompts guide LLM reasoning over retrieved content
- RAG + Knowledge Graphs: Retrieve entities and relationships, not just text chunks
- RAG + SQL: Query structured databases for precise data, use RAG for unstructured knowledge
Example deployment: A financial services agent uses RAG for policy documents, SQL queries for account data, and fine-tuning for industry-specific terminology. Each pattern handles what it does best.
Production reality: Most enterprise agents aren’t pure RAG. They’re orchestration layers that route queries to the right retrieval mechanism based on question type.
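That orchestration layer can start as crude keyword dispatch. A toy sketch for the financial-services example above (the keywords are illustrative; real routers use an LLM classifier or intent model):

```python
def route_query(query):
    # Route to SQL for precise account data, RAG for unstructured knowledge.
    q = query.lower()
    if any(term in q for term in ("balance", "transaction", "account number")):
        return "sql"
    if any(term in q for term in ("policy", "procedure", "terms")):
        return "rag"
    return "rag"  # default: unstructured knowledge retrieval
```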
How to Implement RAG: Step-by-Step
Here’s the methodology used in fast deployments. No theory, just what works.
Step 1: Pick Your Embedding Model
Options:
- OpenAI text-embedding-ada-002: Best accuracy, $0.0001 per 1K tokens, closed-source
- sentence-transformers/all-MiniLM-L6-v2: Open-source, fast, runs locally
- Cohere Embed: Multilingual support, strong performance
- voyage-ai: Optimized for retrieval tasks
Decision factors: Cost, latency, language support, data sovereignty requirements.
For most enterprise deployments: Start with OpenAI embeddings if you can send data to APIs. Switch to open-source models (all-MiniLM, BGE) if you need on-premise deployment.
Step 2: Chunk Your Documents
Chunking strategies:
- Fixed-size chunks: Split every 500 words (simple, fast, loses context at boundaries)
- Semantic chunking: Split by paragraphs or sections (preserves meaning, variable size)
- Sliding window: Overlapping chunks to preserve context across splits
- Recursive chunking: Split by headers, then paragraphs, then sentences until target size
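The sliding-window strategy from the list above fits in a few lines of plain Python (word-based for clarity; production splitters usually work on characters or tokens):

```python
def sliding_window_chunks(words, chunk_size=500, overlap=100):
    # Each chunk repeats the last `overlap` words of the previous one,
    # so context survives the split boundary.
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks
```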
Production pattern:
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # Target chunk size
    chunk_overlap=200,    # Overlap to preserve context
    separators=["\n\n", "\n", ". ", " "]  # Split hierarchy
)
chunks = splitter.split_documents(documents)
```
Metadata to attach:
- Document title and source URL
- Creation/update timestamps
- Author or department
- Document type (policy, FAQ, technical doc)
- Access control tags
Why metadata matters: You can filter retrieval by date (“show me policies updated in 2024”) or by permission level (“only show documents this user can access”).
Step 3: Generate and Store Embeddings
Code example with LangChain:
```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant

embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vectorstore = Qdrant.from_documents(
    documents=chunks,
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name="company_knowledge"
)
```
What’s happening:
- Each chunk gets converted to a 1536-dimension vector
- Vectors get stored in Qdrant with metadata
- Qdrant builds an HNSW index for fast similarity search
Production detail: Batch your embedding calls (1000 chunks at a time) to avoid rate limits. Monitor embedding costs (large document sets can run $500+ in embedding fees).
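The batching pattern is simple. A sketch where `embed_batch` is a stand-in for whatever batch call your embedding client exposes:

```python
def batched(items, batch_size=1000):
    # Yield fixed-size batches so each API call stays under rate limits.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_all(chunks, embed_batch, batch_size=1000):
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

Add retry-with-backoff around the `embed_batch` call for production runs; transient rate-limit errors are the norm, not the exception.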
Step 4: Build the Retrieval Pipeline
Basic retrieval:
```python
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}  # Retrieve top 5 chunks
)
docs = retriever.get_relevant_documents("What's our return policy?")
```
Advanced retrieval with reranking:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

compressor = CohereRerank(model="rerank-english-v2.0", top_n=5)
retriever = ContextualCompressionRetriever(
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 20}),
    base_compressor=compressor
)
docs = retriever.get_relevant_documents("What's our return policy?")
```
The difference: Basic retrieval pulls the top 5 semantically similar chunks. Reranking pulls 20 candidates, then uses a cross-encoder to find the 5 most relevant to the actual question. Accuracy jumps 15-25% with reranking.
Step 5: Connect Retrieval to LLM
RAG chain with LangChain:
```python
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(model="gpt-4", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": custom_prompt_template}
)

result = qa_chain({"query": "What's our return policy for damaged items?"})
print(result["result"])            # LLM-generated answer
print(result["source_documents"])  # Retrieved chunks used
```
Custom prompt template:
```python
from langchain.prompts import PromptTemplate

template = """You are a helpful assistant answering questions about company policies.
Use the following context to answer the question. If the answer isn't in the context, say "I don't have that information in the provided documents."

Context:
{context}

Question: {question}

Answer:"""

custom_prompt_template = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)
```
Why the custom prompt: Prevents hallucination by explicitly telling the model to only use provided context. Production systems add confidence scoring and source citation requirements.
Step 6: Add Observability and Iteration
What to log:
- User query
- Retrieval scores for top chunks
- LLM response
- Source documents used
- Response time (retrieval + LLM generation)
- User feedback (thumbs up/down)
Why this matters: You can’t improve what you don’t measure. Low retrieval scores mean your chunking strategy is wrong. High retrieval scores but wrong answers mean your prompt needs work. User feedback tells you which queries need better sources.
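A minimal log record covering the fields above, written as one JSON object per query via a pluggable `log_fn` (a stand-in for stdout, a file, or your observability stack):

```python
import json
import time

def log_rag_query(query, retrieval_scores, answer, sources,
                  started_at, log_fn=print):
    record = {
        "query": query,
        "retrieval_scores": retrieval_scores,
        "answer": answer,
        "sources": sources,
        "latency_ms": round((time.time() - started_at) * 1000, 1),
    }
    log_fn(json.dumps(record))  # one JSON line per query, easy to grep and aggregate
    return record
```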
Iteration loop:
- Deploy RAG system
- Monitor failed queries (low retrieval scores or negative feedback)
- Identify patterns (missing documents, bad chunking, unclear queries)
- Add missing knowledge or adjust chunking strategy
- Measure improvement, repeat
Fast deployments iterate weekly. Slow deployments wait months between improvements. Speed compounds.
RAG vs. Alternatives: When to Use What
RAG isn’t always the right choice. Here’s when to use it and when to pick something else.
| Approach | Best For | Time to Deploy | Cost | Data Freshness |
|---|---|---|---|---|
| RAG | Domain-specific knowledge, frequently updated data | 1 week pilot, 2-4 weeks production | Low (embeddings + vector DB) | Real-time (update DB anytime) |
| Fine-tuning | Domain-specific language, style consistency | 3-6 weeks | Medium-High ($10K-50K per iteration) | Static (retrain to update) |
| Prompt Engineering | Task formatting, output structure | Hours to days | None (base model only) | Real-time (change prompt anytime) |
| Context Injection | Small, static knowledge (under 10K tokens) | Minutes | None | Real-time (update prompt) |
RAG vs. Fine-Tuning
Use RAG when:
- Knowledge changes frequently (product docs, policies, support tickets)
- You need to cite sources for answers
- Data privacy requires on-premise deployment
- You have large knowledge bases (100K+ documents)
Use Fine-tuning when:
- You need domain-specific language (medical, legal, technical jargon)
- Knowledge is stable and doesn’t change often
- You need consistent output formatting
- Retrieval overhead is too slow for your use case
Real-world pattern: Combine them. Fine-tune for industry language, use RAG for up-to-date facts.
Example: A legal AI agent is fine-tuned on legal writing style but uses RAG to retrieve case law and statutes. The model sounds like a lawyer, but the facts come from your legal database.
RAG vs. Prompt Engineering
Use RAG when:
- Information doesn’t fit in the context window
- Knowledge is too large to include in every prompt
- Facts need to be verifiable with source citations
Use Prompt Engineering when:
- You’re formatting output (JSON, structured data)
- You’re guiding reasoning (chain-of-thought, few-shot examples)
- Information is small and static
Real-world pattern: Use both. Prompt engineering structures the output, RAG provides the facts.
Example: A customer support agent retrieves product documentation (RAG), then uses a structured prompt to format the response as a step-by-step guide.
RAG vs. Context Injection
Context injection: Stuffing all your knowledge into the system prompt.
When it works: Small knowledge bases under 10K tokens, static information that rarely changes.
When it breaks: Context windows fill up fast. GPT-4's 128K-token context holds roughly 250 pages of typical text. If your knowledge base is 500 pages, context injection won't work. Plus, you pay for every token in the context on every query. RAG only pays for the retrieved chunks.
The decision: If your knowledge fits comfortably in the context window and never changes, skip RAG and just include it in the prompt. If it’s large or dynamic, RAG is cheaper and more maintainable.
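The cost difference is straightforward arithmetic. A sketch with illustrative numbers only (the $0.01-per-1K-token price is a placeholder, not any provider's actual rate):

```python
def monthly_input_cost(tokens_per_query, queries_per_day, price_per_1k_tokens=0.01):
    # Input-token spend for 30 days of traffic.
    return tokens_per_query / 1000 * price_per_1k_tokens * queries_per_day * 30

# Context injection: ship the full 100K-token knowledge base on every query.
stuff_everything = monthly_input_cost(100_000, queries_per_day=1_000)

# RAG: ship only ~2K tokens of retrieved chunks per query.
rag = monthly_input_cost(2_000, queries_per_day=1_000)
```

At these assumed numbers, context injection costs 50x more per month than retrieval; the ratio holds regardless of the actual per-token price.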
Real-World RAG Examples: What Actually Works
Here’s what production RAG systems look like across industries. These aren’t hypotheticals. These are deployment patterns that drive measurable ROI.
Customer Support: 60% Faster Response Times
The use case: A Fortune 500 retailer with 10K+ support articles, product documentation, and policy guides. Agents spend 40% of their time searching for answers.
RAG deployment:
- Embedded all support docs, FAQs, and product manuals
- Built hybrid search (vector + keyword) to handle specific product SKUs
- Added metadata filtering by product category and date
- Deployed as Slack bot for internal agents and web widget for customers
Results:
- 60% reduction in average response time (from 8 minutes to 3 minutes)
- 40% of queries resolved without human escalation
- $47K monthly savings from reduced support tickets
What made it work: Metadata filtering. Agents could search by product line (“show me return policies for electronics”) or by recency (“policies updated in last 30 days”). Generic semantic search wasn’t enough.
Legal Document Review: 70% Faster Due Diligence
The use case: Law firm conducting due diligence on thousands of contracts. Manual review takes weeks per engagement.
RAG deployment:
- Chunked contracts by clause type (payment terms, liability, termination)
- Used LegalBERT embeddings (fine-tuned for legal language)
- Built clause extraction and comparison workflows
- Added LLM reasoning over retrieved clauses for risk assessment
Results:
- 70% reduction in contract review time
- Automatic flagging of non-standard clauses
- Consistent risk scoring across all contracts
What made it work: Semantic chunking by clause. Fixed-size chunks cut critical legal language mid-sentence. Clause-level chunking preserved meaning and made retrieval far more accurate.
Medical Research: 80% Reduction in Literature Review Time
The use case: Pharmaceutical company researching drug interactions. Researchers manually review hundreds of papers per project.
RAG deployment:
- Embedded PubMed abstracts, clinical trial results, and internal research
- Used biomedical embeddings (BioBERT)
- Added metadata filtering by publication date, study type, sample size
- Deployed as research assistant with source citations for every claim
Results:
- 80% reduction in literature review time
- Automatic identification of conflicting study results
- Source citations ensure all claims are traceable to published research
What made it work: Domain-specific embeddings. Generic OpenAI embeddings don’t understand medical terminology. BioBERT embeddings capture relationships between diseases, drugs, and treatments that general models miss.
Financial Analysis: 50% Faster Earnings Report Processing
The use case: Investment firm analyzes earnings reports, SEC filings, and analyst calls for thousands of companies.
RAG deployment:
- Embedded quarterly earnings transcripts and 10-K/10-Q filings
- Built time-series retrieval (compare current quarter to historical performance)
- Added structured data extraction (revenue, EPS, guidance) alongside unstructured retrieval
- Deployed as internal analyst tool with automatic report generation
Results:
- 50% reduction in time spent processing earnings reports
- Automatic alerts when metrics deviate from historical trends
- Faster identification of investment opportunities
What made it work: Hybrid retrieval. Structured financial data (revenue numbers) came from SQL queries. Unstructured insights (management commentary) came from RAG. Combining both gave analysts the full picture.
Internal Knowledge Management: 35% Reduction in Search Time
The use case: Enterprise with 50K+ internal documents spread across SharePoint, Confluence, Google Drive, and email.
RAG deployment:
- Unified search across all knowledge sources
- Embedded wikis, meeting notes, design docs, onboarding materials
- Added access control filtering (only show documents user has permission to see)
- Deployed as enterprise search + Slack/Teams integration
Results:
- 35% reduction in time employees spend searching for information
- 50% reduction in duplicate documentation (RAG surfaces existing docs)
- Onboarding time cut from 4 weeks to 2 weeks
What made it work: Access control. A unified search is useless if it surfaces documents users can’t access. Metadata tagging with permission levels ensured retrieval respected existing access policies.
Deploy RAG in Under a Week with TMA
Most teams spend 3-6 months on RAG deployments. Fast deployments can be done in one week or less for most use cases.
Here’s the methodology.
Day 1: Define the Hero Metric
What are we measuring? Time saved, cost reduced, tickets deflected, documents processed?
If you can’t tie RAG to dollars saved or earned, don’t build it. Pick a use case with clear ROI. Customer support, document processing, internal knowledge retrieval. All have measurable outcomes.
Day 2-3: Data Preparation
Collect your documents, clean the data, chunk them, generate embeddings. This is 60% of the work.
Most deployments fail here because they underestimate data quality issues. PDFs with broken formatting, scanned documents without OCR, inconsistent metadata. Clean your data first or retrieval will be garbage.
Day 4: Build the Retrieval Pipeline
Set up your vector database, wire it to an embedding model, test retrieval quality. Query your knowledge base with real questions. Check if the top 5 results actually contain the answer.
If retrieval is wrong, adjust chunking or try hybrid search. Don’t move to LLM integration until retrieval works.
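Retrieval quality is easy to score before any LLM is involved. A sketch of a hit-rate check over a hand-written eval set (`retrieve` is a stand-in for your retriever; a "hit" means some top-k chunk contains the expected answer text):

```python
def retrieval_hit_rate(eval_set, retrieve, k=5):
    # eval_set: list of (question, expected_answer_substring) pairs.
    hits = 0
    for question, expected in eval_set:
        top_chunks = retrieve(question)[:k]
        if any(expected.lower() in chunk.lower() for chunk in top_chunks):
            hits += 1
    return hits / len(eval_set)
```

If the hit rate is low on questions you know the knowledge base covers, fix chunking or search before touching the LLM.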
Day 5: Connect to LLM and Test
Wire your retriever to the LLM, build the prompt template, run test queries. Check for hallucinations. Verify source citations.
Test edge cases. What happens when the answer isn’t in the knowledge base? Does it say “I don’t know” or make something up?
Day 6-7: Deploy Pilot
Ship it. Internal Slack bot, API endpoint for your customer support tool, web interface for employees. Start small. Measure results. Iterate.
Fast pilots beat slow perfection. Get feedback from real users, fix the obvious problems, deploy version 2.
Production hardening (weeks 2-4):
- Add reranking for better retrieval accuracy
- Implement hybrid search (vector + keyword)
- Add observability and logging
- Scale infrastructure for production load
- Integrate with existing auth and access controls
TMA’s differentiator: We’ve done this before. The methodology is proven. While others are scheduling discovery meetings, we’re processing real queries with real data. That’s the speed advantage.
What Goes Wrong with RAG: TMA Honesty
Here’s what breaks in production and how to fix it.
Chunking Mistakes: Context Loss
The problem: Split a 10-page document every 500 words and you’ll cut critical context in half. A sentence about “the policy applies to all full-time employees” ends up in one chunk while the actual policy details land in the next chunk.
What breaks: Retrieval pulls the policy details but misses the scope limitation. The LLM answers as if the policy applies to everyone.
The fix: Semantic chunking. Split by headers, paragraphs, or natural section breaks. Add overlapping chunks so context bleeds across boundaries. Test retrieval with queries that require multi-paragraph reasoning.
Production pattern: Use RecursiveCharacterTextSplitter with overlap, or build custom chunking logic that respects document structure.
Metadata Failures: Wrong Documents Retrieved
The problem: You retrieve a policy from 2019 when the user needs the updated 2024 version. Or you surface a document the user doesn’t have permission to access.
What breaks: Answers are technically correct but outdated or unauthorized. Compliance nightmare.
The fix: Attach metadata to every chunk (date, version, access level, document type). Filter retrieval by metadata before sending results to the LLM.
Production pattern:
```python
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 10,
        "filter": {
            "date": {"$gte": "2024-01-01"},
            "access_level": {"$in": user.permissions}
        }
    }
)
```
Reranking Gaps: Semantic Search Isn’t Enough
The problem: Vector similarity retrieves documents with similar words but different meanings. Query for “Python package management” and get results about shipping packages.
What breaks: Top 5 results look relevant to the embedding model but don’t answer the question.
The fix: Add a reranking step. Pull 20-50 candidates with vector search, then use a cross-encoder (Cohere Rerank, cross-encoder/ms-marco-MiniLM) to rerank by actual relevance to the query.
Production pattern: Retrieval recall goes from 70% to 90%+ with reranking. It’s not optional for high-stakes use cases.
Hallucination Despite RAG
The problem: Even with retrieved context, the LLM makes up details that aren’t in the source documents.
What breaks: User trust. One hallucinated fact undermines credibility of the entire system.
The fix: Prompt engineering and validation. Explicitly tell the model to only use provided context. Add a validation step that checks if the LLM’s answer is supported by the retrieved chunks. If validation fails, return “I don’t have enough information” instead of the generated answer.
Production pattern:
```python
prompt = """Use ONLY the following context to answer the question.

If the answer isn't in the context, respond with:
"I don't have that information in the provided documents."

Do not make up information. Do not use your training data.
Only use the context below.

Context: {context}

Question: {question}

Answer:"""
```
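The validation step can start as a crude lexical-overlap check before graduating to an NLI model or a second LLM call for entailment. A sketch (the 0.5 threshold and the length-3 word filter are illustrative knobs):

```python
def answer_is_grounded(answer, retrieved_chunks, min_overlap=0.5):
    # Fraction of the answer's content words that appear in the context.
    # Crude, but it catches answers invented from whole cloth.
    content_words = {w.strip(".,") for w in answer.lower().split() if len(w) > 3}
    if not content_words:
        return True
    context = " ".join(retrieved_chunks).lower()
    supported = sum(1 for w in content_words if w in context)
    return supported / len(content_words) >= min_overlap
```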
Query Transformation Gaps
The problem: Users ask vague questions like “What’s the return thing?” The embedding model doesn’t know what “return thing” means. Retrieval fails.
What breaks: Bad questions get bad answers, even if the knowledge exists in your database.
The fix: Query transformation. Rewrite vague queries into specific searches before retrieval.
Production pattern: Use an LLM to expand “What’s the return thing?” into “What is the return policy for damaged or defective items?” Then embed the expanded query.
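A sketch of that expansion step, where `llm_rewrite` stands in for a call to your LLM client (the 6-word specificity heuristic is illustrative; it just avoids paying rewrite latency on queries that are already precise):

```python
REWRITE_PROMPT = (
    "Rewrite this user question as a specific, self-contained "
    "search query, keeping its original intent:\n\n{question}"
)

def transform_query(raw_query, llm_rewrite, min_words=6):
    # Short, vague queries get expanded; already-specific queries
    # skip the extra LLM call.
    if len(raw_query.split()) >= min_words:
        return raw_query
    return llm_rewrite(REWRITE_PROMPT.format(question=raw_query))
```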
Scale and Cost Surprises
The problem: Embedding 1M documents costs $100 in API fees. Retrieval adds 100ms latency per query. LLM inference costs stack up at high volume.
What breaks: Proof of concept works great at 100 queries/day. Production deployment at 10K queries/day costs $5K/month and has 2-second response times.
The fix: Batch embedding generation. Use open-source embedding models for cost-sensitive deployments. Cache frequent queries. Optimize LLM usage (smaller models for simple queries, GPT-4 only when needed).
Production pattern: Monitor costs from day one. Most RAG deployments don’t fail technically, they fail economically because no one tracked spend.
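Caching frequent queries is a one-function change. A sketch using `functools.lru_cache` keyed on the normalized query text (exact-match caching only; semantic caching over embeddings is a further step):

```python
from functools import lru_cache

def make_cached_answerer(answer_fn, maxsize=1024):
    @lru_cache(maxsize=maxsize)
    def cached(normalized_query):
        return answer_fn(normalized_query)

    def answer(query):
        # Normalize so "Return Policy?" and "return policy?" share a cache slot.
        return cached(" ".join(query.lower().split()))

    return answer
```

Invalidate the cache when the knowledge base changes, or cached answers go stale the same way fine-tuned weights do.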
Master RAG with Agent Guild
Want to build production RAG systems that actually ship? Join the Agent Guild.
What you get:
- Real-world RAG deployment case studies
- Production-grade architecture patterns
- Access to engineers who’ve shipped 100+ RAG agents
- Weekly build sessions and code reviews
- Shared cost, shared upside on joint ventures
The model: You bring domain expertise and distribution. We bring the AI engineering muscle. We co-build RAG systems for your industry, share costs, share profits. You’re not hiring a dev agency. You’re partnering with builders who’ve done this before.
Who this is for:
- Domain experts with distribution (compliance, legal, medical, finance)
- A-player engineers who want to ship agents full-time
- Founders ready to build AI products, not AI demos
This isn’t a course. It’s a community of builders who ship production agents measured by ROI, not vibes.
RAG Implementation Code: LangChain and LlamaIndex
Here’s working code you can deploy today.
LangChain RAG Pipeline
Full implementation:
```python
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# 1. Load documents
loader = DirectoryLoader(
    './data',
    glob="**/*.txt",
    loader_cls=TextLoader
)
documents = loader.load()

# 2. Chunk documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " "]
)
chunks = text_splitter.split_documents(documents)

# 3. Generate embeddings and store in vector database
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
vectorstore = Qdrant.from_documents(
    documents=chunks,
    embedding=embeddings,
    url="http://localhost:6333",
    collection_name="knowledge_base"
)

# 4. Set up retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 5}
)

# 5. Create custom prompt
template = """You are a helpful assistant. Answer the question using only the context provided.

Context: {context}

Question: {question}

If the answer isn't in the context, say: "I don't have that information."

Answer:"""
prompt = PromptTemplate(
    template=template,
    input_variables=["context", "question"]
)

# 6. Build RAG chain
llm = ChatOpenAI(model="gpt-4", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

# 7. Query the system
result = qa_chain({"query": "What is the return policy?"})
print(f"Answer: {result['result']}")
print(f"Sources: {[doc.metadata for doc in result['source_documents']]}")
```
LlamaIndex RAG Pipeline
Alternative implementation:
```python
import qdrant_client
from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    ServiceContext,
    StorageContext,
)
from llama_index.embeddings import OpenAIEmbedding
from llama_index.llms import OpenAI
from llama_index.vector_stores import QdrantVectorStore

# 1. Load documents
documents = SimpleDirectoryReader('./data').load_data()

# 2. Set up Qdrant vector store
client = qdrant_client.QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(
    client=client,
    collection_name="knowledge_base"
)

# 3. Configure service context
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
llm = OpenAI(model="gpt-4", temperature=0)
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    chunk_size=1000,
    chunk_overlap=200
)

# 4. Build index
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context,
    storage_context=storage_context
)

# 5. Create query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact"
)

# 6. Query the system
response = query_engine.query("What is the return policy?")
print(f"Answer: {response}")
print(f"Sources: {response.source_nodes}")
```
Production Enhancements
Add reranking with Cohere:
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank

# Wrap your base retriever with reranking: fetch 20 candidates fast,
# then let the reranker pick the best 5
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 20})
compressor = CohereRerank(model="rerank-english-v2.0", top_n=5)
retriever = ContextualCompressionRetriever(
    base_retriever=base_retriever,
    base_compressor=compressor,
)
```
Add metadata filtering:
```python
# Filter by date and document type
retriever = vectorstore.as_retriever(
    search_kwargs={
        "k": 5,
        "filter": {
            "date": {"$gte": "2024-01-01"},
            "type": "policy_document",
        },
    }
)
```
Add hybrid search (vector + keyword):
```python
from langchain.retrievers import BM25Retriever, EnsembleRetriever

# Combine vector search with keyword search
vector_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
keyword_retriever = BM25Retriever.from_documents(chunks, k=5)
ensemble_retriever = EnsembleRetriever(
    retrievers=[vector_retriever, keyword_retriever],
    weights=[0.7, 0.3],  # 70% vector, 30% keyword
)
```
Frequently Asked Questions
What is a RAG system?
A RAG (Retrieval-Augmented Generation) system combines large language models with external knowledge retrieval. Instead of relying only on training data, RAG systems search a knowledge base for relevant information, then use that context to generate accurate, grounded responses.
How does RAG differ from fine-tuning?
RAG retrieves information at query time from a knowledge base. Fine-tuning retrains the model on new data. RAG is faster (days vs. weeks), cheaper (no GPU training costs), and handles dynamic data better (update the database, not the model). Fine-tuning is better for domain-specific language and style consistency.
When should I use RAG instead of prompt engineering?
Use RAG when your knowledge is too large to fit in the context window or changes frequently. Use prompt engineering when you’re formatting output or guiding reasoning. Most production systems use both: RAG provides the facts, prompt engineering structures the response.
What vector databases work best for RAG?
Qdrant, Weaviate, Pinecone, and pgvector (Postgres extension) are the most common. Qdrant and Weaviate offer the best performance for production deployments. Pinecone is fully managed but cloud-only. Pgvector works if you want vector search in your existing Postgres database.
How long does it take to deploy a RAG system?
Working pilots can be done in one week or less for most use cases. Production hardening (reranking, hybrid search, observability) takes 2-4 weeks depending on data complexity. Full deployments with integrations, access controls, and scaling take 4-8 weeks.
What's the cost of running a RAG system?
Embedding generation: roughly $0.0001 per 1K tokens, paid once per document and re-run only when content changes. Vector database hosting: $50-500/month depending on scale. LLM inference: $0.01-0.03 per query (GPT-4). Total monthly cost for 10K queries/month: $200-500. Far cheaper than fine-tuning at $10K-50K per iteration.
Can RAG work with proprietary data?
Yes. That’s the primary use case. RAG keeps data in your infrastructure. You control the vector database, embeddings run on-premise or in your VPC, and only the query and retrieved chunks go to the LLM API (or you can run open-source LLMs locally).
How do I prevent hallucinations in RAG systems?
Prompt engineering: Explicitly tell the model to only use provided context. Retrieval quality: If retrieval is bad, the LLM has nothing accurate to work with. Validation: Check if the generated answer is supported by the retrieved documents. Confidence scoring: Return “I don’t have enough information” when retrieval scores are low.
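The confidence-scoring idea can be sketched in a few lines. This is a minimal illustration, assuming the retriever returns `(chunk_text, similarity_score)` pairs; the function names and the 0.75 cutoff are assumptions to tune, not part of any library.

```python
# Refuse to answer when retrieval is weak instead of hallucinating.
FALLBACK = "I don't have enough information to answer that."

def answer_with_confidence(hits, generate, min_score=0.75):
    """hits: list of (chunk_text, similarity_score) from the retriever.
    generate: callable that turns a context string into an answer."""
    if not hits or max(score for _, score in hits) < min_score:
        return FALLBACK  # best match is below threshold: say "I don't know"
    context = "\n\n".join(text for text, _ in hits)
    return generate(context)
```

In production, `generate` would be your LLM call with the grounded prompt; the threshold should be calibrated against logged retrieval scores for known-good and known-bad queries.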
What chunk size should I use for RAG?
500-1000 words is the standard starting point. Smaller chunks (200-300 words) work for FAQ-style content. Larger chunks (1500-2000 words) work for long-form documents where context matters. Use overlap (200 words) to preserve context across chunk boundaries. Test retrieval quality and adjust.
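A word-based sliding-window chunker with overlap can be sketched as follows; the sizes are the illustrative starting points from above, not fixed constants.

```python
def chunk_words(text, chunk_size=800, overlap=200):
    """Split text into word-based chunks of ~chunk_size words, with
    `overlap` words repeated across boundaries to preserve context."""
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covered the tail
    return chunks
```

Production splitters (e.g. recursive character splitters) also respect sentence and paragraph boundaries, which this word-count sketch ignores.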
How do I handle multi-turn conversations in RAG?
Track conversation history and include previous Q&A pairs in the context. Rewrite the current question to be standalone (expand pronouns, add context from previous turns). Example: “What about pricing?” becomes “What is the pricing for the product mentioned earlier?” before retrieval.
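The rewrite step can be sketched like this. The prompt wording, function names, and the `llm` callable are assumptions; in practice `llm` would call your chat model, and the stub below only demonstrates the wiring.

```python
# Rewrite the latest question into a standalone query before retrieval.
REWRITE_PROMPT = (
    "Given the conversation so far, rewrite the final question so it "
    "stands alone (expand pronouns, add missing context).\n\n"
    "History:\n{history}\n\nQuestion: {question}\n\nStandalone question:"
)

def make_standalone(history, question, llm):
    if not history:  # first turn needs no rewriting
        return question
    transcript = "\n".join(f"{role}: {text}" for role, text in history)
    return llm(REWRITE_PROMPT.format(history=transcript, question=question))

# Wiring demo with a stub in place of a real model call:
history = [("user", "Tell me about the Pro plan."),
           ("assistant", "The Pro plan costs $49/month.")]
rewritten = make_standalone(
    history, "What about pricing?",
    llm=lambda prompt: "What is the pricing for the Pro plan?",
)
```

The rewritten question, not the raw one, is what you embed and send to the vector database.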
What's the difference between RAG and semantic search?
Semantic search retrieves relevant documents. RAG retrieves documents AND generates a synthesized answer. Semantic search returns “here are 5 related articles.” RAG returns “based on these articles, here’s the answer to your question.”
Can I use RAG with open-source models?
Yes. Use open-source embedding models (sentence-transformers, BGE) and open-source LLMs (Llama, Mistral, Falcon). This eliminates API dependencies and keeps everything on-premise. Trade-off: Open-source models are less accurate than GPT-4, but good enough for many use cases.
How do I evaluate RAG quality?
Retrieval quality: Measure precision (are the top results relevant?) and recall (do the results contain the answer?). Generation quality: Human evaluation, automated scoring (BLEU, ROUGE), user feedback (thumbs up/down). Production metric: Does the answer solve the user’s problem?
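Precision and recall at k are simple to compute once you have labeled relevant chunks per test query; this sketch assumes document IDs as labels.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved chunks that are relevant."""
    top = retrieved_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / max(len(top), 1)

def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)
```

Run these over a held-out set of question/relevant-chunk pairs before and after changes like reranking or a new embedding model, so retrieval regressions show up as numbers rather than anecdotes.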
What's hybrid search in RAG?
Hybrid search combines vector similarity (semantic search) with keyword matching (BM25). Vector search handles “What’s the return policy?” Keyword search handles “Show me documents mentioning SKU-12345.” Combining both improves accuracy by 15-25% over vector search alone.
How does reranking improve RAG?
Reranking pulls 20-50 candidates with fast vector search, then uses a slower but more accurate cross-encoder to find the best 5. Think of it as a two-stage filter: fast-and-loose retrieval, then careful selection. Boosts accuracy without sacrificing speed.
Can RAG cite sources?
Yes. Most RAG implementations return the source documents used to generate the answer. You can display document titles, URLs, page numbers, or timestamps. Critical for enterprise use cases where answers need to be verifiable.
What metadata should I attach to chunks?
Document title and source URL, creation and update timestamps, author or department, document type (policy, FAQ, technical doc), access control tags (who can see this), version numbers. Metadata enables filtering (“show me policies updated in 2024”) and access control.
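Concretely, a chunk's payload might look like the following. The field names are conventions, not a schema any library requires, and the URL is a placeholder.

```python
# Illustrative metadata payload for one chunk.
chunk = {
    "text": "Refunds are issued within 14 days of purchase.",
    "metadata": {
        "title": "Refund Policy",
        "source_url": "https://example.com/policies/refunds",  # placeholder
        "created_at": "2024-01-15",
        "updated_at": "2024-06-02",
        "department": "Support",
        "doc_type": "policy_document",
        "access_tags": ["support", "sales"],  # who may retrieve this chunk
        "version": 3,
    },
}
```

Every field here is filterable at query time, which is what makes "policies updated in 2024, visible to Support" a one-line retriever configuration.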
How do I scale RAG to millions of documents?
Use approximate nearest neighbor search (HNSW, IVF) instead of exact search. Shard your vector database across multiple nodes. Cache frequent queries. Use smaller, faster embedding models. Monitor costs and latency as you scale.
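The query-caching lever is the cheapest to add. A minimal sketch, assuming exact-match caching after normalization; `answer_query` is a stand-in for the full retrieve-and-generate pipeline, and the cache size is illustrative.

```python
from functools import lru_cache

pipeline_calls = []  # instrumentation so the demo shows the cache working

def answer_query(question):
    pipeline_calls.append(question)  # expensive RAG pipeline runs here
    return f"answer to: {question}"

def normalize(query):
    """Collapse case and whitespace so near-identical phrasings share an entry."""
    return " ".join(query.lower().split())

@lru_cache(maxsize=4096)
def _cached(norm_query):
    return answer_query(norm_query)

def ask(query):
    return _cached(normalize(query))

ask("What is the return policy?")
ask("what is the  RETURN policy?")  # cache hit: pipeline is not re-run
```

Semantic caches (matching on embedding similarity rather than exact strings) catch more repeats but add their own lookup latency.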
What's the difference between RAG and knowledge graphs?
RAG retrieves unstructured text chunks. Knowledge graphs retrieve structured entities and relationships. Example: RAG retrieves paragraphs about “Apple Inc.” A knowledge graph retrieves (Apple, founded_by, Steve Jobs) and (Apple, headquarters, Cupertino). Some systems combine both.
Can I update RAG knowledge in real-time?
Yes. Add new documents to your vector database and they’re immediately searchable. No model retraining needed. This is RAG’s biggest advantage over fine-tuning.
How do I handle access controls in RAG?
Attach access control metadata to each chunk (department, role, user ID). Filter retrieval by the current user’s permissions. Example: Only retrieve documents tagged “Engineering” for engineers, “Sales” for sales team. Never return chunks the user isn’t authorized to see.
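A post-retrieval filter along these lines illustrates the rule; the tuple shape and names are assumptions. In production, push the filter into the vector-database query itself so unauthorized chunks are never retrieved at all.

```python
def visible_to(user_tags, hits, k=5):
    """hits: list of (text, score, access_tags). Keep only chunks the
    user holds at least one access tag for, then take the top k."""
    allowed = [h for h in hits if set(h[2]) & set(user_tags)]
    return allowed[:k]

hits = [
    ("Engineering runbook", 0.91, {"engineering"}),
    ("Sales playbook", 0.88, {"sales"}),
]
```

Filtering server-side also keeps `k` meaningful: with post-filtering, a user with narrow permissions can end up with fewer than `k` visible results.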
What's the best embedding model for RAG?
OpenAI's text-embedding-ada-002 is a strong general-purpose default for English. Cohere Embed supports 100+ languages. Open-source models (all-MiniLM, BGE) work for on-premise deployments. Domain-specific models (BioBERT for medical, LegalBERT for law) often outperform general models in specialized fields.
Can RAG work for customer-facing applications?
Yes, if retrieval quality is high and you add validation to prevent hallucinations. Customer-facing RAG needs reranking, hybrid search, confidence scoring, and extensive testing. Internal tools can tolerate occasional wrong answers. Customer-facing tools can’t.
How do I debug poor RAG retrieval?
Log retrieval scores for every query. Low scores mean poor semantic match. Check if the answer exists in your knowledge base. Review chunk boundaries (are you splitting context?). Try hybrid search or query transformation. Test different embedding models.
What's query transformation in RAG?
Rewriting vague queries into specific searches before retrieval. Example: “What’s the return thing?” becomes “What is the return policy for damaged items?” Improves retrieval accuracy for poorly-phrased questions.
Can I combine RAG with SQL databases?
Yes. Use RAG for unstructured knowledge (documents, FAQs) and SQL for structured data (customer records, transactions). Route queries to the right system based on question type. Example: “What’s our revenue?” hits SQL. “What’s our return policy?” hits RAG.
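A naive router can be a keyword heuristic like the sketch below; the keyword list is illustrative, and production routers often use an LLM classifier instead.

```python
# Route structured/analytics questions to SQL, everything else to RAG.
SQL_HINTS = ("revenue", "how many", "total", "average", "count")

def route(question):
    q = question.lower()
    return "sql" if any(hint in q for hint in SQL_HINTS) else "rag"
```

Whichever router you use, log its decisions: misrouted queries ("What's our return policy?" hitting SQL) are a common silent failure mode.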
How do I monitor RAG performance in production?
Log every query, retrieval scores, source documents, LLM response, and user feedback. Track response time (retrieval + generation), success rate (user thumbs up/down), and cost per query. Monitor failed queries (low retrieval scores, negative feedback) and iterate weekly.
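A per-query log record might look like this sketch; the field names are conventions rather than a required schema, and `hits` is assumed to be `(chunk_text, score)` pairs.

```python
import json
import time

def log_query(query, hits, answer, latency_s, feedback=None):
    """Serialize one query's full trace as a JSON line for later analysis."""
    record = {
        "ts": time.time(),
        "query": query,
        "top_score": max((score for _, score in hits), default=0.0),
        "num_sources": len(hits),
        "answer": answer,
        "latency_s": round(latency_s, 3),
        "feedback": feedback,  # thumbs up/down, attached later if available
    }
    return json.dumps(record)

line = log_query("What is the return policy?",
                 [("Refunds within 14 days of purchase.", 0.87)],
                 "Refunds are issued within 14 days.", 1.25)
```

JSON lines like these feed directly into the weekly iteration loop: sort by `top_score` ascending or filter `feedback == "down"` to find the queries to fix first.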
What's the ROI of deploying RAG?
Depends on the use case. Customer support: 60-80% faster response times, 40% fewer escalations. Document review: 70% time savings. Knowledge management: 35% reduction in search time. Calculate hours saved × hourly cost to get monthly ROI.
Can RAG replace Google search for internal documents?
Yes. That’s one of the most common enterprise use cases. RAG provides conversational answers, not just document links. Employees ask questions in natural language and get direct answers with source citations.
How do I transition from RAG pilot to production?
Add reranking and hybrid search for better accuracy. Implement access controls and metadata filtering. Set up observability and logging. Load test at production scale. Integrate with existing auth systems. Monitor costs and optimize. Fast pilots ship in one week. Production hardening takes 2-4 weeks.
What happens when RAG can't find the answer?
Return “I don’t have that information in the provided documents” instead of hallucinating. Log the failed query so you can identify missing knowledge and add it to your database. Production systems should say “I don’t know” rather than make up an answer.