Building a RAG System from Scratch: A Beginner’s Guide
Remember when you had to manually search through tons of documents to find that one specific piece of information? Yeah, those days are over.
In this tutorial, we’ll build a RAG (Retrieval-Augmented Generation) system from scratch. By the end, you’ll have a working system that can answer questions about your documents like magic.
New to RAG? If you want to understand what RAG is and why it’s useful before we dive in, check out my guide: RAG for Beginners: A Simple Guide. It’s a quick read that explains the concept in super simple terms!
We’ll use:
- Docling - to process and understand documents
- LanceDB - to store and search through document chunks
- OpenAI - for embeddings and AI generation
- Streamlit - for the chat interface
Let’s build something cool together!
What We’re Building
We’re going to create a system that can:
- Take a bunch of PDF/Markdown documents (company policies, manuals, etc.)
- Process them into searchable chunks
- Answer questions about those documents through a chat interface
Think of it as giving AI a superpower to instantly find and cite information from your documents.
Prereqs
Before we start, make sure you have:
- Python 3.10 or higher installed
- OpenAI API Key - Get yours at platform.openai.com
- Your API key exported as an environment variable:
export OPENAI_API_KEY="your-api-key-here"
Architecture Overview
Our RAG system works in 5 simple steps:
- Extraction - Convert documents to a format we can process
- Chunking - Split documents into smaller, manageable pieces
- Embedding - Convert chunks to vectors and store in database
- Query - Search for relevant chunks based on user questions
- Chat - Generate AI responses using the retrieved context
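Before we build each piece, here is the whole flow sketched as illustrative Python. The function names are placeholders for the steps, not the exact helpers we'll write:

```python
# Illustrative sketch of the five-step flow; each placeholder
# function corresponds to one step below.
def answer_question(question, documents):
    chunks = []
    for doc in documents:
        extracted = extract(doc)                # 1. Extraction (Docling)
        chunks += split_into_chunks(extracted)  # 2. Chunking
    table = embed_and_store(chunks)             # 3. Embedding (OpenAI + LanceDB)
    context = retrieve(table, question)         # 4. Query
    return generate_answer(context, question)   # 5. Chat (GPT)
```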
Step 1: Document Extraction
First, we use Docling to convert our documents (PDF or Markdown) into a structured format we can work with:
from docling.document_converter import DocumentConverter
# Convert a document
converter = DocumentConverter()
result = converter.convert("path/to/document.pdf")
That’s it! Docling handles:
- Text content extraction
- Page number tracking
- Heading identification
- Table parsing
- Image handling
The result is a structured document object that preserves all this information.
Step 2: Intelligent Chunking
Documents are too large to process all at once. We split them into smaller pieces:
from docling.chunking import HybridChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer
from transformers import AutoTokenizer
# Create a chunker with 128 tokens per chunk
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=128,
)
chunker = HybridChunker(tokenizer=tokenizer, max_tokens=128)
# Split the document
chunks = list(chunker.chunk(dl_doc=result.document))
The HybridChunker intelligently splits at logical boundaries (paragraphs, sections) rather than randomly cutting text.
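To build intuition for why boundary-aware splitting matters, here's a toy comparison in plain Python. It splits on characters and paragraphs instead of tokens and document structure, so it's a rough illustration of the idea, not what HybridChunker actually does:

```python
def naive_split(text, size):
    # Cuts every `size` characters, even mid-word or mid-sentence.
    return [text[i:i + size] for i in range(0, len(text), size)]

def boundary_split(text, size):
    # Greedily packs whole paragraphs up to `size` characters per chunk,
    # so no chunk starts or ends mid-thought.
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) + 2 > size:
            chunks.append(current)
            current = para
        else:
            current = current + "\n\n" + para if current else para
    if current:
        chunks.append(current)
    return chunks

doc = "one two.\n\nthree four.\n\nfive six."
print(naive_split(doc, 20))     # cuts across paragraph boundaries
print(boundary_split(doc, 20))  # keeps paragraphs intact
```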
Step 3: Creating Vector Embeddings
We convert text into numbers (vectors) that capture meaning:
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector
# Setup OpenAI embeddings
func = get_registry().get("openai").create(name="text-embedding-3-small")
# Define the table schema; LanceDB fills `vector` by embedding `text`
class ChunkMetadata(LanceModel):
    filename: str | None
    page_numbers: list[int] | None
    title: str | None
class Chunks(LanceModel):
    text: str = func.SourceField()
    vector: Vector(func.ndims()) = func.VectorField()
    metadata: ChunkMetadata
# Create LanceDB database
db = lancedb.connect("data/lancedb")
table = db.create_table("documents", schema=Chunks, mode="overwrite")
What are embeddings?
- An embedding is a vector of numbers that captures a text's meaning
- Similar texts get similar vectors
- This lets us search semantically (“hotels” finds “accommodation”)
- LanceDB handles all the complexity automatically
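Here's the “similar texts have similar numbers” idea in miniature, using made-up 3-dimensional vectors (real text-embedding-3-small vectors have 1,536 dimensions; the numbers below are invented purely for illustration):

```python
import math

def cosine_similarity(a, b):
    # Similarity of two vectors: 1.0 = same direction, near 0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" for three phrases
hotels = [0.9, 0.1, 0.2]
accommodation = [0.85, 0.15, 0.25]
bicycles = [0.1, 0.9, 0.1]

print(cosine_similarity(hotels, accommodation))  # high (~0.996)
print(cosine_similarity(hotels, bicycles))       # low (~0.24)
```

A vector search engine like LanceDB does essentially this comparison, just at scale and with real embeddings.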
Step 4: Storing with Metadata
We store each chunk along with important information:
processed_chunks = [
    {
        "text": chunk.text,
        "metadata": {
            "filename": chunk.meta.origin.filename,
            # Collect page numbers from the chunk's provenance info
            "page_numbers": sorted({
                prov.page_no
                for item in chunk.meta.doc_items
                for prov in item.prov
            }) or None,
            "title": chunk.meta.headings[0] if chunk.meta.headings else None,
        },
    }
    for chunk in chunks
]
table.add(processed_chunks)
Why metadata matters:
- Users can see where information came from
- Enables citations (e.g., “See page 42 of network-config.pdf”)
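As a small example of putting that metadata to work, here's a helper (my own, not part of any library) that turns stored metadata into the kind of citation shown above:

```python
def format_citation(metadata):
    # Build a human-readable citation from the metadata stored in Step 4.
    if metadata.get("page_numbers"):
        pages = ", ".join(str(p) for p in metadata["page_numbers"])
        return f"See page {pages} of {metadata['filename']}"
    return f"See {metadata['filename']}"

print(format_citation({"filename": "network-config.pdf", "page_numbers": [42]}))
# See page 42 of network-config.pdf
```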
Step 5: Querying the Database
When a user asks a question, we find relevant chunks:
# Search for relevant chunks
results = table.search(question).limit(20).to_pandas()
# Build context with citations
context = ""
for _, row in results.iterrows():
    source = f"\nSource: {row['metadata']['filename']}"
    if row['metadata']['page_numbers']:
        source += f" - p. {', '.join(str(p) for p in row['metadata']['page_numbers'])}"
    context += row['text'] + source + "\n\n"
This gives us:
- Top 20 most relevant chunks
- With file names and page numbers
- In order of relevance
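It's handy to wrap this lookup in a function, since the chat interface in Step 7 calls a helper like this. Here's one possible sketch (the function names and split are my own):

```python
def build_context(rows):
    # rows: list of dicts with "text" and "metadata" keys, i.e. the
    # search results once converted out of the DataFrame.
    context = ""
    for row in rows:
        source = f"\nSource: {row['metadata']['filename']}"
        if row["metadata"].get("page_numbers"):
            pages = ", ".join(str(p) for p in row["metadata"]["page_numbers"])
            source += f" - p. {pages}"
        context += row["text"] + source + "\n\n"
    return context

def get_context(question, table, num_results=20):
    # Search LanceDB (assumes `table` from Step 3) and format the hits.
    results = table.search(question).limit(num_results).to_pandas()
    return build_context(results.to_dict("records"))
```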
Step 6: Generating AI Responses
Now we ask the AI to answer based on what we found:
from openai import OpenAI
client = OpenAI()
# Create a prompt with the context
prompt = f"""Answer the question using only this context:
{context}
Question: {question}
"""
# Get response from GPT
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.5,
)
answer = response.choices[0].message.content
The AI:
- Reads only the provided context
- Answers based on document evidence
- Includes citations naturally
- Uses a lower temperature (0.5) for factual responses
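One way to package this step for reuse in the chat interface is a helper like the sketch below. The system-prompt wording and function names here are my own, not a fixed recipe:

```python
def build_messages(history, context):
    # Prepend a system message that pins the model to the retrieved context.
    system = (
        "Answer based only on the provided context, and cite the "
        "source files and page numbers you use.\n\n"
        f"Context:\n{context}"
    )
    return [{"role": "system", "content": system}] + history

def get_chat_response(history, context):
    from openai import OpenAI  # imported here so build_messages stays standalone
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=build_messages(history, context),
        temperature=0.5,
    )
    return response.choices[0].message.content
```

Passing the whole chat history (not just the latest question) lets the model handle follow-up questions naturally.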
Step 7: Building the Chat Interface
We use Streamlit to create a user-friendly interface:
import streamlit as st
# Initialize chat history on first run
if "messages" not in st.session_state:
    st.session_state.messages = []
# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])
# Get user input
if prompt := st.chat_input("Ask a question"):
    # Add to chat history and display it
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)
    # Get context from database (get_context wraps the search from Step 5)
    context = get_context(prompt, table)
    # Generate response (get_chat_response wraps the call from Step 6)
    response = get_chat_response(st.session_state.messages, context)
    # Display response and remember it for the next turn
    with st.chat_message("assistant"):
        st.markdown(response)
    st.session_state.messages.append({"role": "assistant", "content": response})
This creates:
- A clean chat interface
- Real-time streaming responses
- Conversation history
- Visual search results
The complete code is available on my GitHub.
Try it out and make it your own!