15 minute read

Remember when you had to manually search through tons of documents to find that one specific piece of information? Yeah, those days are over.

In this tutorial, we’ll build a RAG (Retrieval-Augmented Generation) system from scratch. By the end, you’ll have a working system that can answer questions about your documents like magic.

New to RAG? If you want to understand what RAG is and why it’s useful before we dive in, check out my guide: RAG for Beginners: A Simple Guide. It’s a quick read that explains the concept in super simple terms!

We’ll use:

  • Docling - to process and understand documents
  • LanceDB - to store and search through document chunks
  • OpenAI - for embeddings and AI generation
  • Streamlit - for the chat interface

Let’s build something cool together!

What We’re Building

We’re going to create a system that can:

  1. Take a bunch of PDF/Markdown documents (company policies, manuals, etc.)
  2. Process them into searchable chunks
  3. Answer questions about those documents through a chat interface

Think of it as giving AI a superpower to instantly find and cite information from your documents.

Prereqs

Before we start, make sure you have:

  1. Python 3.10 or higher installed
  2. OpenAI API Key - Get yours at platform.openai.com
  3. Your key exported as an environment variable: export OPENAI_API_KEY="your-api-key-here"

Architecture Overview

Our RAG system works in five stages, which we'll build over the next seven steps:

  1. Extraction - Convert documents to a format we can process
  2. Chunking - Split documents into smaller, manageable pieces
  3. Embedding - Convert chunks to vectors and store in database
  4. Query - Search for relevant chunks based on user questions
  5. Chat - Generate AI responses using the retrieved context

Step 1: Document Extraction

First, we use Docling to convert our documents (PDF or Markdown) into a structured format we can process:

from docling.document_converter import DocumentConverter

# Convert a document
converter = DocumentConverter()
result = converter.convert("path/to/document.pdf")

That’s it! Docling handles:

  • Text content extraction
  • Page number tracking
  • Heading identification
  • Table parsing
  • Image handling

The result is a structured document object that preserves all this information.

Step 2: Intelligent Chunking

Whole documents are too large to embed or feed to a model at once, so we split them into smaller pieces:

from transformers import AutoTokenizer
from docling.chunking import HybridChunker
from docling_core.transforms.chunker.tokenizer.huggingface import HuggingFaceTokenizer

# Create a chunker with 128 tokens per chunk
tokenizer = HuggingFaceTokenizer(
    tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
    max_tokens=128,
)
chunker = HybridChunker(tokenizer=tokenizer, max_tokens=128)

# Split the document
chunks = list(chunker.chunk(dl_doc=result.document))

The HybridChunker intelligently splits at logical boundaries (paragraphs, sections) rather than randomly cutting text.
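
To build intuition for what "splitting at logical boundaries" means, here's a toy paragraph-packing chunker. This is a simplified illustration, not the real HybridChunker:

```python
def chunk_by_paragraph(text: str, max_words: int = 50) -> list[str]:
    """Pack whole paragraphs into chunks instead of cutting mid-sentence."""
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        # Start a new chunk when the next paragraph would exceed the budget
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

print(chunk_by_paragraph("First paragraph here.\n\nSecond paragraph here.", max_words=3))
```

The real chunker works on tokens and document structure (headings, sections), but the idea is the same: respect natural boundaries so each chunk stays coherent.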

Step 3: Creating Vector Embeddings

We convert text into numbers (vectors) that capture meaning:

import lancedb
from lancedb.embeddings import get_registry

# Setup OpenAI embeddings
func = get_registry().get("openai").create(name="text-embedding-3-small")

# Create LanceDB database
db = lancedb.connect("data/lancedb")
table = db.create_table("documents", schema=Chunks, mode="overwrite")

Here, Chunks is a pydantic LanceModel schema declaring a text source field, a vector field, and our metadata; LanceDB uses func to embed text into vector automatically when rows are added.

What are embeddings?

  • Similar texts have similar numbers
  • This lets us search semantically (“hotels” finds “accommodation”)
  • LanceDB handles all the complexity automatically

Step 4: Storing with Metadata

We store each chunk along with important information:

processed_chunks = [
    {
        "text": chunk.text,
        "metadata": {
            "filename": chunk.meta.origin.filename,
            "page_numbers": [...],  # List of page numbers
            "title": chunk.meta.headings[0] if chunk.meta.headings else None,
        },
    }
    for chunk in chunks
]

table.add(processed_chunks)

Why metadata matters:

  • Users can see where information came from
  • Enables citations (e.g., “See page 42 of network-config.pdf”)

Step 5: Querying the Database

When a user asks a question, we find relevant chunks:

# Search for relevant chunks
results = table.search(question).limit(20).to_pandas()

# Build context with citations
context = ""
for _, row in results.iterrows():
    source = f"\nSource: {row['metadata']['filename']}"
    if row['metadata']['page_numbers']:
        source += f" - p. {', '.join(str(p) for p in row['metadata']['page_numbers'])}"
    context += row['text'] + source + "\n\n"

This gives us:

  • Top 20 most relevant chunks
  • With file names and page numbers
  • In order of relevance
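
It's handy to wrap this search-and-format logic in a helper; the Streamlit app in Step 7 assumes a get_context function shaped roughly like this sketch:

```python
def get_context(question: str, table, num_results: int = 20) -> str:
    """Search the table and format matching chunks with source citations."""
    results = table.search(question).limit(num_results).to_pandas()
    parts = []
    for _, row in results.iterrows():
        source = f"Source: {row['metadata']['filename']}"
        if row["metadata"]["page_numbers"]:
            pages = ", ".join(str(p) for p in row["metadata"]["page_numbers"])
            source += f" - p. {pages}"
        parts.append(f"{row['text']}\n{source}")
    return "\n\n".join(parts)
```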

Step 6: Generating AI Responses

Now we ask the AI to answer based on what we found:

from openai import OpenAI

client = OpenAI()

# Create a prompt with the context
prompt = f"""Answer the question using only this context:

{context}

Question: {question}
"""

# Get response from GPT
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.5
)

answer = response.choices[0].message.content

The AI:

  • Reads only the provided context
  • Answers based on document evidence
  • Includes citations naturally
  • Uses a lower temperature (0.5) for factual responses

Step 7: Building the Chat Interface

We use Streamlit to create a user-friendly interface:

import streamlit as st

# Initialize chat history on first load
if "messages" not in st.session_state:
    st.session_state.messages = []

# Display chat history
for message in st.session_state.messages:
    with st.chat_message(message["role"]):
        st.markdown(message["content"])

# Get user input
if prompt := st.chat_input("Ask a question"):
    # Add to chat history and display it
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    # Get context from database
    context = get_context(prompt, table)

    # Generate response
    response = get_chat_response(st.session_state.messages, context)

    # Display response and remember it for the next turn
    with st.chat_message("assistant"):
        st.markdown(response)
    st.session_state.messages.append({"role": "assistant", "content": response})

This creates:

  • A clean chat interface
  • Real-time streaming responses
  • Conversation history
  • Visual search results

The complete code is available on my GitHub.

Try it out and make it your own!