
Doc Query - RAG Pipeline Demonstration

A Retrieval-Augmented Generation (RAG) pipeline that enables semantic search and question-answering over PDF documents and Wikipedia articles.

What is RAG?

Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models (LLMs) by providing them with relevant context from external documents. Instead of relying solely on the model's training data, RAG:

  1. Retrieves relevant document chunks based on semantic similarity to the user's question
  2. Augments the LLM prompt with these retrieved chunks as context
  3. Generates an answer grounded in the actual document content

This approach allows LLMs to answer questions about private documents, recent information, or specialized content they weren't trained on, while providing source citations for transparency.
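The three steps above can be sketched in a few lines of Python. The chunk store and word-overlap scorer here are toy stand-ins for the real embedding search, and the final LLM call is omitted; this only illustrates the retrieve-and-augment shape:

```python
def retrieve(question, chunks, top_k=2):
    # Toy relevance score: word overlap. The real pipeline scores by
    # embedding similarity instead.
    q_words = set(question.lower().split())
    scored = sorted(chunks,
                    key=lambda c: len(q_words & set(c.lower().split())),
                    reverse=True)
    return scored[:top_k]

def augment(question, context_chunks):
    # Build the prompt that grounds the LLM in the retrieved context.
    context = "\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

chunks = [
    "Jim Hawkins finds the treasure map in Billy Bones's sea chest.",
    "Long John Silver leads the mutiny aboard the Hispaniola.",
    "The doctor tends to the wounded after the stockade battle.",
]
question = "Who leads the mutiny?"
prompt = augment(question, retrieve(question, chunks))
# Step 3 would send `prompt` to the LLM to generate the grounded answer.
print(prompt)
```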

Features

  • PDF Document Ingestion: Load and process PDF files using PyPDFLoader
  • Wikipedia Ingestion: Query and load Wikipedia articles directly
  • Text Chunking: Split documents into overlapping chunks for optimal retrieval
  • Vector Embeddings: Generate embeddings using OpenAI's text-embedding-3-small model
  • Pinecone Vector Store: Store and query embeddings in Pinecone's serverless vector database
  • Semantic Search: Find relevant document chunks based on meaning, not just keywords
  • Cross-Encoder Reranking: Two-stage retrieval with reranking for improved accuracy
    • Stage 1: Retrieve top-N candidates using fast vector similarity
    • Stage 2: Rerank using cross-encoder model for precise relevance scoring
  • GPT-4o Integration: Generate answers with source citations (page numbers)

Architecture

User Question
      |
      v
[Vector Similarity Search] --> Retrieve top-30 candidate chunks
      |
      v
[Cross-Encoder Reranking] --> Rerank and select top-10 most relevant
      |
      v
[Prompt Construction] --> Combine question + context + instructions
      |
      v
[GPT-4o] --> Generate answer with citations
      |
      v
Answer with [p. nn] citations
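The two-stage shape of the retrieval step can be expressed generically. The scoring functions below are deliberately fake stand-ins: in the real pipeline, stage 1 is Pinecone vector similarity and stage 2 is a sentence-transformers cross-encoder:

```python
def two_stage_retrieve(question, corpus, fast_score, precise_score,
                       n_candidates=30, top_k=10):
    # Stage 1: cheap score over the whole corpus, keep a generous short list.
    candidates = sorted(corpus, key=lambda d: fast_score(question, d),
                        reverse=True)[:n_candidates]
    # Stage 2: expensive pairwise score over the short list only.
    return sorted(candidates, key=lambda d: precise_score(question, d),
                  reverse=True)[:top_k]

# Toy scorers: the fast pass thinks doc 42 is most relevant,
# but the (more accurate) reranker corrects the ordering around doc 40.
corpus = [f"doc {i}" for i in range(100)]
fast = lambda q, d: -abs(int(d.split()[1]) - 42)
precise = lambda q, d: -abs(int(d.split()[1]) - 40)
print(two_stage_retrieve("q", corpus, fast, precise, top_k=3))
# -> ['doc 40', 'doc 41', 'doc 39']
```

The point of the split is cost: the precise scorer runs on only 30 candidates rather than the whole index, so reranking stays fast regardless of corpus size.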

Setup

Prerequisites

  • Python 3.10+
  • Pinecone account (free tier available)
  • OpenAI API key

Environment Setup

  1. Clone and navigate to the project:

    git clone https://github.com/markgewhite/doc_query.git
    cd doc_query
  2. Create and activate virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt
  4. Configure API keys:

    Create a .env file in the project root:

    OPENAI_API_KEY=your-openai-api-key
    PINECONE_API_KEY=your-pinecone-api-key
    
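To sanity-check that the keys load correctly, the following stdlib-only sketch mimics what a .env loader such as python-dotenv does (the `.env.example` file and demo values are created purely for illustration; never hard-code real keys):

```python
import os

def load_env(path=".env"):
    # Minimal .env parser -- python-dotenv does this more robustly.
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                os.environ.setdefault(key.strip(), value.strip())

# Demo against a throwaway file so real keys never appear in code.
with open(".env.example", "w") as f:
    f.write("OPENAI_API_KEY=sk-demo\nPINECONE_API_KEY=pc-demo\n")
load_env(".env.example")
print(sorted(k for k in os.environ if k.endswith("_API_KEY")))
```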

Running the Notebook

# Register the virtual environment with Jupyter (first time only)
pip install ipykernel
python -m ipykernel install --user --name=doc_query --display-name "Python (doc_query)"

# Start Jupyter
jupyter notebook doc_query.ipynb

Running the Streamlit App

streamlit run app.py

The app will open in your browser at http://localhost:8501.

Usage

Notebook Usage

  1. Place your PDF document in an accessible location
  2. Update the file path in the load_document() call
  3. Run cells sequentially to:
    • Load and chunk the document
    • Create/load Pinecone index with embeddings
    • Query the document with natural language questions
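The chunking in step 3 can be pictured with a simplified splitter. The sizes below are illustrative defaults, not necessarily the notebook's settings, and a real text splitter would also prefer to break on natural boundaries rather than at fixed character offsets:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    # Overlapping fixed-size windows: each chunk starts chunk_size - overlap
    # characters after the previous one, so neighbouring chunks share context
    # and a sentence split across a boundary is still retrievable whole.
    step = chunk_size - overlap
    return [text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)]

print([len(c) for c in chunk_text("x" * 2500)])  # -> [1000, 1000, 900]
```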

Streamlit App Usage

  1. Upload a PDF document using the file uploader
  2. Wait for the document to be processed and indexed
  3. Enter your question in the text input
  4. View the answer with source citations

Example Queries and Outputs

Using "Treasure Island" PDF as the source document:

Query: "What does Jim fear most?"

Output:

Jim fears torture the most. This is evident from the excerpt where he expresses his fear to the doctor: "but what I fear is torture. If they come to torture me --" [p. 124]. This statement highlights that, despite the various dangers and threats he faces, the prospect of being tortured is what he dreads the most.

Query: "What is Long John Silver's motivation?"

Output:

Long John Silver's primary motivation appears to be the acquisition of treasure. This is evident from his direct statement: "Well, here it is," said Silver. "We want that treasure, and we'll have it -- that's our point!" [p. 80]. His desire for the treasure is further emphasized when he insists on obtaining the chart from the captain: "What I mean is, we want your chart. Now, I never meant you no harm, myself." [p. 80].

Cost Estimate

Costs per document ingestion and query session:

Operation      Model / Service          Cost
Embeddings     text-embedding-3-small   ~$0.002 per 100K tokens
LLM queries    GPT-4o                   ~$0.005 per query (1.5K input + 100 output tokens)
Vector store   Pinecone serverless      Free for small indexes (free tier)

Example: Processing a 142-page novel (Treasure Island):

  • Embedding cost: ~$0.002 (113K tokens)
  • Per query cost: ~$0.005
  • Total for 10 queries: ~$0.05
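The arithmetic behind those figures, using the rates from the table above (these are the document's quoted rates, not live pricing):

```python
embed_rate = 0.002 / 100_000       # $ per token, text-embedding-3-small
embed_cost = 113_000 * embed_rate  # whole-novel embedding: ~$0.0023
per_query = 0.005                  # GPT-4o, ~1.5K input + 100 output tokens

total = embed_cost + 10 * per_query
print(f"~${total:.2f} for ingestion plus 10 queries")  # -> ~$0.05
```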

Deployment to Streamlit Cloud

  1. Push to GitHub:

    git init
    git add .
    git commit -m "Initial commit"
    git remote add origin https://github.com/yourusername/doc-query.git
    git push -u origin main
  2. Deploy on Streamlit Cloud:

    • Go to share.streamlit.io
    • Click "New app"
    • Connect your GitHub repository
    • Set the main file path to app.py
    • Add secrets in the Streamlit Cloud dashboard:
      • OPENAI_API_KEY
      • PINECONE_API_KEY
  3. Secrets Configuration:

    In Streamlit Cloud, add secrets via the dashboard (Settings > Secrets):

    OPENAI_API_KEY = "your-openai-api-key"
    PINECONE_API_KEY = "your-pinecone-api-key"
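To keep a single code path for local development (.env) and Streamlit Cloud (dashboard secrets), a small helper along these lines works in both environments. This helper is hypothetical, not part of the repo, and the demo value is for illustration only:

```python
import os

def get_key(name):
    # On Streamlit Cloud, secrets come from st.secrets; locally they come
    # from the environment (e.g. loaded from .env). The try/except keeps
    # this working when streamlit isn't installed or no secrets file exists.
    try:
        import streamlit as st
        if name in st.secrets:
            return st.secrets[name]
    except Exception:
        pass
    return os.getenv(name)

os.environ["PINECONE_API_KEY"] = "pc-demo"  # demo value only
print(get_key("PINECONE_API_KEY"))
```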

Project Structure

doc_query/
├── app.py              # Streamlit web application
├── doc_query.ipynb     # Jupyter notebook with full pipeline
├── requirements.txt    # Python dependencies
├── README.md           # This file
├── .env                # API keys (not committed)
├── .gitignore          # Git ignore rules
└── venv/               # Virtual environment (not committed)

Key Technologies

  • LangChain: Document loaders, text splitters, prompt templates, and chains
  • OpenAI: Embeddings (text-embedding-3-small) and LLM (GPT-4o)
  • Pinecone: Serverless vector database for similarity search
  • Sentence Transformers: Cross-encoder models for reranking
  • Streamlit: Web application framework

License

This project is for educational purposes as part of the ZTM LLM Web Apps course.
