pgvector RAG Ingestion

What

A startup-time CREATE EXTENSION IF NOT EXISTS vector step run before Base.metadata.create_all, followed by an async ingestion script that PDF-chunks, embeds, and writes into a pgvector column. The extension must exist BEFORE the table is created so the Vector column type resolves.

Project Context

In full_project_context_updated.txt, app/database.py::init_db() enables the extension in the same transaction as the table-creation step. ingest.py and app/services/rag_service.py::ingest_pdf consume that extension, splitting a FCA faqs.pdf into chunks and writing into pgvector columns behind the FAQ table. The docker-compose.yml uses the pgvector/pgvector:pg15 image specifically to ship the extension pre-installed in the container.

How

Startup-time extension bootstrap


async def init_db() -> None:
    async with engine.begin() as conn:
        # MUST run before table creation so the vector column type resolves
        await conn.execute(text("CREATE EXTENSION IF NOT EXISTS vector;"))
 
        await conn.run_sync(Base.metadata.create_all)

CREATE EXTENSION IF NOT EXISTS is idempotent; it is safe to call on every startup. The transaction (engine.begin()) ensures the extension is committed before the metadata create-all sees the new column type.
text(...) is the SQLAlchemy 2.0 wrapper around raw SQL string execution — bare strings are no longer accepted.
conn.run_sync(Base.metadata.create_all) is the bridge for the sync metadata API: SQLAlchemy metadata is sync, but the connection it runs on is async.

Async ingestion entry point


async def run_ingestion():
    rag = RAGService()
    if not os.path.exists("data/FCA faqs.pdf"):
        print("File not found; save PDF inside 'data' folder.")
        return
    try:
        chunks = await rag.ingest_pdf("data/FCA faqs.pdf")
        print(f"Ingested {chunks} document chunks into pgvector.")
    except Exception as e:
        print(f"Failed: {e}")

RAGService() is constructed per call — its DB session is created inside ingest_pdf and closed before return, so no connection leaks across runs.
The PDF path is constant; in production this would come from an env var or S3/GCS URL.

Common Pitfalls

Creating tables before the extension raises type "vector" does not exist because column type references an extension that is not yet loaded. Always enable in the same transaction before metadata.create_all.

Confusing extension name casing — CREATE EXTENSION vector (lowercase) is the canonical spelling. Use uppercase VECTOR only if your role was granted access to a custom alias.

Real-World Interview Prep

Q1: When would you choose pgvector over a dedicated vector DB (Pinecone, Weaviate, Qdrant)?

A: Three signals favour pgvector. (1) Your vectors live next to the relational data you re-rank against (e.g., user-owned chunks joined with customer rows in the same query — pgvector lets SQL do it). (2) You have an existing Postgres operational footprint and don’t want to operate a second database. (3) Your corpus is < ~10M vectors and you can tolerate ~50ms @ k=10 retrieval. Choose a dedicated VDB when (a) corpus > 50M, (b) you need HNSW at 100M+ scale, (c) you need sharding/replication that Postgres doesn’t give you cleanly. Hybrid pattern: pgvector for hot filtered retrieval + Weaviate for cold batch search.

Q2: Walk through your chunking strategy for an FCA FAQ PDF.

A: Four rules. (1) Chunk size 500-1000 tokens, overlap 10-15%; too small = loses context, too big = dilutes the match signal. (2) Chunk on semantic boundaries, not just character count — RecursiveCharacterTextSplitter(separators=["\n\n", "\n", ". ", " "]) is the right default. (3) Preserve metadata (page number, section header) as sidecars so the LLM can cite Page 4 of the FAQ without re-prompting. (4) Deduplicate near-identical chunks before embedding (<5% cosine distance → drop) — FAQ PDFs repeat content; without dedup you waste embedding tokens and skew ANN search. The FCA ingestion script reads a single PDF and chunks by paragraph fallback; for production, swap in the una-semantic splitter and a dedup step before ingest_pdf.

Q3: How do you debug “my RAG returns empty results” on production?

A: Walk three layers. (1) Verify rows exist: SELECT COUNT(*) FROM faq WHERE embedding IS NOT NULL; — emptiness here means the ingestion script failed silently. (2) Verify the query embedding matches the stored embeddings by computing cosine on a sample you know should match. (3) Verify the SQL filter hits — WHERE category = 'fees' may be wrong; check the chunk’s metadata. (4) Check the prompt: is the agent sending the raw user message or the embedding of a rewritten query? Query rewriting (HyDE, multi-query) controls retrieval quality more than the index does. Without HyDE you rely on exact phrasing similarity, which is brittle for natural-language queries.

Top-to-Bottom Code Walkthrough (`app/services/rag_service.py` + `ingest.py`)

RAG (Retrieval-Augmented Generation) is the pattern of “bolt documents onto an LLM so it can answer questions without hallucinating”. pgvector is the Postgres-native way to store and search vector embeddings.

`ingest.py`

The one-shot script:


import asyncio
from app.services.rag_service import RAGService
 
async def run_ingestion():
    rag = RAGService()
    chunks = await rag.ingest_pdf("data/FCA faqs.pdf")
    print(f"Ingested {chunks} chunks")
 
if __name__ == "__main__":
    asyncio.run(run_ingestion())

Run with python ingest.py before starting the app. This populates the document_chunks table.

`RAGService` (`app/services/rag_service.py`)

Constructor:

Constructs an async SQLAlchemy session via AsyncSessionLocal.
Lazily loads the embedding model — either HuggingFace sentence-transformers/all-MiniLM-L6-v2 (local, free) or a remote model via Groq’s embedding endpoint.

ingest_pdf(file_path) -> int — the bulk-load path:

reader = PyPDF2.PdfReader(file_path) — page-by-page text extractor.
For each page, text = page.extract_text() — runs OCR-style extraction on the PDF text layer (won’t OCR scanned images).
Chunk the text: 500-token chunks with 50-token overlap. The overlap is critical — without it, a sentence that straddles chunks is invisible to retrieval.
Embed each chunk: vector = embedding_model.encode(chunk_text) — produces a 384-dimensional float list (for MiniLM-L6-v2).

Insert into Postgres:


await session.execute(text("""
  INSERT INTO document_chunks (source, page, content, embedding)
  VALUES (:source, :page, :content, :embedding::vector)
"""), {"source": "FCA faqs.pdf", "page": page_num, "content": chunk_text, "embedding": vector})

The ::vector cast tells Postgres “treat this Python list[float] as a vector(384) column”.

Commit and return chunk count.

The vector extension must exist before any vector(N) column:


CREATE EXTENSION IF NOT EXISTS vector;

This SQL is in app/database.py’s init_db() — it runs before create_all.

query(question, top_k=5) -> list[Chunk] — the search path:

Embed the question.

Run pgvector’s cosine distance operator:


SELECT content, source, page, embedding <=> :q_vec AS distance
FROM document_chunks
ORDER BY embedding <=> :q_vec
LIMIT 5

<=> is cosine distance (smaller = more similar).

Returns the top-k chunks with their distance scores.

build_prompt(question, chunks) -> str — formats the prompt for the LLM:


prompt = f"""Answer the question using ONLY the context below.
If the answer is not in the context, say "I don't know".
 
Question: {question}
 
Context:
{chr(10).join(f"- {c.content}" for c in chunks)}
"""

The “ONLY the context” instruction reduces hallucination. Without it, the LLM will sometimes invent answers even with relevant context in front of it.

Common Pitfalls

Running ingest.py twice without DELETE FROM document_chunks WHERE source = 'X' creates duplicates. Make ingestion idempotent via UPSERT or explicit clear-first.

Using the same embedding model for ingestion and query is mandatory — mixing models returns garbage (384-dim vector vs 768-dim vector).

Embedding length mismatch — pgvector column type must match the model’s output dimension. MiniLM-L6-v2 = 384; all-mpnet-base-v2 = 768.

Real-World Interview Prep

Q1: Why pgvector instead of Pinecone / Weaviate?

A: Data sovereignty + cost + simplicity. FCA-grade data must stay in your DB. pgvector keeps embeddings with their row, lets you JOIN on metadata, and runs vector search in the same query plan as your other filters.

Q2: When is cosine vs L2 vs inner-product the right metric?

A: Use cosine (default) when document lengths vary — it ignores magnitude. Use inner product when your embeddings are normalised. Never L2 unless you specifically know why.

Q3: How do you keep ingested embeddings in sync with source documents?

A: Hash the source PDF’s content; if the hash differs, re-ingest. Store the hash in the document_chunks.source_metadata column. Cron job compares daily.

pgvector RAG Ingestion

What

Project Context

How

Startup-time extension bootstrap

Async ingestion entry point

Common Pitfalls

Real-World Interview Prep

Q1: When would you choose pgvector over a dedicated vector DB (Pinecone, Weaviate, Qdrant)?

Q2: Walk through your chunking strategy for an FCA FAQ PDF.

Q3: How do you debug “my RAG returns empty results” on production?

Top-to-Bottom Code Walkthrough (app/services/rag_service.py + ingest.py)

ingest.py

RAGService (app/services/rag_service.py)

Common Pitfalls

Real-World Interview Prep

Q1: Why pgvector instead of Pinecone / Weaviate?

Q2: When is cosine vs L2 vs inner-product the right metric?

Q3: How do you keep ingested embeddings in sync with source documents?

Top-to-Bottom Code Walkthrough (`app/services/rag_service.py` + `ingest.py`)

`ingest.py`

`RAGService` (`app/services/rag_service.py`)