Building RAG from Scratch — Embeddings, pgvector, and a Bug Worth Knowing
RAG sounds complex until you break it into its actual steps:
Query → embed query → search vector store → retrieve top N chunks → prompt + chunks → generate
At its core, it's a retrieval problem with a generation step at the end. The model doesn't have access to your data — it reasons over whatever you include in the prompt. RAG is the mechanism for deciding what to include.
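The steps above can be sketched as a single function with the three external calls injected. This is a minimal sketch, not the actual implementation: the `Chunk` and `RagDeps` shapes, the function names, and the prompt wording are all hypothetical and stand in for whatever embedding, search, and generation clients you use.

```typescript
// Hypothetical shapes for illustration; none of these names come from a real SDK.
interface Chunk {
  id: string;
  content: string;
  similarity: number;
}

interface RagDeps {
  embed: (text: string) => Promise<number[]>;
  search: (queryEmbedding: number[], topN: number) => Promise<Chunk[]>;
  generate: (prompt: string) => Promise<string>;
}

// Prompt + chunks: the model only sees what is placed here.
function buildPrompt(question: string, chunks: Chunk[]): string {
  const context = chunks
    .map((chunk, i) => `[${i + 1}] ${chunk.content}`)
    .join("\n");
  return `Answer using only the context below.\n\nContext:\n${context}\n\nQuestion: ${question}`;
}

// Query -> embed query -> search vector store -> prompt + chunks -> generate
async function answer(question: string, deps: RagDeps, topN = 5): Promise<string> {
  const queryEmbedding = await deps.embed(question);
  const chunks = await deps.search(queryEmbedding, topN);
  return deps.generate(buildPrompt(question, chunks));
}
```

Injecting `embed`, `search`, and `generate` keeps each step replaceable with a stub, so retrieval can be tested without calling a model.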
What Embeddings Actually Are
An embedding is a numerical representation of text — a list of floats (a vector) that captures semantic meaning. Text with similar meaning produces vectors that are close together in high-dimensional space.
This is what makes semantic search work. When you embed a query and search for the stored vectors nearest to it, "nearest" means semantically similar — not lexically similar. The query "how do I cancel my subscription" will find documents about "account cancellation" and "ending a membership" even if neither phrase appears in the query.
Traditional keyword search matches words. Embedding-based search matches meaning. That distinction matters a lot when users phrase things differently than your documentation does.
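To make "close together" concrete, here is a small sketch of cosine similarity, one common way to compare two embeddings. The three-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions and come from an embedding model, not from hand-picked numbers.

```typescript
// Cosine similarity: 1 = same direction, 0 = orthogonal (unrelated), -1 = opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings", invented for this example only.
const cancelSubscription = [0.9, 0.1, 0.0];
const accountCancellation = [0.8, 0.2, 0.1];
const pizzaRecipe = [0.0, 0.1, 0.9];

console.log(cosineSimilarity(cancelSubscription, accountCancellation)); // close to 1
console.log(cosineSimilarity(cancelSubscription, pizzaRecipe));         // close to 0
```

pgvector's `<=>` operator returns cosine distance, which is 1 minus this value. That is why the queries later in this post compute `1 - (embedding <=> ...)` to get a similarity score.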
The Stack
For a basic RAG implementation in TypeScript/Node.js:
Anthropic API for the generation step
OpenAI embeddings for creating and querying vectors (Anthropic's SDK doesn't expose an embeddings API, so OpenAI fills that gap)
pgvector on PostgreSQL for the vector store
NDJSON streaming to push results to the client incrementally
The pgvector setup is straightforward. It's a Postgres extension that adds a vector column type and similarity search operators. You store your document chunks with their embeddings, then query for the closest matches at retrieval time.
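A rough sketch of what that setup might look like from Node.js. The `documents` table and its `id`, `content`, and `embedding` columns match the queries later in this post; the column types and the 1536 dimension are assumptions and must match the embedding model you actually use. `toVectorLiteral` is a hypothetical helper name.

```typescript
// Assumed schema: column types and the 1536 dimension are illustrative, not from the original project.
const SETUP_SQL = `
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS documents (
  id        bigserial PRIMARY KEY,
  content   text NOT NULL,
  embedding vector(1536) NOT NULL
);
`;

// pgvector accepts vectors in text form as '[0.1,0.2,...]', so a number[] can be
// serialized to that literal and passed as an ordinary string parameter.
function toVectorLiteral(embedding: number[]): string {
  if (embedding.some((x) => !Number.isFinite(x))) {
    throw new Error("Embedding contains a non-finite value");
  }
  return `[${embedding.join(",")}]`;
}

// $2 is the string produced by toVectorLiteral, cast to vector on the database side.
const INSERT_SQL =
  "INSERT INTO documents (content, embedding) VALUES ($1, $2::vector)";

console.log(toVectorLiteral([0.1, 0.2, 3])); // [0.1,0.2,3]
```

With a Postgres client such as node-postgres, the insert would then be something like `client.query(INSERT_SQL, [chunkText, toVectorLiteral(embedding)])`.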
Balancing the Similarity Threshold
Every RAG implementation needs a similarity threshold — a cutoff below which retrieved chunks are considered too dissimilar to be relevant.
Setting this wrong in either direction causes real problems:
Too high: You filter out chunks that are relevant but not close to an exact phrasing match. The model gets less context than it should and either makes things up or says it doesn't know.
Too low: You retrieve chunks that aren't genuinely relevant, adding noise that degrades the generated response and costs extra tokens.
There's no universal right answer here. The threshold needs to be tuned against real queries from your specific use case. Start conservative (higher threshold, fewer results) and loosen it as you observe misses.
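One way to do that tuning is to label a handful of retrieved chunks by hand and see what each candidate threshold would have kept. The sketch below is an assumption about how you might structure that check, not part of the original module; the `LabeledHit` shape and the sample scores are invented for illustration.

```typescript
// One retrieved chunk for a test query, with a human label saying whether it was relevant.
interface LabeledHit {
  similarity: number;
  relevant: boolean;
}

interface ThresholdReport {
  threshold: number;
  precision: number; // of the chunks kept, the fraction that were relevant
  recall: number;    // of the relevant chunks, the fraction that were kept
}

function evaluateThreshold(hits: LabeledHit[], threshold: number): ThresholdReport {
  const kept = hits.filter((h) => h.similarity > threshold);
  const keptRelevant = kept.filter((h) => h.relevant).length;
  const totalRelevant = hits.filter((h) => h.relevant).length;
  return {
    threshold,
    // Treat "nothing kept" and "nothing relevant" as perfect scores to avoid dividing by zero.
    precision: kept.length === 0 ? 1 : keptRelevant / kept.length,
    recall: totalRelevant === 0 ? 1 : keptRelevant / totalRelevant,
  };
}

// Made-up labels, for illustration only.
const sampleHits: LabeledHit[] = [
  { similarity: 0.82, relevant: true },
  { similarity: 0.74, relevant: true },
  { similarity: 0.66, relevant: false },
  { similarity: 0.61, relevant: true },
  { similarity: 0.48, relevant: false },
];

for (const t of [0.8, 0.7, 0.6, 0.5]) {
  console.log(evaluateThreshold(sampleHits, t));
}
```

In this toy data, a threshold of 0.7 keeps only relevant chunks but misses one, while 0.6 recovers the miss at the cost of one irrelevant chunk. That is the trade-off described above, made visible in numbers.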
The pgvector Bug That Trips You Up
Here's the production debugging story that makes this post worth reading.
When I was building the RAG module, I hit a case where the similarity search was returning empty results even for queries that clearly matched stored documents. The data was there. The embeddings were correct. The query looked right.
The culprit: a known pgvector behavior where referencing the same parameterized vector expression more than once in a single query causes it to return nothing.
This query fails silently:
SELECT id, content
FROM documents
WHERE 1 - (embedding <=> $1::vector) > $2
ORDER BY embedding <=> $1::vector
The vector $1::vector is referenced twice: once in the WHERE clause and once in the ORDER BY. In my case, that repeated reference is what made the query come back empty.
The fix is a subquery that evaluates the expression once and references the result:
SELECT id, content, similarity
FROM (
  SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
  FROM documents
) AS ranked
WHERE similarity > $2
ORDER BY similarity DESC
This pattern evaluates the vector expression a single time in the inner query, then filters and sorts against the pre-computed similarity score in the outer query. The results come back correctly.
The practical rule: never reference the same parameterized vector more than once in a single pgvector query.
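A small sketch of how that rule might be enforced in application code. `buildSimilarityQuery` and `countPlaceholder` are hypothetical helpers, and the `LIMIT $3` clause is an addition for capping results at the top N that is not in the query above. The returned object uses the `{ text, values }` query config shape that node-postgres accepts.

```typescript
// Query config in the { text, values } shape that node-postgres accepts.
interface SimilarityQuery {
  text: string;
  values: [string, number, number];
}

// Hypothetical helper: builds the subquery form so $1 appears exactly once.
function buildSimilarityQuery(
  queryEmbedding: number[],
  threshold: number,
  limit: number
): SimilarityQuery {
  const text = `
    SELECT id, content, similarity
    FROM (
      SELECT id, content, 1 - (embedding <=> $1::vector) AS similarity
      FROM documents
    ) AS ranked
    WHERE similarity > $2
    ORDER BY similarity DESC
    LIMIT $3`; // LIMIT is an addition for this sketch
  // pgvector's text input format: '[0.1,0.2,...]'
  const vectorLiteral = `[${queryEmbedding.join(",")}]`;
  return { text, values: [vectorLiteral, threshold, limit] };
}

// Counts occurrences of a substring. Good enough for $1..$9; "$1" would also match "$10".
function countPlaceholder(sql: string, placeholder: string): number {
  return sql.split(placeholder).length - 1;
}

const exampleQuery = buildSimilarityQuery([0.1, 0.2], 0.7, 5);
console.log(countPlaceholder(exampleQuery.text, "$1")); // 1
```

With a node-postgres `Pool`, the call would look something like `await pool.query(buildSimilarityQuery(embedding, 0.7, 5))`. A unit test that asserts `countPlaceholder(query.text, "$1") === 1` catches a regression to the double-reference form before it reaches production.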
NDJSON for Streaming Mixed Content
Basic streaming pushes raw text strings to the client. RAG adds a retrieval step before generation — and the client needs to know about both. What got retrieved? When does generation start?
The answer is NDJSON (newline-delimited JSON): each chunk pushed through the stream is a JSON object with a type field:
// Retrieval result
controller.enqueue(encoder.encode(JSON.stringify({ type: "sources", data: retrievedChunks }) + "\n"));
// Generated text
controller.enqueue(encoder.encode(JSON.stringify({ type: "text", delta: chunk.delta.text }) + "\n"));
The client splits incoming data on newlines and parses each line independently. A partial reader.read() result gets buffered until the next \n arrives. This is also why TextEncoder becomes necessary here — ReadableStream expects Uint8Array, and NDJSON requires explicit encoding rather than relying on runtime tolerance for plain strings.
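The client-side buffering can be sketched as a small stateful parser. This is one possible shape, not the original client code: `createNdjsonParser` and the `StreamEvent` type are hypothetical names, and the event shapes simply mirror the two message types shown above.

```typescript
// Event shapes mirror the two message types above; `unknown` keeps the sources payload opaque.
type StreamEvent =
  | { type: "sources"; data: unknown }
  | { type: "text"; delta: string };

// Buffers partial lines across reads and returns only the events completed by this chunk.
function createNdjsonParser(): (chunk: Uint8Array) => StreamEvent[] {
  const decoder = new TextDecoder();
  let buffer = "";
  return (chunk) => {
    // stream: true keeps multi-byte characters that are split across chunks intact.
    buffer += decoder.decode(chunk, { stream: true });
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // the last piece is an incomplete line (or "")
    return lines
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as StreamEvent);
  };
}

// Simulate a network read that splits one JSON line in two.
const encoder = new TextEncoder();
const parse = createNdjsonParser();
const first = parse(encoder.encode('{"type":"sources","data":[]}\n{"type":"te'));
const second = parse(encoder.encode('xt","delta":"Hello"}\n'));
console.log(first.length, second.length); // 1 1
```

In a real client, each `value` returned by `reader.read()` would be passed to `parse`, and the resulting events dispatched on their `type` field.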
The Broader Pattern
The pgvector bug is a good example of a class of problems that's common in applied AI work: the integration layer between the model and your data infrastructure has its own failure modes that have nothing to do with the model. Debugging them requires treating each layer (the embedding generation, the vector store query, the retrieval pipeline) as independently testable components.
In production RAG systems, most failures happen in retrieval, not generation. The model does a reasonable job if given good context. The hard part is reliably getting it that context.

