RAG in Production: Retrieval, Chunking, Embeddings, and Evaluation

Introduction: RAG Is More Than a Vector Search Demo

Retrieval-Augmented Generation, usually shortened to RAG, is a common pattern for building AI applications that need access to external knowledge.

Instead of relying only on what a language model already knows, the system retrieves relevant information from documents, databases, or internal knowledge sources, then provides that context to the model before it generates an answer.

At demo level, RAG often looks simple:

Upload some documents.
Split them into chunks.
Embed them.
Store them in a vector database.
Ask questions over them.

That flow is useful for proving the idea, but it hides most of the engineering work required to make the system reliable.

In production, RAG is not just “put documents into a vector database”. It is a full retrieval system with data pipelines, indexing decisions, ranking logic, metadata filters, context construction, evaluation, monitoring, and failure handling.

The quality of the final answer depends on many steps before the model ever sees the prompt.

A good RAG system has to answer practical engineering questions:

Were the documents ingested correctly?
Are the chunks too small, too large, or missing important context?
Are the embeddings suitable for the domain?
Did retrieval return the right evidence?
Is the answer grounded in that evidence?
What happens when documents become stale, permissions change, retrieval fails, or latency becomes too high?

These are backend engineering problems as much as AI problems.

The model may generate the final response, but the system around it determines whether the response is useful, traceable, and safe to rely on.

This post breaks down RAG from a production engineering perspective: ingestion, chunking, embeddings, retrieval quality, context construction, evaluation, observability, latency, cost, and common failure cases.

The goal is not to make RAG sound magical. The goal is to show what has to be designed, measured, and maintained when RAG becomes part of a real application.

The Production RAG Pipeline

A production RAG system usually has two sides:

An offline ingestion pipeline.
An online query path.

The ingestion pipeline prepares knowledge for retrieval. It takes raw documents or records, cleans them, splits them into chunks, generates embeddings, and stores them in a search index.

The query path runs when a user asks a question. It embeds the query, retrieves relevant chunks, builds a context window, sends that context to the model, and returns an answer.

A simplified production RAG flow looks like this:

flowchart TD
    A[Documents and Data Sources] --> B[Ingestion Pipeline]
    B --> C[Cleaning and Preprocessing]
    C --> D[Chunking]
    D --> E[Embedding Generation]
    E --> F[Vector Database or Search Index]

    G[User Query] --> H[Query Embedding]
    H --> I[Retriever]
    F --> I
    I --> J[Context Construction]
    J --> K[LLM Response]
    K --> L[Logging]
    L --> M[Evaluation and Monitoring]
    M --> B

The important point is that each stage affects the final answer.

If ingestion drops tables, loses metadata, or parses documents incorrectly, retrieval quality suffers. If chunking breaks useful meaning across boundaries, the model may receive incomplete context. If embeddings are weak for the domain, similar-looking but irrelevant chunks may rank above the correct evidence. If context construction includes too much noisy text, the model may ignore the useful parts.

This is why RAG should be treated like a backend system, not just an AI feature.

The pipeline needs:

Clear data contracts.
Repeatable ingestion jobs.
Versioned indexes.
Permission handling.
Monitoring.
A way to debug individual answers.

For example, when a user receives a wrong answer, the useful debugging question is not only:

Did the model hallucinate?

A better set of questions is:

Was the correct document ingested?
Was the relevant section chunked properly?
Did retrieval return that chunk?
Was it included in the final context?
Did the model ignore it, misread it, or lack enough evidence?

Those questions separate retrieval failures from generation failures.

That separation matters because each failure needs a different fix. A better prompt will not solve missing documents. A larger model will not fix stale embeddings. A reranker will not help if the ingestion pipeline removed the key section during preprocessing.

A production RAG pipeline should therefore be designed with traceability.

For each answer, the system should be able to record:

User query.
Retrieved chunks.
Source documents.
Metadata filters.
Ranking scores.
Final prompt context.
Model output.
Latency.
User feedback.

Without that trace, improving RAG becomes guesswork.
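One way to make such a trace concrete is a small record type that is written once per answer. The sketch below is illustrative only: the `AnswerTrace` name and its field names are assumptions that mirror the list above, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class AnswerTrace:
    """Hypothetical per-answer trace record; field names are illustrative."""

    query: str
    retrieved_chunk_ids: list[str]
    source_documents: list[str]
    metadata_filters: dict[str, str]
    ranking_scores: dict[str, float]  # chunk_id -> score
    final_context: str
    model_output: str
    latency_ms: float
    user_feedback: Optional[str] = None  # often arrives later, if at all

    def to_json(self) -> str:
        # One JSON object per answer is easy to ship to a log pipeline.
        return json.dumps(asdict(self), sort_keys=True)


trace = AnswerTrace(
    query="How do I reset account access?",
    retrieved_chunk_ids=["auth_policy#04"],
    source_documents=["Authentication Policy"],
    metadata_filters={"status": "current"},
    ranking_scores={"auth_policy#04": 0.91},
    final_context="Source: Authentication Policy ...",
    model_output="Use the identity management portal.",
    latency_ms=2390.0,
)
print(trace.to_json())
```

Keeping the record flat and serialisable means the same object can feed logs, dashboards, and offline evaluation.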

The model is only one part of the pipeline. In many real systems, the reliability of RAG depends more on the quality of ingestion, retrieval, context assembly, and evaluation than on the model itself.

Ingestion and Preprocessing

RAG systems are only as reliable as the knowledge they index.

Before chunking, embeddings, or vector search matter, the system needs to ingest the right data in the right shape.

In a demo, ingestion might mean uploading a few PDFs into a vector database. In production, ingestion usually involves many different sources:

PDFs.
HTML pages.
Markdown files.
Internal documentation.
Support tickets.
Product specs.
Database records.
Transcripts.
Generated reports.

Each source has its own structure, formatting issues, update frequency, and access rules.

The first challenge is parsing.

Documents are not always clean text. PDFs may contain headers, footers, page numbers, tables, columns, images, and repeated boilerplate. HTML pages may include navigation menus, adverts, sidebars, and hidden content. Internal documents may contain outdated sections, duplicated pages, or inconsistent headings.

If this content is parsed poorly, the retrieval layer indexes noise. The model may later receive chunks that look relevant but are actually incomplete, duplicated, or stripped of important context.

Preprocessing is the step that turns raw content into retrievable knowledge.

This usually includes:

Removing irrelevant boilerplate.
Normalising whitespace and formatting.
Preserving headings and section boundaries.
Extracting tables where possible.
Attaching metadata.
Removing duplicates.
Detecting stale or archived content.
Applying access control rules.
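A few of these steps can be sketched with the standard library alone. This is a minimal illustration, assuming documents arrive as a list of page strings; the function names and the `min_fraction` threshold are hypothetical, and real parsers need far more care with tables and layout.

```python
import hashlib
import re
from collections import Counter


def normalise(text: str) -> str:
    """Collapse runs of spaces or tabs and squeeze long runs of blank lines."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    collapsed = re.sub(r"\n{3,}", "\n\n", "\n".join(lines))
    return collapsed.strip()


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Drop lines that recur on most pages (likely headers or footers)."""
    counts: Counter = Counter()
    for page in pages:
        # Count each distinct line once per page.
        counts.update({line.strip() for line in page.splitlines() if line.strip()})
    threshold = max(2, int(len(pages) * min_fraction))
    boilerplate = {line for line, n in counts.items() if n >= threshold}
    return [
        "\n".join(
            line for line in page.splitlines() if line.strip() not in boilerplate
        )
        for page in pages
    ]


def deduplicate(docs: list[str]) -> list[str]:
    """Keep the first copy of each exact duplicate, compared by content hash."""
    seen: set[str] = set()
    unique = []
    for doc in docs:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Exact hashing only catches byte-identical copies; near-duplicate detection would need a fuzzier comparison.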

The metadata is especially important.

A chunk of text is much more useful when the system also knows its source document, section title, created date, updated date, owner, document type, permissions, and version.

Without metadata, retrieval becomes a flat search over text. With metadata, the system can filter and rank results more intelligently.

For example, a query about an internal API should probably favour current engineering documentation over an old Slack export. A compliance question may need the latest approved policy, not a draft from two years ago. A customer-specific query should only retrieve documents that the user is authorised to see.

A simple internal document record might look like this:

{
  "document_id": "api_authentication_v4",
  "title": "Authentication API",
  "document_type": "engineering_docs",
  "owner": "platform-team",
  "version": "4.0",
  "updated_at": "2026-05-12",
  "status": "current",
  "permissions": ["engineering", "support"],
  "source_url": "internal://docs/api/authentication"
}

The chunk text matters, but the metadata around it is what allows the retrieval system to decide whether that chunk is current, relevant, and authorised.

This is where RAG starts to look like normal backend engineering.

The ingestion layer needs data contracts, validation, logging, retries, and failure handling. If a document fails to parse, the system should not silently ignore it. If an index update fails halfway through, the system needs a safe recovery path. If documents are updated frequently, the pipeline needs a strategy for refreshing embeddings and removing stale chunks.
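A data contract can start as a plain validation function that rejects records loudly instead of skipping them silently. The sketch below assumes records shaped like the example document record above; the required-field list and the allowed status values are assumptions for illustration.

```python
from datetime import date

# Assumed contract: these names mirror the example record above.
REQUIRED_FIELDS = ("document_id", "title", "updated_at", "status", "permissions")
ALLOWED_STATUSES = {"current", "draft", "archived"}  # hypothetical status values


def validate_record(record: dict) -> list[str]:
    """Return contract violations; an empty list means the record is acceptable."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if errors:
        return errors
    if record["status"] not in ALLOWED_STATUSES:
        errors.append(f"unknown status: {record['status']!r}")
    if not record["permissions"]:
        errors.append("permissions must not be empty")
    try:
        date.fromisoformat(record["updated_at"])
    except (TypeError, ValueError):
        errors.append(f"updated_at is not an ISO date: {record['updated_at']!r}")
    return errors


def split_valid(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate indexable records from rejected ones, keeping the reasons."""
    valid, rejected = [], []
    for record in records:
        errors = validate_record(record)
        if errors:
            rejected.append({"record": record, "errors": errors})
        else:
            valid.append(record)
    return valid, rejected
```

The rejected list is the important part: it gives the pipeline something to log, alert on, and retry.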

A common production mistake is treating ingestion as a one-time setup task.

In reality, ingestion is continuous. Documents change, permissions change, product behaviour changes, and old knowledge becomes unsafe to retrieve. A RAG system that does not handle freshness will slowly drift away from the real state of the business.
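One simple way to handle freshness is to store a content hash per document at index time and compare it with the source on each run. The following is a sketch under that assumption; `plan_refresh` and its input shapes are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(indexed: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Decide which documents to add, re-embed, or delete.

    indexed: document_id -> content hash recorded at the last index run
    current: document_id -> current source text
    """
    current_hashes = {doc_id: content_hash(text) for doc_id, text in current.items()}
    return {
        "add": sorted(d for d in current_hashes if d not in indexed),
        "re_embed": sorted(
            d for d in current_hashes if d in indexed and indexed[d] != current_hashes[d]
        ),
        # Documents that vanished from the source should not stay retrievable.
        "delete": sorted(d for d in indexed if d not in current_hashes),
    }
```

Only changed documents are re-embedded, which keeps refresh cost proportional to the amount of change rather than the size of the corpus.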

Good preprocessing does not guarantee good answers, but poor preprocessing almost guarantees bad ones.

If the system indexes messy or stale data, or data with no permission metadata attached, the model will eventually expose that weakness in the final response.

The retrieval layer can only search what the ingestion pipeline made searchable.

Production RAG begins with a simple principle:

Before improving the model, make sure the knowledge layer is clean, current, structured, and traceable.

Chunking Strategies and Trade-offs

After documents are ingested and cleaned, the next question is how to split them into retrievable units.

These units are usually called chunks.

A chunk is a section of text that gets embedded, stored, retrieved, and later passed into the model as context. The size and structure of chunks matter because retrieval happens at the chunk level. If the chunk is badly formed, the retrieval system may return incomplete, noisy, or misleading context.

At a high level, chunking has to balance two competing goals.

Small chunks improve precision.

They make it easier to retrieve a focused piece of information.

Larger chunks preserve context.

They reduce the chance that an important explanation is split across multiple pieces.

The problem is that production systems usually need both. A small chunk may contain the exact sentence needed to answer a question, but not enough surrounding context to interpret it. A large chunk may preserve the full meaning, but it may also contain irrelevant text that weakens retrieval and wastes tokens.

There is no universal best chunk size.

The right strategy depends on the document type, query patterns, model context window, latency budget, and how much structure exists in the source data.

Chunking strategy	How it works	Strength	Weakness	Best used for
Fixed-size chunks	Splits text by token or character count	Simple and predictable	Can break meaning across boundaries	Large, uniform documents
Section-aware chunks	Splits by headings, sections, or document structure	Preserves natural meaning	Depends on clean document structure	Docs, policies, reports
Semantic chunks	Splits based on topic or meaning	Better context boundaries	More complex and slower to implement	Knowledge-heavy content
Overlapping chunks	Repeats some text between neighbouring chunks	Reduces boundary loss	Increases storage and duplicate retrieval	Long explanatory documents
Parent-child chunks	Retrieves small chunks, then includes a larger parent section	Balances precision and context	Adds system complexity	Technical docs, contracts, research

Fixed-size chunking is often the easiest starting point. For example, a system may split text into 500-token chunks with 50 tokens of overlap. This is simple to implement and works reasonably well for many documents.

The downside is that it does not understand structure. It may split a paragraph, table, code example, or policy clause in the middle.

A simple fixed-size chunking function might look like this:

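from collections.abc import Iterator

def chunk_tokens(
    tokens: list[str], chunk_size: int = 500, overlap: int = 50
) -> Iterator[list[str]]:
    """Yield fixed-size windows that share `overlap` tokens with their neighbour."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    start = 0
    while start < len(tokens):
        end = start + chunk_size
        yield tokens[start:end]
        if end >= len(tokens):
            # Stop here; another pass would only repeat the overlap tail.
            break
        start = end - overlap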

This is useful as a baseline, but production systems often need more document-aware logic.

Section-aware chunking is usually better when documents have clear headings. Instead of cutting every 500 tokens, the system tries to preserve logical sections such as “Authentication”, “Refund Policy”, or “Deployment Steps”. This makes retrieved chunks easier for the model to use because the chunk already has a meaningful boundary.
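As a rough illustration of splitting on structure, the sketch below assumes the source text uses Markdown-style `#` headings. That is an assumption for the example; other formats would need their own heading detection, and the `chunk_by_section` name is hypothetical.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")  # assumes Markdown-style headings


def chunk_by_section(text: str) -> list[dict]:
    """Split text at heading lines, keeping each heading with its body."""
    chunks: list[dict] = []
    title = "(no heading)"
    body: list[str] = []

    def flush() -> None:
        # Emit the section collected so far, skipping empty ones.
        content = "\n".join(body).strip()
        if content:
            chunks.append({"section_title": title, "text": content})

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title = match.group(2).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks
```

A very long section would still need a fallback, such as fixed-size splitting within the section, so that no chunk exceeds the embedding model's input limit.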

Semantic chunking goes a step further. It tries to split content based on meaning rather than formatting. This can improve retrieval quality, but it also introduces more complexity. The system may need additional models, heuristics, or preprocessing steps to decide where one topic ends and another begins.

Overlap is commonly used to reduce boundary problems. Each chunk starts by repeating the last few tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.

This helps preserve continuity, but too much overlap creates duplicate content, increases storage cost, and can cause the retriever to return near-identical chunks.

Parent-child retrieval is a useful production pattern. The system embeds and retrieves smaller child chunks for precision, but when a child chunk is selected, it includes the larger parent section in the final context.

This helps when the exact match is small, but the model needs a wider explanation to answer properly.
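The expansion step can be sketched as a lookup from each child hit to its parent section. The `parent_id` and `chunk_id` fields and the `expand_to_parents` helper are hypothetical names for this example.

```python
def expand_to_parents(
    child_hits: list[dict], parents: dict[str, str], max_parents: int = 3
) -> list[dict]:
    """Swap ranked child hits for their parent sections, without repeating a parent."""
    seen: set[str] = set()
    expanded = []
    for hit in child_hits:  # assumed to be sorted best-first
        parent_id = hit["parent_id"]
        if parent_id in seen or parent_id not in parents:
            continue
        seen.add(parent_id)
        expanded.append(
            {
                "parent_id": parent_id,
                "matched_child": hit["chunk_id"],
                "text": parents[parent_id],
            }
        )
        if len(expanded) == max_parents:
            break
    return expanded
```

Deduplicating by parent matters here: several child chunks from one section often rank together, and including the same parent twice only wastes tokens.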

Chunking also affects evaluation. If the correct information exists in the source document but is split badly, retrieval may appear to fail even though the data was technically indexed. In that case, changing the model or prompt will not fix the problem. The chunking strategy needs to change.

A good production approach is to test chunking against real user questions.

For each question, check whether the system retrieves the evidence a human would expect:

If the answer needs three neighbouring chunks to make sense, the chunks may be too small.
If every retrieved chunk contains too much irrelevant text, the chunks may be too large.
If source headings are missing, the chunks may be losing useful structure.
If duplicates dominate the top results, the overlap may be too aggressive.

Chunking should not be treated as a one-time decision. As documents, queries, and use cases change, the chunking strategy may need to evolve.

In production RAG, chunking is part of the retrieval design, not just a preprocessing detail.

Embeddings, Vector Search, and Retrieval Quality

Once documents have been cleaned and chunked, each chunk needs to be made searchable.

In most RAG systems, this is done using embeddings.

An embedding is a numerical representation of text. Instead of storing only the words in a chunk, the system converts the chunk into a vector that captures its meaning. A user query can also be converted into a vector. The retrieval system then looks for chunks whose vectors are close to the query vector.

This is useful because users do not always ask questions using the exact words found in the document.

A user might ask:

How do I reset my account access?

The relevant document might say:

Users can recover login permissions through the identity management portal.

A keyword search may miss that match. An embedding search has a better chance of finding it because the meanings are related, even if the wording is different.

The common production flow looks like this:

Embed the user query.
Search the vector index for similar chunks.
Apply metadata filters.
Rerank the candidate chunks.
Select the best evidence for the model context.

The basic version of this is usually called vector search.

The system compares the query embedding with stored chunk embeddings and returns the nearest matches. In larger systems, this is often done using Approximate Nearest Neighbour search, or ANN, which finds close matches efficiently without comparing against every vector one by one, at the cost of occasionally missing a true nearest neighbour.
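For intuition, the exact (non-approximate) version of this search is only a few lines. The sketch below uses cosine similarity over plain Python lists; it compares the query against every stored vector, which is precisely the cost that ANN indexes avoid. The function names are illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def exact_top_k(
    query: list[float], index: dict[str, list[float]], k: int = 3
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and return the k closest."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

A brute-force scan like this is also a useful baseline when checking how much recall an ANN index gives up.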

But vector similarity is not the same as answer quality.

A chunk can be semantically similar to the question but still not be the right evidence. It might be outdated, too generic, duplicated, from the wrong product version, or missing the specific detail needed to answer.

This is why production retrieval needs more than top_k=5.

A stronger retrieval system usually combines several techniques.

Metadata filtering limits the search space before or during retrieval. For example, the system may filter by customer, document type, product version, region, permission group, publication status, or update date. This helps prevent the model from seeing irrelevant or unauthorised content.
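A minimal in-memory version of such a filter might look like the following. It assumes each chunk carries `status` and `permissions` metadata like the document record shown earlier; most vector databases apply equivalent filters inside the search itself, so this is only a sketch of the logic.

```python
def is_retrievable(
    chunk: dict, user_groups: set[str], required_status: str = "current"
) -> bool:
    """True if the chunk has the required status and shares a group with the user."""
    if chunk.get("status") != required_status:
        return False
    # Missing permissions are treated as "nobody may see this" (fail closed).
    return bool(user_groups & set(chunk.get("permissions", [])))


def filter_chunks(chunks: list[dict], user_groups: set[str]) -> list[dict]:
    return [chunk for chunk in chunks if is_retrievable(chunk, user_groups)]
```

Failing closed is a deliberate choice in this sketch: a chunk with no permission metadata is excluded rather than shown to everyone.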

Hybrid search combines vector search with keyword search. This is useful when exact terms matter, such as API names, error codes, legal clauses, product IDs, database fields, or configuration keys.

Vector search is good at meaning. Keyword search is good at exact matching. Production systems often need both.
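The two result lists still have to be merged into one ranking. Reciprocal rank fusion (RRF) is one widely used way to do that, shown here as an example rather than as the method this article prescribes. It only needs the rank positions from each retriever, not comparable scores; `k = 60` is a conventional default, not a tuned value.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge several best-first ID lists into one ranking.

    Each list contributes 1 / (k + rank) for every ID it contains.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


vector_hits = ["a", "b", "c"]   # hypothetical vector-search ranking
keyword_hits = ["b", "d", "a"]  # hypothetical keyword-search ranking
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print([chunk_id for chunk_id, _ in fused])  # ['b', 'a', 'd', 'c']
```

Chunks that appear in both lists rise to the top, which is usually the behaviour wanted when exact terms and meaning both matter.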

Reranking adds another scoring step after initial retrieval. The vector database may return the top 20 or 50 candidate chunks, then a reranker evaluates which chunks are most relevant to the actual query.

This can improve precision because the first retrieval stage is optimised for speed, while reranking can focus more carefully on relevance.

A simplified retrieval flow might look like this:

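# `embeddings`, `vector_index`, and `reranker` stand for application-provided
# async clients; the method names are illustrative, not a specific library's API.
async def retrieve_context(query: str, user: dict) -> list[dict]:
    query_embedding = await embeddings.embed(query)

    # Over-fetch candidates, with metadata filters applied inside the search.
    candidates = await vector_index.search(
        embedding=query_embedding,
        top_k=30,
        filters={
            "status": "current",
            "permissions": {"contains": user["role"]},
        },
    )

    # The reranker is assumed to return the chunks sorted best-first.
    reranked = await reranker.score(query=query, chunks=candidates)

    return reranked[:5]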

This example shows the basic pattern: retrieve more candidates than needed, with metadata constraints applied during the search, then rerank before choosing the final context.

Retrieval quality usually fails in predictable ways.

The system may retrieve chunks that are related but not specific enough. It may return multiple near-duplicate chunks from the same document. It may miss the correct evidence because the chunk was badly split. It may rank old documentation above current documentation. It may retrieve the right document but the wrong section. Or it may return useful evidence, but not enough of it for the model to answer safely.

These failures are important because they are not generation problems. A better prompt may not fix them. The issue is often in the retrieval layer.

For example:

If the model gives a vague answer, the root cause may be that the retriever only returned general overview chunks.
If the model gives an outdated answer, the index may contain stale embeddings.
If the model confidently answers from the wrong policy, metadata filtering may be missing or too weak.

This is why retrieval should be observable.

For every answer, the system should be able to show:

Which chunks were retrieved.
Which documents they came from.
What filters were applied.
What scores were assigned.
Which chunks were included in the final context.
Whether the user accepted or rejected the answer.

Without this trace, teams end up guessing whether the problem came from the model, the retriever, the index, or the data pipeline.

Embeddings are powerful, but they are not a complete retrieval strategy.

Production RAG needs embeddings, filters, ranking, reranking, freshness controls, permission checks, and feedback loops working together.

The goal is not just to retrieve text that looks similar. The goal is to retrieve the right evidence for the user’s question, from the right source, at the right time, with enough traceability to debug when it fails.

Context Construction, Evaluation, and Observability

Retrieval does not end when the system finds a set of similar chunks. Those chunks still need to be turned into useful model context.

Context construction is the process of deciding what retrieved information gets passed to the language model, in what order, with what metadata, and under what constraints.

This step is easy to underestimate, but it has a large effect on answer quality.

A simple RAG system might take the top five retrieved chunks and paste them into the prompt. That can work for demos, but production systems usually need more control.

The system may need to:

Remove duplicate or near-duplicate chunks.
Prioritise newer or more authoritative sources.
Group chunks from the same document.
Preserve source titles and section headings.
Include citations or document references.
Respect the model’s token budget.
Avoid mixing unrelated evidence.
Separate retrieved context from system instructions and user input.

The token budget matters because models have a limited context window. Even when the window is large, filling it with noisy retrieval results can reduce answer quality.

More context is not always better.

The goal is to provide enough evidence to answer the question without burying the model in irrelevant text.
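A token budget can be enforced with a simple greedy pass over the ranked chunks. The sketch below is an assumption-laden illustration: it estimates tokens by counting whitespace-separated words, whereas a real system would use the model's own tokenizer, and it only removes exact repeats after whitespace and case normalisation.

```python
def estimate_tokens(text: str) -> int:
    # Rough stand-in: whitespace word count, not a real tokenizer.
    return len(text.split())


def select_within_budget(chunks: list[dict], max_tokens: int) -> list[dict]:
    """Greedily keep best-first chunks that fit the budget, skipping exact repeats."""
    selected = []
    seen_texts: set[str] = set()
    used = 0
    for chunk in chunks:  # assumed to be sorted best-first
        normalised = " ".join(chunk["text"].split()).lower()
        if normalised in seen_texts:
            continue
        cost = estimate_tokens(chunk["text"])
        if used + cost > max_tokens:
            continue  # a smaller, lower-ranked chunk may still fit
        seen_texts.add(normalised)
        selected.append(chunk)
        used += cost
    return selected
```

Whether to skip an oversized chunk and keep looking, as here, or to stop at the first chunk that does not fit is a design choice; stopping early keeps the context strictly in rank order.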

Ordering also matters. If the most relevant chunk appears after several weaker chunks, the model may anchor on the wrong information. If retrieved context mixes old and new policies without clear timestamps, the model may combine them into an answer that sounds reasonable but is not accurate.

A useful production pattern is to treat context as a structured object before it becomes prompt text.

For example, each retrieved chunk can carry its source document, section title, update date, access level, retrieval score, and reranking score. The final prompt is then built from structured evidence, not from raw text concatenation.

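def build_context(chunks: list[dict], max_chunks: int = 5) -> str:
    """Render the top chunks as labelled evidence blocks for the prompt."""
    # Chunks are assumed to arrive sorted best-first, for example after reranking.
    selected = chunks[:max_chunks]

    sections = []
    for chunk in selected:
        # Keep source metadata next to the text so the answer can cite it.
        sections.append(
            "\n".join([
                f"Source: {chunk['document_title']}",
                f"Section: {chunk['section_title']}",
                f"Updated: {chunk['updated_at']}",
                f"Chunk ID: {chunk['chunk_id']}",
                chunk["text"],
            ])
        )

    return "\n\n---\n\n".join(sections)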

This also makes evaluation easier.

RAG evaluation should not only ask whether the final answer sounds good. It should check whether the system retrieved the right evidence and whether the answer stayed grounded in that evidence.

Evaluation area	What to check
Retrieval recall	Did the system retrieve the documents or chunks needed to answer correctly?
Retrieval precision	Were the retrieved chunks actually relevant to the query?
Groundedness	Is the answer supported by the retrieved context?
Faithfulness	Did the model avoid adding claims that were not in the evidence?
Freshness	Was the answer based on current documents rather than stale content?
Latency	Did query embedding, retrieval, reranking, and generation meet the required response time?
Cost	Were embedding, search, reranking, and model costs acceptable for the use case?
Debuggability	Can engineers trace the answer back to the retrieved sources and pipeline steps?

This is where production RAG becomes an iterative engineering loop.

Teams need a set of test questions, expected source documents, and expected answer characteristics. This is sometimes called a golden dataset: a small but trusted evaluation set used to test whether changes improve or break the system.

For example, if a team changes chunk size, embedding model, reranker, or metadata filters, they should be able to test whether retrieval recall improved or declined.
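Retrieval recall and precision over a golden dataset can be computed with very little code. The sketch below assumes each golden case lists the chunk IDs a correct answer needs, and that `retrieve` is any callable returning ranked chunk IDs for a question; both shapes are hypothetical.

```python
from typing import Callable


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return len(set(top) & relevant) / len(top)


def evaluate(
    golden: list[dict], retrieve: Callable[[str], list[str]], k: int = 5
) -> dict[str, float]:
    """Average recall@k and precision@k over the golden dataset."""
    recalls, precisions = [], []
    for case in golden:
        retrieved = retrieve(case["question"])
        relevant = set(case["expected_chunk_ids"])
        recalls.append(recall_at_k(retrieved, relevant, k))
        precisions.append(precision_at_k(retrieved, relevant, k))
    n = len(golden) or 1  # avoid dividing by zero on an empty dataset
    return {"recall@k": sum(recalls) / n, "precision@k": sum(precisions) / n}
```

Running the same golden set before and after a change, such as a new chunk size, turns "this answer looks better" into a comparison of two numbers.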

Without evaluation, RAG development becomes subjective. One answer may look better, but the system may be worse across the wider set of queries.

Observability is the production side of the same idea. A RAG system should log enough information to debug real user failures.

For each request, useful logs include:

Original user query.
Rewritten or expanded query, if used.
Retrieved chunk IDs.
Source documents.
Metadata filters.
Retrieval scores.
Reranking scores.
Final context passed to the model.
Model response.
Latency by stage.
Cost by stage.
User feedback, if available.

A compact trace event might look like this:

{
  "request_id": "req_123",
  "query": "How do I reset account access?",
  "retrieved_chunks": [
    {
      "chunk_id": "auth_policy#04",
      "document": "Authentication Policy",
      "retrieval_score": 0.82,
      "rerank_score": 0.91
    }
  ],
  "filters": {
    "status": "current",
    "permission_group": "support"
  },
  "latency_ms": {
    "embedding": 70,
    "retrieval": 180,
    "reranking": 240,
    "generation": 1900
  },
  "answer_validation": {
    "grounded": true,
    "citations_present": true
  }
}

This trace helps separate different failure types.

If the correct document was never ingested, the problem is in the data pipeline. If it was ingested but not retrieved, the problem may be chunking, embeddings, filters, or ranking. If it was retrieved but excluded from the final prompt, the issue is context construction. If it was included and the model still answered incorrectly, the issue may be generation, prompting, ambiguity, or insufficient evidence.
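That triage order can be written down as a small helper for debugging a single wrong answer. This is a sketch: `locate_failure` and its inputs are hypothetical, and it assumes the engineer already knows which chunk should have supported the answer.

```python
def locate_failure(
    expected_chunk_id: str,
    indexed_ids: set[str],
    retrieved_ids: list[str],
    context_ids: list[str],
) -> str:
    """Name the earliest pipeline stage at which the expected evidence was lost."""
    if expected_chunk_id not in indexed_ids:
        return "data pipeline"
    if expected_chunk_id not in retrieved_ids:
        return "retrieval (chunking, embeddings, filters, or ranking)"
    if expected_chunk_id not in context_ids:
        return "context construction"
    # The evidence reached the model, so look at the generation step.
    return "generation (prompting, ambiguity, or insufficient evidence)"
```

The checks run in pipeline order, so the first stage that lost the evidence is the one reported.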

That distinction matters because each failure requires a different fix.

Production RAG also needs monitoring for operational issues. Indexes can become stale. Embedding jobs can fail. Document permissions can change. Retrieval latency can increase as the corpus grows. Costs can rise if reranking or model calls are overused. User behaviour can shift, causing the system to receive questions it was not designed to answer.

A reliable RAG system should therefore be monitored like any other backend service. It needs metrics, alerts, dashboards, regression tests, and clear ownership of failures.
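Per-stage latency is one of the simplest metrics to collect. The sketch below uses a context manager to time each stage of a request; the `StageTimer` class is a hypothetical helper, and the `time.sleep` calls stand in for real retrieval and generation work.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage, in milliseconds."""

    def __init__(self) -> None:
        self.latency_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the stage raised, so failures are visible too.
            elapsed = (time.perf_counter() - start) * 1000
            self.latency_ms[name] = self.latency_ms.get(name, 0.0) + elapsed


timer = StageTimer()
with timer.stage("retrieval"):
    time.sleep(0.01)  # stand-in for a real retrieval call
with timer.stage("generation"):
    time.sleep(0.02)  # stand-in for a real model call
print(timer.latency_ms)
```

The resulting dictionary can be attached to the request's trace event and exported as metrics for dashboards and alerts.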

The final answer is what the user sees, but the retrieval trace is what engineers need.

Without evaluation and observability, it is difficult to know whether a RAG system is improving or simply producing confident-looking answers.

Good production RAG is not built by trusting the model. It is built by measuring the full path from data ingestion to retrieved evidence to final response.

Closing Thoughts: RAG Is Backend Engineering With an AI Interface

RAG is often presented as a simple pattern: retrieve some documents, add them to the prompt, and let the model answer.

That explanation is useful at the start, but it misses the reality of production systems.

In production, RAG is a pipeline. It depends on clean ingestion, sensible chunking, suitable embeddings, reliable retrieval, careful context construction, evaluation, monitoring, and failure handling.

Each stage can improve the final answer, and each stage can also break it.

The model is important, but it is not the whole system. A stronger model will not fix missing documents, stale indexes, poor metadata, broken chunking, weak permissions, or noisy context.

Many RAG failures are system design failures before they are model failures.

This is why backend engineering matters so much in AI applications. Production RAG requires the same discipline as other reliable services:

Data validation.
Versioning.
Retries.
Observability.
Access control.
Latency budgets.
Cost awareness.
Clear debugging paths.

A good RAG system should be able to explain not only what answer it produced, but which evidence it used, where that evidence came from, how fresh it was, and why it was selected.

Without that traceability, teams cannot improve the system with confidence.

The practical goal is not to make RAG feel magical. The goal is to make retrieval dependable enough that the model receives the right context, at the right time, under the right constraints.

For backend engineers and AI engineers, that is the real work: building the system around the model so that answers are grounded, measurable, debuggable, and useful in production.