Evaluating LLM Applications: Why Unit Tests Are Not Enough

Introduction: LLM Testing Is Not Just Input and Output Matching

Traditional software tests work well when the expected behaviour is deterministic. If a function receives the same input, it should return the same output. If an API receives invalid data, it should reject the request. If a database transaction fails, the system should roll back safely.

LLM applications are different.

A language model can receive the same prompt twice and produce slightly different answers. Two answers may use different wording but both be acceptable. An answer can be fluent and confident while still being wrong. In a Retrieval-Augmented Generation system, usually called RAG, the output may depend not only on the prompt, but also on retrieved documents, chunking, ranking, metadata filters, and how context is assembled before the model sees it.

This does not mean unit tests are useless. They are still essential for the deterministic parts of the application:

Request validation.
Authentication.
Database writes.
Queue handling.
Tool wrappers.
Retries.
Timeouts.
API contracts.

But unit tests are not enough to answer questions like:

Did the model answer correctly?
Did it use the right evidence?
Did retrieval return the right context?
Did the answer stay grounded in the source material?
Did a prompt change make quality worse?
Did a model upgrade increase cost or latency?
Did the system fail safely when it lacked enough information?

These are evaluation problems, not simple unit test problems.

Evaluating LLM applications means testing the behaviour of the whole system. That includes model outputs, prompts, retrieved context, safety constraints, user experience, latency, cost, and production failure cases.

The goal is not just to check whether the code runs. The goal is to check whether the application behaves reliably enough for the use case.

This post explains LLM evaluation from a practical engineering perspective. It covers where unit tests still help, why model outputs are harder to test, how golden datasets and criteria-based evaluations work, where human review and LLM-as-judge fit, and how teams monitor quality, regressions, latency, and cost in production.

The main idea is simple:

Unit tests protect the code, but evaluations protect the behaviour of the LLM system.

Why Unit Tests Are Still Useful but Incomplete

Unit tests still matter in LLM applications.

Using unit tests is not the mistake. The mistake is expecting them to evaluate everything.

A typical LLM application still has plenty of deterministic software around the model. It may have API endpoints, authentication, request validation, database writes, message queues, caching, retries, timeouts, rate limits, and tool integrations.

These parts should be tested in the same way as any other backend service.

For example, unit tests are useful for checking that:

Invalid requests are rejected.
Required fields are validated.
User permissions are enforced.
Tool wrappers format inputs correctly.
Retries stop after the configured limit.
Database updates are atomic.
API responses follow the expected schema.
Fallback logic runs when a dependency fails.

These are clear software behaviours. Given the same input, the system should behave in the same way. Unit tests are good at protecting this kind of logic.

A simple deterministic unit test might look like this:

import pytest

from app.validation import validate_chat_request


def test_chat_request_requires_user_message():
    payload = {
        "user_id": "user_123",
        "message": "",
    }

    with pytest.raises(ValueError, match="message is required"):
        validate_chat_request(payload)

This test is useful because the behaviour should be exact. An empty message should always be rejected.

Unit tests are also useful for tool wrappers:

from app.tools import build_search_payload


def test_search_payload_includes_metadata_filters():
    payload = build_search_payload(
        query="reset account access",
        tenant_id="tenant_abc",
        document_type="policy",
    )

    assert payload == {
        "query": "reset account access",
        "filters": {
            "tenant_id": "tenant_abc",
            "document_type": "policy",
        },
    }

This protects deterministic application logic. It does not evaluate whether the final LLM answer is good.

The limitation appears when the test needs to judge the quality of a generated answer.

For example, imagine a support assistant is asked:

How do I reset my account access?

A traditional unit test might check that the response is a string, that the API returns a 200 status code, or that the model call was made successfully. Those checks are useful, but they do not prove that the answer is correct.

The answer may be too vague, based on the wrong policy, missing an important security step, or unsupported by the retrieved context.

That is the gap between software correctness and model behaviour quality.

Test type	Best for	Example	Limitation
Unit tests	Deterministic code	Validate request parsing or tool wrapper behaviour	Cannot judge answer quality
Integration tests	Connected services	Check API, database, queue, and model call flow	May not evaluate usefulness
Golden dataset evaluations	Behaviour consistency	Run fixed test cases against expected criteria	Needs maintenance
Human review	Domain judgement	Review complex or high-risk answers	Slow and expensive
LLM-as-judge	Scalable quality checks	Score groundedness or completeness	Can be biased or inconsistent
Production monitoring	Real-world behaviour	Track latency, cost, failures, and user feedback	Detects issues after deployment

This is why LLM applications need a layered testing strategy.

Unit tests should protect the application infrastructure. Integration tests should check that the full request path works. Evaluation datasets should test whether the system gives good answers across realistic scenarios. Production monitoring should detect issues that only appear with real users, real data, and real traffic.

The key point is that LLM evaluation does not replace traditional testing. It extends it.

A reliable LLM application still needs normal backend tests. But those tests must be combined with evaluation methods that measure things like correctness, groundedness, safety, retrieval quality, latency, and cost.

In other words, unit tests can tell you whether the system executed correctly. They cannot always tell you whether the answer was good.

Why LLM Outputs Are Harder to Test

LLM outputs are harder to test because they are not always binary.

In normal backend systems, a test often has a clear expected result. A function should return a specific value. An endpoint should return a specific status code. A database write should either succeed or fail.

With LLM applications, the output is usually language. That means there may be many acceptable answers, and they may all look different.

For example, if a user asks:

Can I cancel my subscription after the renewal date?

There may not be one exact sentence the system must return.

A good answer might explain the policy, mention exceptions, include the relevant date rules, and suggest the next action. Another good answer might use different wording but still be correct. Exact string matching becomes too strict because it fails valid answers that are phrased differently.

At the same time, loose checks can be too weak. A response may sound polished while missing the most important detail. It may use confident wording while relying on the wrong source. It may answer the general topic but not the user’s specific question.

This creates a testing problem:

LLM quality is not only about whether the response exists. It is about whether the response is useful, correct, grounded, safe, and appropriate for the context.
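To make the contrast concrete, here is a minimal sketch of three checks applied to the subscription question. The 14-day refund policy, the expected sentence, and the function names are all invented for illustration, and keyword matching is only a crude stand-in for a real criteria-based evaluator.

```python
# Hypothetical reference answer; the 14-day policy is invented for this example.
EXPECTED = "You can cancel within 14 days of renewal for a full refund."


def exact_match(answer: str) -> bool:
    # Too strict: any rewording fails, even when the content is right.
    return answer == EXPECTED


def non_empty(answer: str) -> bool:
    # Too weak: any text at all passes, including a vague non-answer.
    return bool(answer.strip())


def mentions_key_facts(answer: str) -> bool:
    # Crude criteria check: the answer must mention the window and the refund.
    text = answer.lower()
    return "14 days" in text and "refund" in text


paraphrase = "Yes. If you cancel in the 14 days after renewal, you get a full refund."
vague = "Cancellation depends on your plan. Please check your account settings."

for name, answer in [("paraphrase", paraphrase), ("vague", vague)]:
    print(name, exact_match(answer), non_empty(answer), mentions_key_facts(answer))
```

The correct paraphrase fails the exact match, the vague answer passes the non-empty check, and only the criteria-style check separates the two.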

There are several reasons this is difficult.

First, LLM outputs are probabilistic. The same input can produce slightly different responses depending on the model, temperature, prompt, system instructions, retrieved context, and provider behaviour. This does not mean the system is uncontrollable, but it does mean tests should not depend on one exact wording.

Second, correctness can depend on external context. In a RAG system, the answer depends on what documents were retrieved, how they were ranked, and which chunks were included in the final prompt. If retrieval changes, the answer may change even if the user prompt stays the same.

Third, quality is often multi-dimensional. An answer can be factually correct but too vague. It can be complete but too slow. It can be helpful but unsafe. It can be grounded in evidence but written in a tone that does not fit the product.

Fourth, LLMs can fail in fluent ways. Traditional software failures are often visible: an exception, a timeout, a failed assertion, or a bad status code. LLM failures can look successful at the API level while still being wrong.

The request returns 200 OK, the output is readable, and the user interface looks fine, but the answer may be unsupported or misleading.

Finally, small changes can create regressions. A prompt edit, model upgrade, embedding model change, chunking adjustment, or reranker update can improve one group of examples while making another group worse.

This makes evaluation an ongoing engineering process, not a one-time test.

That is why LLM applications need tests that reflect behaviour, not just execution. Engineers need to check whether the system answers the right question, uses the right evidence, refuses when it should, stays within product constraints, and performs within acceptable latency and cost limits.

The challenge is not that LLMs cannot be tested. The challenge is that they need to be tested differently.
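Because the same prompt can produce different wording on each run, one option is to sample several responses and measure how often they satisfy a criterion, rather than asserting on a single output. The sketch below assumes a `generate` callable that wraps the model call. `make_fake_generate` is a hypothetical stand-in that cycles through canned answers so the example runs without a model.

```python
import itertools
from typing import Callable


def pass_rate(
    generate: Callable[[str], str],
    check: Callable[[str], bool],
    prompt: str,
    runs: int = 5,
) -> float:
    """Call generate(prompt) `runs` times and return the fraction that pass check."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if check(generate(prompt)))
    return passed / runs


def make_fake_generate() -> Callable[[str], str]:
    """Hypothetical stand-in for a model call: cycles through canned answers."""
    answers = itertools.cycle([
        "Use the identity portal to reset your password.",
        "Go to the identity portal and follow the reset flow.",
        "Please contact your manager.",
    ])
    return lambda prompt: next(answers)


def mentions_portal(answer: str) -> bool:
    # Criterion under test: the answer must point the user to the identity portal.
    return "identity portal" in answer.lower()


rate = pass_rate(make_fake_generate(), mentions_portal, "How do I reset my account access?", runs=3)
print(f"pass rate: {rate:.2f}")
```

A team might then require a minimum pass rate per case instead of a single pass or fail. The right threshold and number of runs depend on the use case and the evaluation budget.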

Golden Datasets and Criteria-Based Evaluation

A useful starting point for LLM evaluation is a golden dataset.

A golden dataset is a curated set of test cases used to evaluate whether the application behaves correctly across important scenarios. Each test case usually includes the user input, the expected behaviour, any required facts, and the criteria used to judge the response.

For a normal unit test, the expected output might be exact:

assert add(2, 3) == 5

For an LLM application, that style is usually too rigid. A good answer may be phrased in several valid ways. Instead of testing exact wording, engineers often test against criteria.

For example, a support assistant test case might define:

User question: “How do I reset my account access?”
Required facts: mention the identity portal, password reset flow, and multi-factor authentication.
Forbidden behaviour: do not suggest sharing passwords or bypassing security checks.
Quality criteria: answer should be clear, correct, grounded, and concise.

The evaluation is not asking:

Did the model produce this exact sentence?

It is asking:

Did the response satisfy the expected behaviour?

This makes evaluation more realistic. LLM applications are usually judged by whether they solve the user’s problem, not whether they produce identical text every time.

A golden dataset can be stored as structured test cases:

- id: account_reset_001
  user_question: "How do I reset my account access?"
  required_facts:
    - "Use the identity portal"
    - "Follow the password reset flow"
    - "Complete multi-factor authentication"
  forbidden_behaviour:
    - "Do not suggest sharing passwords"
    - "Do not suggest bypassing security checks"
  criteria:
    correctness: "Answer accurately explains the reset process"
    groundedness: "Answer is supported by retrieved policy context"
    conciseness: "Answer is clear and not overly long"
    safety: "Answer does not weaken account security"
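A test case like this can drive a cheap deterministic pre-check before any judge or reviewer is involved. The sketch below is one possible shape, not a prescribed format: `GoldenCase`, `required_keywords`, and `forbidden_keywords` are hypothetical names, and the keyword lists are a simplification of the free-text facts in the test case above. Substring matching misses paraphrases, so a failure here is a prompt for closer review rather than a verdict.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    # Hypothetical structure; keywords approximate the required facts.
    case_id: str
    user_question: str
    required_keywords: list[str] = field(default_factory=list)
    forbidden_keywords: list[str] = field(default_factory=list)


def precheck(case: GoldenCase, answer: str) -> dict:
    """Report required keywords that are missing and forbidden ones that appear."""
    text = answer.lower()
    missing = [kw for kw in case.required_keywords if kw.lower() not in text]
    violations = [kw for kw in case.forbidden_keywords if kw.lower() in text]
    return {
        "case_id": case.case_id,
        "missing": missing,
        "violations": violations,
        "passed": not missing and not violations,
    }


case = GoldenCase(
    case_id="account_reset_001",
    user_question="How do I reset my account access?",
    required_keywords=["identity portal", "password reset", "multi-factor"],
    forbidden_keywords=["share your password"],
)

answer = (
    "Open the identity portal, start the password reset flow, "
    "and complete multi-factor authentication."
)
print(precheck(case, answer))
```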

A golden dataset should include common cases, edge cases, and failure cases.

Common cases test the normal user journey. Edge cases test ambiguity, missing context, unusual wording, or conflicting information. Failure cases test whether the system refuses safely, asks for clarification, or avoids unsupported claims.

For RAG systems, the dataset should also include expected source documents or chunks. This allows engineers to evaluate retrieval separately from answer generation. If the final answer is wrong, the team can check whether the correct evidence was retrieved in the first place.
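One common way to score retrieval on its own is recall at k: the fraction of expected sources that appear in the top k retrieved results. The sketch below assumes each chunk has a stable identifier. The ID format shown is made up for the example.

```python
def recall_at_k(expected_ids: set[str], retrieved_ids: list[str], k: int) -> float:
    """Fraction of expected sources found in the top-k retrieved results."""
    if not expected_ids:
        raise ValueError("expected_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(expected_ids & top_k) / len(expected_ids)


# Hypothetical chunk IDs in ranked order, best match first.
expected = {"policy_12#3", "policy_12#4"}
retrieved = ["faq_2#1", "policy_12#3", "blog_9#2", "policy_12#4"]

print(recall_at_k(expected, retrieved, k=3))
print(recall_at_k(expected, retrieved, k=4))
```

If recall is low for a failing case, the answer generator never saw the right evidence, so the investigation should start with retrieval rather than the prompt.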

A simple LLM evaluation pipeline looks like this:

flowchart TD
    A[Test Cases] --> B[Golden Dataset]
    B --> C[Run Application Version]
    C --> D[Collect Model Outputs]
    D --> E[Evaluate Against Rubric]
    E --> F[Compare to Previous Version]
    F --> G{Deployment Decision}
    G -->|Pass| H[Deploy]
    G -->|Review| I[Human Review]
    G -->|Fail| J[Block Change]

The most important part is the rubric. A rubric is a scoring guide that defines what good behaviour means. Without a rubric, evaluation becomes subjective and inconsistent.

Useful criteria might include:

Evaluation criterion	What it checks
Correctness	Does the answer accurately address the user’s request?
Groundedness	Is the answer supported by the provided or retrieved evidence?
Completeness	Does it include the key details needed to be useful?
Conciseness	Does it avoid unnecessary or distracting information?
Safety	Does it avoid harmful, unauthorised, or policy-breaking content?
Refusal quality	Does it refuse or ask for clarification when the system lacks enough information?
Tone	Does the response fit the product and user context?

The scores do not have to be complicated. Some teams use pass/fail. Others use a 1–5 scale. The important thing is consistency.

If a prompt change improves tone but reduces correctness, the evaluation should reveal that trade-off before deployment.

A minimal evaluation result might look like this:

{
  "case_id": "account_reset_001",
  "scores": {
    "correctness": 5,
    "groundedness": 5,
    "completeness": 4,
    "safety": 5,
    "conciseness": 4
  },
  "passed": true,
  "notes": "Answer included all required security steps and stayed grounded in the policy context."
}
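The `passed` flag is easier to keep consistent when it is derived from the scores by an explicit rule instead of being set by hand. A minimal sketch, assuming a 1 to 5 scale and hypothetical minimum thresholds that each team would choose for itself:

```python
# Hypothetical minimum scores on the 1-5 scale; tune these per product and risk level.
MIN_SCORES = {"correctness": 4, "groundedness": 4, "safety": 5}


def passes(scores: dict[str, int], minimums: dict[str, int] = MIN_SCORES) -> bool:
    """A case passes only if every gated metric meets its minimum.

    A metric missing from `scores` counts as a failure rather than a pass.
    """
    return all(scores.get(metric, 0) >= floor for metric, floor in minimums.items())


result = {
    "case_id": "account_reset_001",
    "scores": {
        "correctness": 5,
        "groundedness": 5,
        "completeness": 4,
        "safety": 5,
        "conciseness": 4,
    },
}
result["passed"] = passes(result["scores"])
print(result["passed"])
```

Gating only on some metrics is a design choice. Here a slightly long answer can still pass, but an answer that weakens safety cannot.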

Golden datasets also need maintenance. They should evolve as the product changes, new failure modes appear, and real users ask questions the team did not expect. A stale evaluation set can create false confidence because the system may perform well on old examples while failing on current usage.

The practical goal is not to create a perfect benchmark. The goal is to create a repeatable way to compare versions of the application.

When engineers change a prompt, model, retriever, embedding model, or tool-calling workflow, they should be able to ask:

Did this change improve behaviour, make it worse, or move the failure somewhere else?

That is what golden datasets and criteria-based evaluation provide: a structured way to measure LLM behaviour instead of relying on vibes, demos, or isolated examples.

Human Review and LLM-as-Judge

Not every LLM output can be evaluated automatically with simple rules. Some answers require judgement.

A response may be technically correct but unclear. It may include the right facts but miss the user’s intent. It may be mostly grounded but include one unsupported claim. These are quality problems, and they often need review methods that go beyond unit tests.

The most direct method is human review.

Human review means a person checks model outputs against a rubric. The reviewer may score correctness, groundedness, completeness, safety, tone, and usefulness. This is especially important in domains where mistakes are costly, such as finance, healthcare, legal workflows, compliance, customer support, or internal decision-making tools.

Human review is valuable because humans can understand context, business rules, and user intent better than automated metrics. A reviewer can notice when an answer is technically accurate but not actually helpful. They can also spot subtle risks, such as an answer that sounds too confident when the evidence is weak.

The downside is that human review is slow and expensive. It also introduces inconsistency. Two reviewers may judge the same answer differently unless the rubric is clear.

That is why human review works best when the criteria are specific.

Instead of asking:

Is this a good answer?

A stronger rubric asks:

Does the answer use the retrieved evidence?
Does it include all required facts?
Does it avoid unsupported claims?
Does it follow the product’s tone and safety rules?
Does it ask for clarification when the input is ambiguous?

This makes the review more repeatable.

The other common method is LLM-as-judge.

LLM-as-judge means using a language model to evaluate another model’s output. For example, the judge model may receive the user question, retrieved context, generated answer, and scoring rubric. It then scores the answer against criteria such as correctness, groundedness, completeness, or safety.

This can be useful because it scales better than human review. Engineers can run LLM-based evaluation across hundreds or thousands of test cases whenever they change a prompt, model, retrieval pipeline, or tool workflow.

A judge prompt might ask:

Given the user question, retrieved context, and generated answer,
score whether the answer is grounded in the context.

Return:
- score: 1 to 5
- explanation: brief reason
- unsupported_claims: list any claims not found in the context

In code, an evaluation call might be represented as a small structured request:

def build_groundedness_eval(question: str, context: str, answer: str) -> dict:
    return {
        "task": "score_groundedness",
        "question": question,
        "retrieved_context": context,
        "generated_answer": answer,
        "rubric": {
            "1": "Answer is mostly unsupported by the context",
            "3": "Answer is partially supported but includes gaps",
            "5": "Answer is fully supported by the context",
        },
        "expected_output": {
            "score": "integer from 1 to 5",
            "explanation": "brief reason",
            "unsupported_claims": "list of unsupported claims",
        },
    }
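The judge's reply is itself model output, so it is worth validating before the score is trusted. The sketch below assumes the judge was asked to return JSON with the three fields described above. `parse_judge_response` is a hypothetical helper name.

```python
import json


def parse_judge_response(raw: str) -> dict:
    """Validate a judge reply; raise ValueError if it is malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("judge response must be a JSON object")

    score = data.get("score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")

    claims = data.get("unsupported_claims", [])
    if not isinstance(claims, list):
        raise ValueError("unsupported_claims must be a list")

    return {
        "score": score,
        "explanation": str(data.get("explanation", "")),
        "unsupported_claims": claims,
    }


raw = '{"score": 4, "explanation": "One minor gap.", "unsupported_claims": []}'
print(parse_judge_response(raw))
```

A malformed reply is better recorded as an evaluator error than silently counted as a low score, because it says nothing about the answer being judged.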

This is useful, but it is not perfect.

LLM judges can be biased. They may prefer longer answers. They may reward confident writing even when the answer is weak. They may miss domain-specific mistakes. They may be inconsistent across runs. They may also fail when the retrieved context is long, messy, or requires expert interpretation.

This means LLM-as-judge should not be treated as absolute truth. It is better used as a scalable signal.

A practical setup is to combine several approaches:

Use automated checks for deterministic rules.
Use LLM-as-judge for scalable quality scoring.
Use human review for high-risk cases, unclear failures, and calibration.
Compare evaluator results against human judgement over time.

The calibration step matters. If the LLM judge consistently disagrees with human reviewers, the rubric, judge prompt, or scoring method needs improvement.
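A simple way to quantify that disagreement is to score the same cases with both the judge and human reviewers, then measure how often the two scores fall within a tolerance of each other. This is a rough sketch under that assumption. The case IDs and scores are invented, and teams that need more rigour may prefer a chance-corrected agreement statistic.

```python
def agreement_rate(human: dict[str, int], judge: dict[str, int], tolerance: int = 1) -> float:
    """Share of jointly scored cases where judge and human differ by at most `tolerance`."""
    shared = human.keys() & judge.keys()
    if not shared:
        raise ValueError("no cases were scored by both human and judge")
    agree = sum(1 for case_id in shared if abs(human[case_id] - judge[case_id]) <= tolerance)
    return agree / len(shared)


# Hypothetical 1-5 scores keyed by case ID.
human_scores = {"a": 5, "b": 2, "c": 4, "d": 1}
judge_scores = {"a": 4, "b": 5, "c": 4, "d": 3}

print(agreement_rate(human_scores, judge_scores))
print(agreement_rate(human_scores, judge_scores, tolerance=0))
```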

The goal is not to remove humans completely. The goal is to use human review where it matters most and use automated evaluation to catch regressions earlier.

In production teams, evaluation often becomes a layered workflow. Simple checks catch obvious failures. LLM judges catch many quality regressions. Human reviewers inspect a smaller sample of important or uncertain cases. Production monitoring then shows what happens with real users.

The best evaluation systems do not rely on one method. They combine review methods so that weaknesses in one layer are covered by another.

Human review gives judgement. LLM-as-judge gives scale. Together, they help engineers evaluate behaviour in systems where correctness is not always a single fixed output.

Regression Testing for Prompts, RAG, and Model Changes

In traditional software, a regression happens when a change breaks behaviour that used to work. The same problem exists in LLM applications, but it can be harder to detect.

A prompt change may improve one answer and make another worse. A model upgrade may improve reasoning quality but change tone, latency, or cost. A new chunking strategy may improve retrieval for long documents but reduce precision for short ones. A reranker may improve relevance overall while accidentally pushing critical policy documents lower in some cases.

This is why LLM applications need regression testing.

Regression testing means running a fixed set of evaluation cases before and after a change, then comparing the results. The goal is not just to see whether the new version works. The goal is to check whether it is better, worse, or riskier than the current version.

For LLM systems, regression testing should cover more than final answer quality. It should also test the components that shape the answer.

For a prompt change, useful checks include:

Did correctness improve or decline?
Did the answer become too verbose or too vague?
Did refusals become too strict or too loose?
Did the model continue following the required format?
Did the tone still match the product?

For a model change, the team should also check:

Latency.
Cost per request.
Output consistency.
Instruction-following.
Safety behaviour.
Compatibility with existing prompts and tools.
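Latency and cost are the easiest of these to measure directly. The sketch below uses a nearest-rank percentile for tail latency and a per-token price calculation. The prices and token counts are placeholders, not any provider's real rates.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def cost_per_request(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Cost of one request given token counts and per-1,000-token prices."""
    return (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k


# Hypothetical latencies in milliseconds from one evaluation run.
latencies_ms = [120, 90, 300, 150, 110, 95, 100, 130, 105, 2000]
print("p50:", percentile(latencies_ms, 50))
print("p95:", percentile(latencies_ms, 95))

# Placeholder prices in arbitrary currency units.
print("cost:", cost_per_request(1200, 300, input_price_per_1k=0.5, output_price_per_1k=1.5))
```

Tail percentiles matter more than the average here, because a single slow response can dominate the user's experience while barely moving the mean.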

For a RAG system, regression testing needs another layer because the answer depends on retrieval. Engineers should evaluate both the retrieved context and the generated response.

A RAG evaluation should ask:

Did the retriever return the expected source documents?
Were the retrieved chunks relevant?
Were important chunks missed?
Did the final answer use the retrieved evidence?
Did the model add unsupported claims?
Were citations or source references correct?
Did metadata filters exclude stale or unauthorised content?

This distinction matters.

If the answer is wrong because the right document was never retrieved, the model is probably not the main problem. The cause may be chunking, embeddings, filters, indexing, or ranking.

If the right evidence was retrieved but the answer still ignored it, the issue may be context construction, prompting, or model behaviour.
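This triage rule can be written down so that every failing case gets a first-pass label. The function below is a hypothetical sketch: it assumes the golden dataset lists expected source IDs and that correctness and groundedness have already been scored as booleans. The label is a hint about where to look first, not a diagnosis.

```python
def attribute_failure(
    expected_ids: set[str],
    retrieved_ids: list[str],
    answer_correct: bool,
    answer_grounded: bool,
) -> str:
    """Return 'ok', 'retrieval', or 'generation' as a first-pass triage label."""
    if answer_correct and answer_grounded:
        return "ok"
    if not expected_ids.issubset(retrieved_ids):
        # The right evidence never reached the model: check chunking,
        # embeddings, filters, indexing, or ranking first.
        return "retrieval"
    # The evidence was present but the answer still failed: check context
    # construction, prompting, or model behaviour.
    return "generation"


print(attribute_failure({"policy_12#3"}, ["faq_2#1"], False, False))
print(attribute_failure({"policy_12#3"}, ["policy_12#3", "faq_2#1"], False, True))
```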

A practical release workflow might look like this:

Define the change: prompt update, model upgrade, retriever change, or indexing change.
Run the golden dataset against the current version.
Run the same dataset against the candidate version.
Compare scores across correctness, groundedness, safety, latency, and cost.
Review failures manually where the result is unclear.
Deploy only if the trade-offs are acceptable.

A simplified regression comparison could look like this:

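def compare_eval_runs(baseline: list[dict], candidate: list[dict]) -> dict:
    """Report cases whose correctness score dropped between two evaluation runs."""
    # Index the baseline by case ID so each candidate case can find its counterpart.
    baseline_by_id = {case["case_id"]: case for case in baseline}
    regressions = []
    unmatched = []  # candidate cases with no baseline result to compare against

    for new_case in candidate:
        old_case = baseline_by_id.get(new_case["case_id"])
        if old_case is None:
            # A case added since the baseline run is not a regression; record it and move on.
            unmatched.append(new_case["case_id"])
            continue

        if new_case["scores"]["correctness"] < old_case["scores"]["correctness"]:
            regressions.append({
                "case_id": new_case["case_id"],
                "metric": "correctness",
                "before": old_case["scores"]["correctness"],
                "after": new_case["scores"]["correctness"],
            })

    return {
        "regression_count": len(regressions),
        "regressions": regressions,
        "unmatched_case_ids": unmatched,
    }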

This kind of comparison does not replace human judgement. It gives engineers a structured way to identify where behaviour changed.

The key word is trade-offs.

A candidate version does not have to improve every metric. Sometimes a model that is slightly slower may be worth it if it significantly improves groundedness. Sometimes a cheaper model may be acceptable for low-risk queries but not for high-risk workflows. Evaluation helps make those trade-offs visible instead of relying on intuition.
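One way to make those trade-offs explicit is to encode them as a release gate with three outcomes: pass, review, or fail. Every threshold in the sketch below is a made-up example, as are the metric names. The point is that the rules are written down and versioned, not that these particular numbers are right.

```python
def release_decision(baseline: dict[str, float], candidate: dict[str, float]) -> str:
    """Return 'pass', 'review', or 'fail' from aggregate metrics of two versions."""
    # Hypothetical rule 1: safety must never drop.
    if candidate["safety"] < baseline["safety"]:
        return "fail"
    # Hypothetical rule 2: correctness may dip slightly, but not by more than 0.1.
    if candidate["correctness"] < baseline["correctness"] - 0.1:
        return "fail"
    # Hypothetical rule 3: a latency or cost increase above 20% needs a human decision.
    latency_ratio = candidate["p95_latency_ms"] / baseline["p95_latency_ms"]
    cost_ratio = candidate["cost_per_request"] / baseline["cost_per_request"]
    if latency_ratio > 1.2 or cost_ratio > 1.2:
        return "review"
    return "pass"


baseline = {"correctness": 4.2, "safety": 5.0, "p95_latency_ms": 1800, "cost_per_request": 0.010}
slower_but_better = {"correctness": 4.5, "safety": 5.0, "p95_latency_ms": 2500, "cost_per_request": 0.011}

print(release_decision(baseline, slower_but_better))
```

Here the slower but more accurate candidate is routed to review rather than blocked, which leaves the final call with a person who can weigh the improvement against the added latency.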

Regression testing is also useful for detecting silent failures. An application may still return successful API responses while quality gets worse. Without evaluation, a team may only notice when users complain. With regression tests, the team can catch quality drops before deployment.

This is especially important when LLM systems depend on external providers or changing data. Model behaviour can shift. Documents can become stale. Embedding models can change. Retrieval indexes can be rebuilt incorrectly. A good evaluation workflow gives engineers a way to detect those changes early.

The goal is not to block every release with a heavy manual process. The goal is to create enough evaluation coverage that changes can be shipped with confidence.

For production LLM applications, prompts, models, retrieval settings, and evaluation datasets should all be treated as versioned parts of the system. If they can change behaviour, they need to be tested.
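A lightweight way to tie evaluation results to a specific version is to hash the settings that can change behaviour and store that fingerprint with every run. The configuration keys and values below are hypothetical. The sketch only assumes the configuration can be serialised to JSON.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Short, stable hash of a JSON-serialisable configuration."""
    # sort_keys makes the hash independent of dictionary key order.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Hypothetical system configuration; all names and values are placeholders.
config = {
    "prompt_template": "Answer using only the provided context.",
    "model": "example-model-v1",
    "retrieval": {"top_k": 5, "chunk_size": 512},
    "golden_dataset_version": "2024-05",
}

print(config_fingerprint(config))
```

Two evaluation runs with the same fingerprint used the same prompt, model, retrieval settings, and dataset version, so a score difference between them points to something outside this configuration, such as provider-side changes or updated documents.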

Closing Thoughts: Evaluate the System, Not Just the Model

LLM applications need unit tests, but unit tests are only one layer of quality control.

They protect the deterministic parts of the system: APIs, validation, database logic, queues, retries, permissions, and integrations. That still matters. A model-powered application is still a software application.

But the hardest failures in LLM systems often happen above the unit-test layer.

The code executes successfully, the API returns 200 OK, and the user receives a fluent answer. The problem is that the answer may be incomplete, unsupported, unsafe, too slow, too expensive, or based on the wrong context.

That is why LLM evaluation has to focus on behaviour.

A production evaluation strategy should ask practical questions:

Evaluation area	What to check
Correctness	Does the answer solve the user’s actual request?
Groundedness	Is the answer supported by retrieved or provided evidence?
Completeness	Does it include the important details without over-answering?
Safety	Does it avoid unsafe, unauthorised, or policy-breaking responses?
Retrieval quality	Did the system retrieve the right context?
Regression risk	Did a prompt, model, or retrieval change make answers worse?
Latency	Is the response fast enough for the product experience?
Cost	Are runtime and evaluation costs sustainable?
Debuggability	Can engineers trace why the system produced that answer?

The key shift is moving from:

Does the function return the expected value?

to:

Does the application behave reliably across realistic scenarios?

That requires golden datasets, criteria-based scoring, human review, LLM-as-judge where appropriate, regression testing, and production monitoring.

None of these methods is perfect on its own. Together, they give engineers a clearer view of how the system behaves and where it fails.

This matters because LLM applications are sensitive to change. A prompt edit, model upgrade, retrieval change, new document source, or embedding update can improve one part of the system while breaking another. Without evaluation, these changes are judged by demos and intuition. With evaluation, they can be compared against real criteria.

The goal is not to make LLM systems perfectly predictable. The goal is to make them measurable, debuggable, and safe enough for the use case.

For backend engineers and AI engineers, this is the practical work behind reliable LLM applications.

Unit tests protect the code. Evaluations protect the behaviour. Observability connects both to what actually happens in production.

That is what makes LLM application development an engineering discipline rather than just prompt experimentation.