What Is an Evaluation Dataset and Why Every AI Pilot Needs One

Every demo of an AI product is a magic trick. The vendor types five queries they know the system handles. The buyer is impressed. The contract gets signed. Six weeks into production, the system fails on a real user input and nobody can explain why, because nobody wrote down what 'working' was supposed to mean.

The fix is not better demos. The fix is an evaluation dataset. An eval set is a fixed collection of input and output pairs that the AI system has to pass. It is the unit-test layer for AI software. Without it, you have no way to tell whether yesterday's prompt change improved the system, broke it, or did nothing detectable. With it, you have a reproducible, auditable answer to the question 'is this AI good enough yet.'

This article explains what belongs in an eval set, why demos are systematically misleading, and how to build a working eval set in a week. It is the practice that, more than any other, separates the AI pilots that produce P&L impact from the 95% that, according to the MIT NANDA 2025 study of 300 enterprise GenAI deployments, do not.

What an evaluation dataset actually is

An eval set is not a benchmark and it is not a test suite for the model. It is a test suite for your specific use of the model, on your data, against your acceptance criteria. It is small enough to run on every change (typically 50 to 500 examples) and structured enough that the pass rate is a number, not an opinion.

Five components belong in a useful eval set. If any of them is missing, the eval misleads in a particular way.

Golden examples (the happy path). 30 to 50 representative inputs with known-correct outputs that any production-grade version of the system has to pass. These anchor the definition of 'working.' If the system breaks on a golden example, the build is broken.
Edge cases. Unusual but legitimate inputs: long inputs, multilingual content, malformed punctuation, ambiguous references, misspellings. These catch roughly the 20% of real traffic that breaks naive prompts.
Adversarial cases. Prompt injection, jailbreak attempts, off-topic asks, requests for restricted information. Anthropic explicitly recommends testing both where a behaviour should and where it should not occur, so you do not optimise one side at the cost of the other.
Regression set. Every bug ever fixed, frozen as a permanent test case. Without this, you re-introduce the same defect every time you swap a model or rewrite a prompt. The regression set grows over time and is the highest-leverage part of the eval.
Distribution-matched real traffic. Anonymised production logs sampled to match the actual input distribution. Without this, your eval optimises for what you imagined users would do, not what they actually do. The first time you look, the gap between imagined and real traffic will surprise you.

Why demos systematically lie

Demos are not evidence. There are three specific reasons they cannot be.

Cherry-picked happy path. A demo shows 5 to 10 inputs the vendor knows the system handles. Real production traffic is a long tail. The demo is the 90th percentile of 'looks good,' not the median of 'what users actually send.'
Distribution mismatch (the evaluator-as-user fallacy). The buyer types queries the way they imagine a user types them. Actual users abbreviate, misspell, paste 4 KB emails, attach the wrong file, and switch languages mid-sentence. A demo with the buyer at the keyboard is not a sample of production traffic. It is a sample of the buyer's imagination.
No regression evidence. A demo proves the system works once, on one day, against one model version. It says nothing about whether the same prompt will work after the model is upgraded, the temperature drifts, or you make a small change to a system prompt. Without an eval set, 'it worked in the demo' has zero predictive value.

How big should the eval set be?

The honest answer is: bigger than you think on day one, smaller than the literature suggests. A useful rule of thumb from practitioner writing (Cameron Wolfe's 'Applying Statistics to LLM Evaluations,' 2025): around 246 examples are needed for a ±5% margin of error at 95% confidence on a metric expected to be around 80%. In practice, teams start with 50 to 100 hand-curated examples, grow to 200 to 500 by the time they ship, and let the regression set continue to grow as production traffic surfaces new failure modes.

Smaller sets are useful for fast iteration. Bigger sets are useful for confident statements about quality. You want both, in the same harness, with the small set running on every PR and the big set running nightly or on release.

Manual review, automated metrics, and LLM-as-judge

There are three ways to grade eval outputs. Each has a place. None is sufficient on its own.

Automated metrics (exact match, regex, JSON schema validation, BLEU, ROUGE, custom rules). Cheap, deterministic, and the right answer for structured tasks (extraction, classification, code generation). Useless for free-form judgment.
Human review. The gold standard for subjective tasks, and the only way to bootstrap a useful eval set in the first place. Slow and expensive at scale, but indispensable on day one and for periodic recalibration.
LLM-as-judge. A second LLM evaluates the first model's output against a rubric. The 2024 Survey on LLM-as-a-Judge (arXiv:2411.15594) reports strong judges achieve above 80% agreement with human raters, matching inter-human agreement on many tasks. Useful, but biased.

Build your eval set in a week

This is the workflow we use on day one of every Acme Consulting engagement. It is intentionally minimal. The point is not to produce a perfect eval set in five days. It is to produce one that is good enough to make the next decision, and to set up a habit of growing it.

Day 1: define success criteria in writing. For each user task, state the binary pass/fail rule. Anthropic's guidance is unambiguous: 'everything the grader checks should be clear from the task description.' If you cannot write the rule, you cannot evaluate the system. This is the step most teams skip and most regret.
Day 2: author 20 to 30 golden examples by hand. Real inputs from the target domain with the correct output written out. Skip synthetic data on day one. Hand-curated beats LLM-generated for the seed set, every time.
Day 3: add edge cases and adversarial cases. Aim for 15 to 20 of each. Use the production logs you already have (anonymised) or interview the operator who handles the messiest 10% of requests today. They know the failure modes you do not.
Day 4: wire up the harness. Pick one tool. Promptfoo (YAML, CI-friendly) and Inspect AI (Python, the de-facto standard adopted by Anthropic, DeepMind, and METR) are both excellent open-source choices. Braintrust is a managed option with a good production-trace-to-eval pipeline. Get one command, like 'promptfoo eval' or 'inspect eval', that runs the whole set and reports pass rate.
Day 5: run, analyse failures, and lock in regression cases. Run the suite. For every failure, decide: bug to fix, spec to clarify, or example to remove. Every fixed bug becomes a permanent regression test. Set a CI gate so a PR cannot merge if the pass rate drops below the previous baseline.

By the end of week one you have around 80 examples. By month three you have 200 to 500. By month twelve, the regression set is the most valuable asset in the project, because it encodes everything the system has ever been wrong about and is now right about.

Why we author the eval set on day one

Every Acme Consulting engagement has an eval-driven structure. We write the eval set before we write the prompt, before we pick the model, before we choose retrieval architecture. We do this for two reasons. The first is that it forces the conversation about acceptance criteria to happen at the start, when changes are cheap. The second is that an eval set is the only artefact in an AI engagement that becomes more valuable over time, regardless of which model wins next year. Prompts will be rewritten. Models will be swapped. The eval set, if it captures the actual definition of 'working' for your domain, outlives all of them.

If a vendor is unwilling to author one with you on day one, the engagement is shaped to produce a demo, not a system. That is the single most useful filter you can apply.