Building an AI Evaluation Programme from Scratch

Most organisations deploying AI have some form of testing. Few have an evaluation programme. The difference is structural: testing asks "does this work?"; evaluation asks "how well does this work, where does it fail, how severely, and what should we do about it?"

If your current evaluation practice consists of team members running prompts and reviewing outputs informally, you are not behind — you are at the starting point. This guide walks through the steps to build a structured evaluation programme that produces reliable, auditable, actionable results.

The Maturity Model

AI evaluation capability develops through four stages. Most organisations begin at Stage 1 and benefit from moving to Stage 2 within their first quarter. Stages 3 and 4 are where evaluation becomes a competitive and regulatory advantage.

Step 1: Define Your Evaluation Scope

Before you evaluate anything, you need to know what you are evaluating and why. This sounds obvious. It is routinely skipped.

Identify the AI systems in scope. List every AI system your organisation deploys or plans to deploy. For each system, document: what it does, who uses it, what decisions it informs, and what the consequences of failure are. This inventory is the foundation of your evaluation programme.

Classify by risk. Not every system needs the same level of evaluation rigour. A customer-facing medical summarisation tool requires comprehensive, expert-led evaluation. An internal code suggestion tool may need lighter-touch assessment. Use the risk classification to allocate evaluation resources proportionally.
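
To make the inventory and risk classification concrete, here is a minimal sketch in Python. The record fields, the three-tier scheme, and the rigour mapping are illustrative assumptions rather than a prescribed schema; adapt them to your own risk framework.

```python
# A minimal inventory record. Field names and the three-tier
# risk scheme are illustrative assumptions, not a standard.
from dataclasses import dataclass
from enum import Enum

class RiskTier(Enum):
    HIGH = "high"      # e.g. customer-facing medical summarisation
    MEDIUM = "medium"
    LOW = "low"        # e.g. internal code suggestion

@dataclass
class AISystemRecord:
    name: str
    purpose: str              # what it does
    users: str                # who uses it
    decisions_informed: str   # what decisions it informs
    failure_consequence: str  # what happens when it fails
    risk_tier: RiskTier

# The risk tier drives how much evaluation rigour a system receives.
EVALUATION_RIGOUR = {
    RiskTier.HIGH: "comprehensive, expert-led evaluation",
    RiskTier.MEDIUM: "structured evaluation with sampled expert review",
    RiskTier.LOW: "lighter-touch assessment",
}
```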

Define what "good" means. For each system, articulate the quality criteria that matter. This is not "the output should be correct" — that is too vague to evaluate. Good criteria are specific and measurable: "The output should not contain factual claims unsupported by the source document." "The output should correctly identify all entities mentioned in the input." "The output should follow the specified format without deviation."

Your quality criteria will evolve, and that is expected. The point is to start with explicit criteria rather than implicit assumptions.
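
One lightweight way to keep criteria explicit is to store them as structured records from day one. A sketch using the example criteria above; the record shape is an assumption.

```python
# Quality criteria as explicit, checkable records rather than
# implicit assumptions. The structure is a hypothetical example.
from dataclasses import dataclass

@dataclass
class QualityCriterion:
    criterion_id: str
    statement: str   # specific and measurable
    applies_to: str  # which system or output type it covers

CRITERIA = [
    QualityCriterion(
        "QC-01",
        "The output should not contain factual claims unsupported "
        "by the source document.",
        "summarisation",
    ),
    QualityCriterion(
        "QC-02",
        "The output should correctly identify all entities mentioned "
        "in the input.",
        "extraction",
    ),
    QualityCriterion(
        "QC-03",
        "The output should follow the specified format without deviation.",
        "all systems",
    ),
]
```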

Step 2: Build Your Taxonomy

A taxonomy is a structured classification system for the things that can go wrong. Without one, every evaluator invents their own vocabulary for failures, and your evaluation data is incomparable across assessments.

Start with established frameworks. The SASF taxonomy provides a comprehensive starting point: eleven categories (Faithfulness, Safety, Privacy, Quality, Instruction Following, Refusal Behaviour, Retrieval/Tool Use, Multi-turn Coherence, Voice & Audio, Vision & Visual, Fairness & Bias) with 103 specific codes. You do not need all 103 codes for every project. Select the subset relevant to your systems and add domain-specific codes where the standard taxonomy does not cover your failure modes.

Define severity levels. Each finding code needs a severity classification — typically Critical, High, Medium, and Low. Severity should reflect the consequence of the failure in your specific deployment context. A hallucinated fact (F1.1) might be Critical in a healthcare context and Medium in a creative writing assistant. Severity calibration is a policy decision, not a technical one, and it should involve stakeholders who understand the downstream impact of failures.

Document everything. Your taxonomy document should include: each code's identifier, its plain-English label, a precise definition, representative examples of what does and does not qualify as that finding, and the default severity level. This document is the single source of truth for your evaluation team.
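
A sketch of what one taxonomy entry might look like as a structured record, using the F1.1 code mentioned in this guide. The class layout and the example definition text are assumptions, not part of the SASF standard.

```python
# One way to encode taxonomy entries so every evaluator shares
# the same vocabulary. Definition and examples are illustrative.
from dataclasses import dataclass
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"

@dataclass
class TaxonomyCode:
    code: str                       # identifier, e.g. "F1.1"
    label: str                      # plain-English label
    definition: str                 # precise definition
    qualifying_examples: list[str]      # what counts as this finding
    non_qualifying_examples: list[str]  # what does not
    default_severity: Severity      # overridable per deployment context

F1_1 = TaxonomyCode(
    code="F1.1",
    label="Fabricated claims",
    definition="The output asserts a fact with no basis in the "
               "source material.",
    qualifying_examples=["Cites a study that does not exist."],
    non_qualifying_examples=["Restates a claim present in the source."],
    default_severity=Severity.HIGH,  # might be Critical in healthcare
)
```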

Step 3: Design Your Rubric

The rubric is the instrument your evaluators use to assess each AI output. It translates your quality criteria and taxonomy into a structured assessment form.

Choose dimension types carefully. Scale dimensions (1-5 ratings) work well for qualities that exist on a continuum — fluency, relevance, completeness. Boolean dimensions (yes/no) work for binary checks — "Does the output contain personally identifiable information?" Select dimensions work when the evaluator needs to classify the output into mutually exclusive categories.

Link findings to rubric dimensions. Each dimension should specify the score threshold at which a finding is triggered and the taxonomy codes that apply. If an evaluator scores "Factual Accuracy" at 2 or below, the rubric should prompt them to select the specific faithfulness code that applies (F1.1 Fabricated claims, F1.2 Unsupported claims, F2.1 Misrepresentation, etc.) and provide a brief annotation.

Write clear scale labels. A 5-point scale where 1 means "bad" and 5 means "good" produces unreliable data. Each scale point needs a specific description that minimises interpretive ambiguity. "3 — Output is mostly accurate but contains one or two minor factual imprecisions that do not affect the overall conclusion" is evaluable. "3 — Average" is not.
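
Pulling the last three points together, here is a minimal sketch of a scale dimension that carries its own labels, finding threshold, and linked taxonomy codes. The field names and label wording are illustrative assumptions.

```python
# A rubric dimension that links a low score to taxonomy codes,
# following the "Factual Accuracy at 2 or below" example above.
from dataclasses import dataclass

@dataclass
class ScaleDimension:
    name: str
    scale_labels: dict[int, str]  # a specific description per point
    finding_threshold: int        # score at/below which a finding triggers
    linked_codes: list[str]       # codes the evaluator selects from

factual_accuracy = ScaleDimension(
    name="Factual Accuracy",
    scale_labels={
        5: "All factual claims are supported by the source.",
        4: "One minor imprecision; conclusions unaffected.",
        3: "Mostly accurate; one or two minor factual imprecisions "
           "that do not affect the overall conclusion.",
        2: "A material factual error affects a conclusion.",
        1: "Multiple material errors or a fabricated central claim.",
    },
    finding_threshold=2,
    linked_codes=["F1.1", "F1.2", "F2.1"],
)

def requires_finding(dim: ScaleDimension, score: int) -> bool:
    """True when the evaluator must select a code and annotate."""
    return score <= dim.finding_threshold
```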

Pilot your rubric. Before deploying to your full evaluation team, have three to five evaluators independently assess the same ten outputs using the rubric. Compare their results. Where they diverge, the rubric is ambiguous. Revise the rubric, then pilot again. Two rounds of piloting typically produce a stable instrument.

Step 4: Recruit and Calibrate Evaluators

The quality of your evaluation is bounded by the quality of your evaluators. This is not a task for interns or crowd workers — at least not for high-risk systems.

Match expertise to domain. Evaluating a medical AI system requires evaluators with clinical knowledge. Evaluating a legal AI system requires evaluators who understand legal reasoning. Domain expertise is not optional for high-risk evaluation — it is the difference between identifying a plausible-sounding hallucination and missing it entirely.

Calibrate with gold tasks. Gold tasks are pre-evaluated outputs with known correct assessments. New evaluators complete gold tasks as part of onboarding, and ongoing gold tasks are interspersed with real assessments to monitor calibration over time. Evaluators who consistently deviate from the gold standard need retraining or removal from the evaluation pool.

Measure inter-annotator agreement. Assign a subset of outputs to multiple evaluators and measure their agreement. Cohen's kappa above 0.7 indicates substantial agreement and a reliable rubric. Below 0.5 indicates problems with the rubric, the evaluator pool, or both.
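
Cohen's kappa is straightforward to compute with scikit-learn. A small example with invented ratings from two evaluators on the same ten outputs:

```python
# Pairwise agreement between two evaluators via Cohen's kappa.
# The 1-5 ratings here are invented for illustration.
from sklearn.metrics import cohen_kappa_score

evaluator_a = [5, 4, 4, 2, 3, 5, 1, 4, 3, 2]
evaluator_b = [5, 4, 3, 2, 3, 5, 2, 4, 3, 2]

kappa = cohen_kappa_score(evaluator_a, evaluator_b)
if kappa >= 0.7:
    print(f"kappa={kappa:.2f}: substantial agreement, rubric looks reliable")
elif kappa >= 0.5:
    print(f"kappa={kappa:.2f}: moderate agreement, review ambiguous dimensions")
else:
    print(f"kappa={kappa:.2f}: investigate the rubric and evaluator pool")
```

For ordinal 1-5 scales, a weighted kappa (weights="quadratic" in scikit-learn) is often more informative, because it penalises a 1-versus-5 disagreement more heavily than a 3-versus-4 one.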

Step 5: Establish Quality Gates

Quality gates are the checkpoints that prevent unreliable evaluation data from contaminating your results.

Evaluator-level gates. Minimum gold task accuracy (typically 80% or higher). Maximum deviation from mean assessment time (evaluators who are dramatically faster or slower than peers warrant investigation). Minimum inter-annotator agreement on overlapping assignments.

Project-level gates. Minimum number of evaluated outputs to support statistically meaningful conclusions. Minimum evaluator coverage (no single evaluator should assess more than a defined percentage of total outputs). Completion of all gold task insertions.

Report-level gates. Peer review of the Evidence Report by a quality lead before delivery. Verification that findings are supported by evaluation data. Confirmation that statistical claims are methodologically sound.
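
The evaluator-level and project-level gates lend themselves to automated checks. A sketch, assuming simple stats records and the 80% gold accuracy threshold above; the 50% time-deviation and 30% coverage thresholds are invented placeholders.

```python
# Hypothetical gate checks; record shapes and thresholds other than
# the 80% gold accuracy figure are assumptions, not a standard API.
from dataclasses import dataclass

@dataclass
class EvaluatorStats:
    gold_correct: int
    gold_total: int
    mean_task_seconds: float
    pool_mean_seconds: float

def passes_evaluator_gates(
    s: EvaluatorStats,
    min_gold_accuracy: float = 0.80,   # gold task accuracy gate
    max_time_deviation: float = 0.50,  # placeholder: +/-50% of pool mean
) -> bool:
    if s.gold_correct / s.gold_total < min_gold_accuracy:
        return False  # retrain or remove from the evaluation pool
    deviation = abs(s.mean_task_seconds - s.pool_mean_seconds) / s.pool_mean_seconds
    return deviation <= max_time_deviation  # outliers warrant investigation

def passes_coverage_gate(outputs_per_evaluator: dict[str, int],
                         max_share: float = 0.30) -> bool:
    """No single evaluator assesses more than max_share of all outputs."""
    total = sum(outputs_per_evaluator.values())
    return all(n / total <= max_share for n in outputs_per_evaluator.values())
```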

Step 6: Generate and Act on Evidence Reports

The Evidence Report is the output of your evaluation programme, but it is not the end. It is the beginning of a remediation cycle.

Distribute findings to the right teams. Critical and High severity findings should trigger immediate remediation workflows. The engineering team needs the specific failure examples and root cause analysis. The product team needs the severity distribution to inform prioritisation. The compliance team needs the audit trail for regulatory documentation.

Track remediation. Every finding in an Evidence Report should be tracked through remediation. Did the engineering team address the root cause? Did the fix reduce the prevalence of the finding in the next evaluation? An evaluation programme without remediation tracking is a programme that generates reports but does not generate improvement.

Re-evaluate after changes. When a model is updated, a prompt is revised, or a guardrail is added, re-evaluate. The new Evidence Report should show measurable improvement in the areas targeted by the remediation. If it does not, the remediation was insufficient.
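
A minimal sketch of the re-evaluation check: compare per-code finding prevalence across two Evidence Reports and flag codes that did not improve. The helper and the numbers are invented for illustration.

```python
# Compare finding prevalence between two evaluation rounds. Each
# findings list holds one taxonomy code per output where it was seen.
from collections import Counter

def prevalence(findings: list[str], n_outputs: int) -> dict[str, float]:
    """Rate of each finding code per evaluated output."""
    return {code: n / n_outputs for code, n in Counter(findings).items()}

before = prevalence(["F1.1"] * 12 + ["F2.1"] * 4, n_outputs=200)
after = prevalence(["F1.1"] * 3 + ["F2.1"] * 5, n_outputs=200)

for code in sorted(set(before) | set(after)):
    b, a = before.get(code, 0.0), after.get(code, 0.0)
    status = "improved" if a < b else "NOT improved: remediation insufficient"
    print(f"{code}: {b:.1%} -> {a:.1%} ({status})")
```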

Step 7: Iterate and Mature

Your evaluation programme will improve with each cycle. The first Evidence Report will be rougher than the tenth. The taxonomy will expand as you encounter new failure modes. The rubric will sharpen as you learn which dimensions produce reliable data and which need revision.

The key is to treat evaluation as a continuous practice, not a one-time project. The organisations that build evaluation into their AI development lifecycle — evaluating early, evaluating often, and acting on what they find — are the organisations that deploy AI systems they can actually trust.

Start where you are. Build the structure. Generate the evidence. Improve from there.