Product Methodology

Anatomy of an Evidence Report

SingleAxis Research

Every AI evaluation produces data. The question is whether that data is structured, auditable, and actionable — or whether it lives in a spreadsheet that no one looks at twice. An Evidence Report is SingleAxis's answer to that question. It is the structured output of a complete evaluation engagement, designed to give technical leaders, compliance teams, and product owners a clear picture of how an AI system performs, where it fails, and what to do about it.

This article walks through each section of an Evidence Report, explains its purpose, and shows how the pieces fit together to form a comprehensive evaluation record.

The Evaluation Pipeline

Before examining the report itself, it is worth understanding the pipeline that produces it.

Each stage feeds the next:

  • Data ingestion brings in the AI outputs to be evaluated.
  • Evaluator assignment matches qualified assessors to the task based on domain expertise.
  • Structured evaluation is the core assessment phase, where evaluators score outputs against a rubric with taxonomy-linked findings.
  • Quality assurance validates evaluator consistency through gold tasks and inter-annotator agreement.
  • Statistical analysis aggregates individual assessments into population-level metrics.
  • The Evidence Report synthesises all of this into a document that supports decision-making.
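The ordering matters, so the pipeline can be thought of as a fixed sequence of stages. The sketch below is illustrative only: the stage names mirror the description above, and the handler callables are hypothetical placeholders for the real ingestion, evaluation, QA, analysis, and report-generation steps.

```python
from enum import Enum

class Stage(Enum):
    """Ordered stages of the evaluation pipeline described above."""
    DATA_INGESTION = 1
    EVALUATOR_ASSIGNMENT = 2
    STRUCTURED_EVALUATION = 3
    QUALITY_ASSURANCE = 4
    STATISTICAL_ANALYSIS = 5
    EVIDENCE_REPORT = 6

def run_pipeline(outputs, handlers):
    """Run each stage in definition order, feeding one stage's result into the next.

    `handlers` maps each Stage to a callable; the callables are hypothetical
    stand-ins for the real pipeline steps.
    """
    artefact = outputs
    for stage in Stage:
        artefact = handlers[stage](artefact)
    return artefact
```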

Section 1: Executive Summary

The executive summary exists for one reader: the person who needs to make a decision in the next fifteen minutes. It contains the evaluation scope, the headline metrics, and the critical findings — nothing more.

A typical executive summary answers three questions:

  • What was evaluated? The AI system, the specific capability or workflow under assessment, and the volume of outputs reviewed.
  • What was the overall result? Aggregate quality scores, the distribution of severity levels across findings, and a clear pass/fail/conditional assessment if the engagement defined threshold criteria.
  • What are the critical issues? The findings classified as Critical or High severity, stated plainly without requiring the reader to interpret statistical tables.

The executive summary is not a marketing document. It is a clinical assessment. If the system performed well, the summary says so. If the system has critical failures, the summary states them without hedging.

Section 2: Evaluation Methodology

Reproducibility is a foundational requirement. The methodology section documents exactly how the evaluation was conducted so that it can be repeated, audited, or compared against future evaluations.

This section covers:

Taxonomy configuration. Which taxonomy categories and codes were active for this evaluation. SingleAxis maintains the SASF master taxonomy of 103 codes across eleven categories (Faithfulness, Safety, Privacy, Quality, Instruction Following, Refusal Behaviour, Retrieval/Tool Use, Multi-turn Coherence, Voice & Audio, Vision & Visual, and Fairness & Bias). Each project selects the relevant subset and may override default severity levels to match its risk context — the same finding code might be Critical in a medical deployment and Low in a creative writing tool.
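A per-project taxonomy configuration can be captured in a form like the following. This is a minimal sketch: the field names and the example override are illustrative, not the actual SASF schema, and only the severity-override behaviour described above is assumed.

```python
# Minimal sketch of a per-project taxonomy configuration.
# Field names and example values are illustrative, not the SASF schema.
project_taxonomy = {
    "active_categories": ["Faithfulness", "Safety", "Privacy", "Quality"],
    "severity_overrides": {
        # The same code can carry different severities in different contexts,
        # e.g. a fabricated claim treated as Critical in a medical deployment.
        "F1.1": "Critical",
    },
}

def effective_severity(code, default_severity, config):
    """Return the project-specific severity for a finding code,
    falling back to the taxonomy default when no override exists."""
    return config["severity_overrides"].get(code, default_severity)
```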

Rubric specification. The complete rubric used by evaluators, including scale definitions, dimension descriptions, finding thresholds, and the linked taxonomy codes for each dimension. The rubric is the contract between the evaluation team and the client — it defines what "good" looks like and at what point a score triggers a finding.
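One way to represent such a rubric programmatically is sketched below. The dimension name, scale, and threshold values are hypothetical examples rather than a prescribed SingleAxis format; the point is that each dimension carries its own scale, finding threshold, and linked taxonomy codes.

```python
from dataclasses import dataclass, field

@dataclass
class RubricDimension:
    """A single scored dimension of the rubric.

    Scores at or below `finding_threshold` trigger a finding, classified
    under the linked taxonomy codes.
    """
    name: str
    description: str
    scale_min: int
    scale_max: int
    finding_threshold: int
    linked_codes: list[str] = field(default_factory=list)

# Hypothetical example dimension; values are illustrative only.
factual_accuracy = RubricDimension(
    name="Factual accuracy",
    description="Claims in the output are supported by the source material.",
    scale_min=1,
    scale_max=5,
    finding_threshold=2,
    linked_codes=["F1.1"],
)
```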

Evaluator profile. The number of evaluators, their domain qualifications, and any specialisation criteria applied during assignment. This section does not identify individual evaluators but provides sufficient detail to assess whether the evaluation team was qualified for the domain.

Quality controls. The gold task strategy, inter-annotator agreement methodology, and any exclusion criteria applied to evaluator submissions that fell below quality thresholds.

Section 3: Findings Registry

The findings registry is the heart of the Evidence Report. It is a structured catalogue of every issue identified during evaluation, classified by taxonomy code, severity, and frequency.

Each finding entry contains:

  • Taxonomy code and label. For example, F1.1 — Fabricated factual claims. The code provides machine-readable classification; the label provides human-readable context.
  • Severity level. Critical, High, Medium, or Low, as determined by the project's severity configuration.
  • Frequency. How often this finding appeared across all evaluated outputs, expressed as both a count and a percentage.
  • Representative examples. Anonymised excerpts from evaluated outputs that illustrate the finding, with evaluator annotations explaining why the output was flagged.
  • Root cause hypothesis. Where patterns in the findings suggest a systemic cause — for example, a model consistently hallucinating in a specific topic area may indicate a training data gap — the findings registry includes a preliminary root cause analysis.

The findings registry is sorted by severity and then by frequency, so the most critical and most common issues appear first. This is a deliberate design choice: the reader's attention should be directed to the findings that pose the greatest risk and affect the most outputs.
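As a sketch of that ordering, a finding record and the registry sort might look like the following. The field names and the numeric severity ranking are assumptions made for illustration, not the report's internal schema.

```python
from dataclasses import dataclass

# Lower rank sorts first; the ordering mirrors the severity levels above.
SEVERITY_RANK = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}

@dataclass
class Finding:
    code: str            # e.g. "F1.1"
    label: str           # e.g. "Fabricated factual claims"
    severity: str        # "Critical" | "High" | "Medium" | "Low"
    count: int           # number of evaluated outputs exhibiting the finding
    total_outputs: int   # size of the evaluated sample

    @property
    def frequency(self) -> float:
        """Frequency as a fraction of all evaluated outputs."""
        return self.count / self.total_outputs

def sort_registry(findings: list[Finding]) -> list[Finding]:
    """Order the registry by severity first, then by frequency (descending)."""
    return sorted(findings, key=lambda f: (SEVERITY_RANK[f.severity], -f.frequency))
```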

Section 4: Statistical Analysis

Raw findings become actionable when they are aggregated statistically. This section provides the quantitative backbone of the report.

Distribution analysis. Rubric scores are presented as distributions, not just averages. A model that scores an average of 3.5 on a 5-point scale could have a tight distribution around 3.5 (consistent, mediocre performance) or a bimodal distribution with peaks at 1 and 5 (inconsistent, unpredictable performance). The same average can describe very different systems, which is why the Evidence Report presents the full distribution alongside the average.
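A small sketch of why this matters: the two hypothetical score samples below share the same mean but describe very different systems.

```python
from collections import Counter
from statistics import mean

# Hypothetical score samples on a 5-point scale; both average exactly 3.5.
consistent = [3, 4, 3, 4, 3, 4, 4, 3]
bimodal    = [1, 5, 1, 5, 1, 5, 5, 5]

for name, scores in [("consistent", consistent), ("bimodal", bimodal)]:
    histogram = Counter(scores)
    print(name, round(mean(scores), 2), dict(sorted(histogram.items())))
# The means are identical, but the histograms tell very different stories.
```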

Inter-annotator agreement. Cohen's kappa or Krippendorff's alpha scores for each rubric dimension, showing the degree of consensus among evaluators. High agreement indicates reliable findings. Low agreement on a specific dimension may indicate that the rubric definition needs refinement — which is itself a useful finding.
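For two annotators rating the same items, Cohen's kappa can be computed as in this sketch. The rating data is hypothetical; in practice a statistics library would be used, and Krippendorff's alpha extends the idea to more than two annotators and missing ratings.

```python
from collections import Counter

def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two annotators rating the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance from each annotator's marginals.
    """
    n = len(ratings_a)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical ratings from two evaluators on a shared set of outputs.
print(cohens_kappa([5, 4, 3, 5, 2, 4], [5, 4, 2, 5, 2, 4]))
```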

Severity breakdown. A matrix showing finding counts and percentages by severity level and taxonomy category. This allows the reader to quickly identify which categories of risk are most prevalent and most severe.
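Such a matrix can be derived directly from the findings registry, for example as in this sketch. The `category` attribute is an additional assumption beyond the registry sketch above.

```python
from collections import defaultdict

def severity_breakdown(findings):
    """Count findings by (taxonomy category, severity level).

    `findings` is an iterable of objects with `category`, `severity`, and
    `count` attributes; the `category` field is assumed for this sketch.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for f in findings:
        matrix[f.category][f.severity] += f.count
    return {category: dict(row) for category, row in matrix.items()}
```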

Trend analysis. For clients with multiple evaluation engagements, the statistical section includes comparisons against previous reports. Are Critical findings decreasing over time? Is a specific taxonomy category showing improvement or regression? Trend data transforms individual reports into a continuous quality narrative.
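A minimal sketch of such a comparison between two engagements, assuming each report exposes its finding counts per severity level:

```python
def severity_trend(previous_counts, current_counts):
    """Change in finding counts per severity level between two engagements.

    Both arguments map severity level -> finding count; a negative delta
    means fewer findings at that level in the current evaluation.
    """
    levels = set(previous_counts) | set(current_counts)
    return {lvl: current_counts.get(lvl, 0) - previous_counts.get(lvl, 0)
            for lvl in levels}

# Hypothetical counts from two successive Evidence Reports.
print(severity_trend({"Critical": 6, "High": 14}, {"Critical": 2, "High": 15}))
# e.g. {'Critical': -4, 'High': 1}
```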

Section 5: Evaluator Calibration Metrics

Trust in the report depends on trust in the evaluators. This section provides the evidence that evaluators were calibrated and consistent.

Gold task performance. Each evaluator's accuracy on gold tasks — pre-evaluated outputs with known correct assessments — is reported as a calibration score. Evaluators who fall below the calibration threshold have their submissions excluded from the final analysis and flagged for retraining.
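In sketch form, the calibration score and the exclusion step might look like this. The data structures and the 0.8 threshold are assumed examples, not SingleAxis policy values.

```python
def calibration_score(evaluator_answers, gold_answers):
    """Fraction of gold tasks where the evaluator matched the known-correct assessment."""
    matches = sum(a == g for a, g in zip(evaluator_answers, gold_answers))
    return matches / len(gold_answers)

def filter_calibrated(submissions, calibration_scores, threshold=0.8):
    """Keep only submissions from evaluators at or above the calibration threshold.

    `submissions` maps evaluator id -> their assessments; `calibration_scores`
    maps evaluator id -> gold-task accuracy. The 0.8 threshold is an assumed example.
    """
    return {eid: subs for eid, subs in submissions.items()
            if calibration_scores.get(eid, 0.0) >= threshold}
```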

Agreement matrices. Pairwise agreement between evaluators on overlapping assignments shows whether the evaluation team was applying the rubric consistently. Outlier patterns are flagged and investigated.
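A sketch of how such a matrix can be built from overlapping assignments follows; raw percentage agreement is used here to keep the example short, where in practice a chance-corrected statistic such as kappa would be preferred.

```python
from itertools import combinations

def pairwise_agreement(ratings_by_evaluator):
    """Raw agreement for every pair of evaluators on the items they share.

    `ratings_by_evaluator` maps evaluator id -> {item id: rating}.
    Returns {(evaluator_a, evaluator_b): fraction of shared items rated identically}.
    """
    matrix = {}
    for a, b in combinations(ratings_by_evaluator, 2):
        shared = ratings_by_evaluator[a].keys() & ratings_by_evaluator[b].keys()
        if shared:
            agree = sum(ratings_by_evaluator[a][i] == ratings_by_evaluator[b][i]
                        for i in shared)
            matrix[(a, b)] = agree / len(shared)
    return matrix
```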

Time and effort metrics. Average assessment time per output, with outlier detection. Evaluators who complete assessments significantly faster than the mean may be rushing; those significantly slower may be struggling with the rubric. Both patterns are worth investigating.
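One simple form of that outlier detection is a z-score check against the group mean, sketched below. The input structure and the threshold of 2.0 are assumptions for illustration.

```python
from statistics import mean, stdev

def time_outliers(seconds_per_output, z_threshold=2.0):
    """Flag evaluators whose mean assessment time is far from the group mean.

    `seconds_per_output` maps evaluator id -> mean seconds per assessment.
    Returns {evaluator id: z-score} for evaluators beyond the threshold;
    large negative values suggest rushing, large positive values struggling.
    """
    times = list(seconds_per_output.values())
    mu, sigma = mean(times), stdev(times)
    return {eid: (t - mu) / sigma
            for eid, t in seconds_per_output.items()
            if abs(t - mu) > z_threshold * sigma}
```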

Section 6: Remediation Guidance

Findings without remediation are observations. Findings with remediation are action items. The final section of the Evidence Report maps each finding category to concrete remediation steps.

Remediation guidance is tiered by effort and impact:

  • Immediate mitigations. Guardrails, filters, or prompt modifications that can reduce the prevalence of Critical and High findings without retraining. These are designed to be implementable within days.
  • Systematic improvements. Training data augmentation, fine-tuning strategies, or architectural changes that address the root causes identified in the findings registry. These are medium-term investments.
  • Monitoring recommendations. Automated checks and periodic re-evaluation schedules to track whether remediations are effective and to detect regression.

Why Structure Matters

An unstructured evaluation produces opinions. A structured evaluation produces evidence. The difference matters when the audience is a regulator, a board, an enterprise client's procurement team, or an internal safety review.

Every section of the Evidence Report exists because someone needs it. The executive summary serves the decision-maker. The methodology section serves the auditor. The findings registry serves the engineering team. The statistical analysis serves the data scientist. The calibration metrics serve the quality assurance function. The remediation guidance serves the product owner.

This is what it means to turn evaluation from a checkbox exercise into a decision-making tool. The Evidence Report is the artefact that makes that transformation concrete.