When AI Fails in Healthcare: Lessons from Recent Deployment Incidents

Healthcare AI is not a research curiosity. Large language models are summarising patient records, clinical decision support systems are recommending treatments, and radiology AI is triaging imaging studies in hospitals today. When these systems fail, the consequences are measured in patient harm, not user complaints.
The incidents documented in this article are drawn from published reports, FDA adverse event databases, and peer-reviewed analyses of healthcare AI deployments. They are not edge cases. They are failure patterns — systematic, predictable, and preventable through structured evaluation.
The Failure Taxonomy
Healthcare AI failures cluster into recognisable categories that map directly to the evaluation taxonomy used in structured assessment. Understanding these categories is the first step toward preventing them.
Faithfulness Failures: When the Model Invents Clinical Facts
The most dangerous category of healthcare AI failure is faithfulness — the model generating claims that are not supported by the source material. In a clinical context, a hallucinated fact is not merely wrong. It is wrong in a way that looks right to a tired clinician making rapid decisions.
Hallucinated dosages. Clinical decision support systems have generated medication dosage recommendations that were not present in any referenced guideline. In one documented incident, a large language model integrated into a hospital's clinical workflow recommended a dosage calculated for adult body weight to a paediatric patient. The recommendation was fluent, formatted correctly, and cited a real guideline — but the specific dosage figure was fabricated. The clinician caught it. The next clinician might not.
Fabricated references. When AI systems are asked to support clinical recommendations with evidence, they sometimes generate citations to papers that do not exist, or cite real papers that do not actually support the stated claim. A clinician who trusts the citation without verifying it — a reasonable behaviour when the system has been reliable in previous interactions — may base treatment decisions on non-existent evidence.
Invented history details. EHR summarisation tools have been observed inserting clinical details into patient summaries that were not present in the source records. A patient allergy that was never documented, a prior procedure that never occurred, a family history detail that was confabulated from statistical priors rather than read from the chart. These insertions are particularly insidious because they are plausible — the model generates what a typical patient record might contain, rather than what this specific patient's record actually contains.
In the SingleAxis taxonomy, these failures fall under Faithfulness codes F1.1 (fabricated factual claims), F1.2 (unsupported claims), and F2.1 (misrepresentation of source material). In a healthcare evaluation, all three are classified as Critical severity by default.
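To make the mapping concrete, here is a minimal sketch of how an evaluation pipeline might represent these codes and their healthcare-calibrated default severities. The class names, enum values, and Python structure are assumptions for illustration; only the three codes and their Critical default come from the taxonomy as described above.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    MINOR = "minor"
    MAJOR = "major"
    CRITICAL = "critical"


@dataclass(frozen=True)
class TaxonomyCode:
    code: str            # e.g. "F1.1"
    description: str
    default_severity: Severity


# The three Faithfulness codes referenced above, all defaulting to
# Critical severity under a healthcare-calibrated evaluation.
FAITHFULNESS_CODES = {
    "F1.1": TaxonomyCode("F1.1", "Fabricated factual claims", Severity.CRITICAL),
    "F1.2": TaxonomyCode("F1.2", "Unsupported claims", Severity.CRITICAL),
    "F2.1": TaxonomyCode("F2.1", "Misrepresentation of source material", Severity.CRITICAL),
}
```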
Safety Failures: When the Model Misses What Matters
Safety failures in healthcare AI are failures of omission as much as commission. The model does not need to recommend something harmful. It simply needs to fail to flag something dangerous.
Missed contraindications. Drug interaction checking is one of the most straightforward applications of AI in healthcare, and one where failures are most consequential. Systems have failed to flag known contraindications when the interaction data was presented in an unusual format, when the patient was on a large number of medications (context window limitations), or when the contraindication was conditional on a comorbidity that appeared elsewhere in the record.
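One way to probe this failure mode during evaluation is to construct cases where the contraindication only holds because of a comorbidity recorded elsewhere in the chart, then check whether the system's output raises the expected alert. The sketch below is illustrative only: the `review` interface, the drug pair, and the comorbidity are assumptions, not a real clinical knowledge base or a documented incident.

```python
from dataclasses import dataclass


@dataclass
class SafetyProbe:
    """One probe: a medication list plus the alert the system must raise."""
    medications: list[str]
    comorbidities: list[str]
    expected_alert: str      # substring the output must contain


# Illustrative case in which the contraindication is conditional on a
# comorbidity that appears elsewhere in the record.
PROBES = [
    SafetyProbe(
        medications=["metformin", "iodinated contrast"],
        comorbidities=["chronic kidney disease"],
        expected_alert="metformin",
    ),
]


def run_safety_probes(system, probes):
    """Return the probes for which the system failed to flag the alert."""
    failures = []
    for probe in probes:
        # `system.review` is a stand-in for whatever interface the
        # decision-support tool actually exposes.
        output = system.review(medications=probe.medications,
                               comorbidities=probe.comorbidities)
        if probe.expected_alert.lower() not in output.lower():
            failures.append(probe)
    return failures
```

Probes of this kind can also cover the other two conditions described above: the same interaction data presented in unusual formats, and medication lists long enough to stress the context window.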
Wrong severity classification. Triage systems that classify the urgency of clinical findings have demonstrated a persistent failure pattern: they correctly identify an abnormal finding but assign it an incorrect severity. A finding that warrants urgent follow-up is classified as routine. The patient's imaging study goes to the bottom of the reading queue instead of the top. The radiologist reads it three days later instead of three hours later.
Omitted critical alerts. In complex clinical scenarios — patients with multiple conditions, on multiple medications, with extensive clinical histories — AI systems have been observed silently dropping critical information. Not hallucinating, not getting it wrong, but simply omitting it from the summary or analysis because the input exceeded the model's effective processing capacity.
These failures map to Safety codes S1.1 (harmful recommendations), S2.1 (failure to flag safety issues), and S3.1 (inadequate safety boundaries). In healthcare evaluations, the severity classification for safety failures is almost always Critical, because the downstream consequences are patient harm.
Quality Failures: When the Output Degrades Silently
Quality failures are less dramatic than faithfulness or safety failures, but they are more pervasive, and their cumulative effect undermines clinical trust in AI systems.
Inconsistent output formatting. Clinicians develop reading patterns. They expect vital signs in a specific location, lab results in a specific format, assessment and plan in a specific structure. When an AI summarisation tool produces inconsistent formatting — sometimes listing medications alphabetically, sometimes by category, sometimes by date prescribed — it forces the clinician to slow down and parse the structure each time, eliminating the efficiency gains that justified the tool's adoption.
Incomplete summarisation. Long patient records are routinely truncated or incompletely summarised. The model processes the most recent entries and silently drops historical context. A patient's surgical history from five years ago, a resolved cancer diagnosis, a prior adverse drug reaction — these details disappear from the summary not because the model judged them irrelevant, but because they fell outside its effective context window.
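A completeness check can surface this kind of silent omission: for a set of high-consequence entries that evaluators mark as must-retain, verify that each is at least referenced in the generated summary. The sketch below uses naive substring matching and invented example data; a production check would compare clinical concepts rather than raw strings.

```python
def find_omissions(critical_entries: list[str], summary: str) -> list[str]:
    """Return critical source-record entries the summary never mentions.

    Substring matching is a deliberate simplification; a production check
    would map both sides to clinical concepts rather than raw strings.
    """
    summary_lower = summary.lower()
    return [entry for entry in critical_entries if entry.lower() not in summary_lower]


# Illustrative usage: the critical entries come from chart sections that
# evaluators have marked as must-retain for this patient.
missing = find_omissions(
    critical_entries=["penicillin allergy", "appendectomy (2019)"],
    summary="Patient presents with chest pain. No acute distress.",
)
# missing == ["penicillin allergy", "appendectomy (2019)"]
```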
Degradation under load. Healthcare AI systems that perform well in testing often degrade during peak clinical hours when they are processing higher volumes of requests. Response latency increases, and in some architectures, response quality decreases as the system makes trade-offs to maintain throughput. This degradation is rarely tested in pre-deployment evaluation because evaluation environments do not replicate production load conditions.
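Replaying a representative request set at production-like concurrency before deployment gives at least a first approximation of this behaviour. The sketch below assumes a blocking, thread-safe client call named `send_request`; it measures only latency percentiles, and a fuller harness would also re-score output quality at each concurrency level.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor


def latency_under_load(send_request, prompts, concurrency=32):
    """Replay `prompts` at the given concurrency and report latency percentiles.

    `send_request` is a stand-in for the deployed system's client call; it is
    assumed to be blocking and safe to invoke from multiple threads.
    """
    def timed(prompt):
        start = time.perf_counter()
        send_request(prompt)
        return time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed, prompts))

    return {
        "p50": statistics.median(latencies),
        "p95": latencies[int(0.95 * (len(latencies) - 1))],
        "max": latencies[-1],
    }
```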
What Evaluation Should Have Caught
Every failure described above was detectable through structured human evaluation. Not through automated testing alone, and not through benchmark performance — through trained evaluators assessing real-world outputs against a domain-appropriate taxonomy.
A structured healthcare AI evaluation using the SingleAxis methodology would have included:
Faithfulness verification. Evaluators with clinical training compare every factual claim in the AI output against the source material. Every dosage, every citation, every clinical detail is checked for grounding. Fabricated content is flagged with taxonomy code, severity, and root cause annotation, as sketched in the example after this list.
Safety boundary testing. Evaluators present the system with scenarios that include known contraindications, multi-drug interactions, and clinical urgency signals. Failures to detect or appropriately flag safety-critical information are documented with full traceability.
Quality consistency assessment. Evaluators assess multiple outputs for structural consistency, completeness, and adherence to clinical documentation standards. Degradation patterns are identified through statistical analysis across the evaluation dataset.
Edge case probing. Evaluators deliberately test the boundaries — long records, complex patients, unusual presentations, ambiguous clinical scenarios. These are the cases where AI systems fail, and they are systematically underrepresented in benchmark datasets.
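To make the faithfulness-verification step above concrete, the sketch below shows one possible shape for an evaluator's finding: every flagged claim carries its taxonomy code, severity, root cause, and a pointer back to the exact output and source passage, which is what gives the resulting Evidence Report its traceability. The field names, structure, and example values are assumptions, not a published SingleAxis schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Finding:
    """One evaluator finding, traceable back to the output it came from."""
    output_id: str              # which AI output was assessed
    claim: str                  # the specific claim that was flagged
    source_excerpt: str | None  # grounding passage, or None if none exists
    taxonomy_code: str          # e.g. "F1.1" for a fabricated factual claim
    severity: str               # healthcare default for faithfulness: "critical"
    root_cause: str             # evaluator's annotation
    evaluator_id: str
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# Hypothetical example loosely modelled on the fabricated dosage described
# earlier; every value here is invented for illustration.
finding = Finding(
    output_id="cds-output-0412",
    claim="Recommended dose: 500 mg twice daily",   # hypothetical figure
    source_excerpt=None,        # no supporting passage in the cited guideline
    taxonomy_code="F1.1",
    severity="critical",
    root_cause="dosage figure absent from the referenced guideline",
    evaluator_id="eval-07",
)
```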
The Prevention Framework
Healthcare AI failure prevention is not a technology problem. It is an evaluation problem. The technology will continue to improve. But improvement without structured assessment is improvement without evidence — and in regulated healthcare environments, evidence is everything.
Organisations deploying AI in healthcare should adopt a three-layer evaluation strategy:
Pre-deployment Evidence Report. Before any AI system touches patient data in production, a comprehensive evaluation using domain-qualified evaluators should produce an Evidence Report that documents performance across all relevant taxonomy categories with healthcare-calibrated severity levels.
Periodic re-evaluation. AI system performance is not static. Model updates, data drift, and changing clinical workflows all affect output quality. Quarterly or semi-annual re-evaluation maintains the evidence base and detects degradation before it reaches patients.
Incident-triggered assessment. When adverse events or near-misses involving AI systems are reported, a focused evaluation targeting the specific failure mode should be conducted immediately. This is both a safety measure and a regulatory requirement under most healthcare AI governance frameworks.
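One way to operationalise the three layers is as an evaluation schedule that the governance process owns and reviews. The field names and phrasing below are assumptions for illustration; only the pre-deployment Evidence Report, the quarterly or semi-annual cadence, and the incident trigger come from the framework described above.

```python
# Minimal sketch of a three-layer evaluation schedule; structure is illustrative.
EVALUATION_SCHEDULE = {
    "pre_deployment": {
        "trigger": "before the system touches patient data in production",
        "deliverable": "Evidence Report across all relevant taxonomy categories",
    },
    "periodic": {
        "trigger": "quarterly or semi-annual, and after model updates",
        "deliverable": "refreshed Evidence Report with degradation comparison",
    },
    "incident_triggered": {
        "trigger": "reported adverse event or near-miss involving the system",
        "deliverable": "focused evaluation of the specific failure mode",
    },
}
```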
The tools to prevent healthcare AI failures exist today. The question is whether organisations will use them before the next incident, or after it.