Risk & Safety

ChatGPT Health Missed Half the Emergencies. That's Not Even the Real Problem.

Ron Diver
Founder

Every hospital in the country is now making clinical decisions influenced by AI. Deterioration alerts fire on patient monitors. Sepsis models flag risk scores. Early warning systems surface patients who may be declining. Clinicians receive these outputs, make a judgment call, and move on.

Nobody tracks whether the judgment was right.

This is a system failure — not a future risk, but a present one — and it is invisible to virtually every quality metric, safety dashboard, and regulatory framework in healthcare today. A study published last month in Nature Medicine provides the clearest evidence yet of what this failure looks like. But the study examines only the surface. The structural problem runs deeper, and it is already operating inside hospitals at scale.

What the Study Proves

Researchers at the Icahn School of Medicine at Mount Sinai conducted the first independent safety evaluation of ChatGPT Health — OpenAI's consumer health triage tool, used by an estimated 40 million people daily, per OpenAI. They tested it against 960 structured clinical scenarios and found a pattern that anyone deploying predictive models in clinical settings should recognize immediately.

The model fails where failure costs the most. Semi-urgent cases were triaged correctly 93% of the time. True emergencies were undertriaged 51.6% of the time — patients with diabetic ketoacidosis and impending respiratory failure told to see a doctor "within 24 to 48 hours" instead of going to an ED.

The model identifies danger, then discounts it. In one scenario, the system flagged early signs of respiratory failure in its own reasoning — then advised the patient to wait. The chain of thought recognized the risk. The output contradicted it. This is not a knowledge failure. It is a reasoning-to-output disconnect, and it is an architectural property of how these models generate answers.

Safety guardrails fire backwards. Suicide-risk alerts triggered in only 4 of 14 crisis scenarios. Less severe ideation produced more alerts. Active self-harm descriptions — the highest-risk presentations — produced fewer. The guardrails are calibrated to surface-level language patterns, not clinical severity.

Social framing shifts clinical recommendations. When a family member minimized symptoms, the system was 11.7 times more likely to downgrade urgency — specifically in the edge cases where correct triage matters most.

These are not random errors. They are systematic. They are predictable. And they are properties of the model architecture, not bugs that will be patched away in the next release.

Where the Study Stops

The Mount Sinai team did rigorous work. The factorial design — 16 controlled conditions per scenario — is the kind of structured stress testing most AI evaluations never attempt.
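
To make the factorial structure concrete, here is a minimal sketch of how a grid of 16 controlled conditions per scenario could be built by crossing four binary factors. The factor names below are illustrative assumptions, not the study's actual variables.

```python
from itertools import product

# Illustrative factors only -- these are assumptions for the sketch, not the
# study's manipulated variables. Four binary factors yield 2**4 = 16 conditions.
FACTORS = {
    "family_minimizes_symptoms": (False, True),
    "vague_symptom_description": (False, True),
    "prior_benign_diagnosis_mentioned": (False, True),
    "patient_downplays_severity": (False, True),
}

def factorial_conditions(factors: dict) -> list[dict]:
    """Cross every level of every factor to produce the full condition grid."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]

conditions = factorial_conditions(FACTORS)
assert len(conditions) == 16  # one base scenario becomes 16 controlled variants
```

Crossed against a set of base clinical cases, a grid like this is what lets an evaluation attribute a triage shift to a single manipulated factor rather than to noise.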

But the study evaluates what the AI recommends. It does not evaluate what happens after the recommendation is made. No one measures whether a patient follows the advice. No one measures whether a clinician reviewing the output would override it. No one captures feedback from the real world.

This is not a limitation of the study. It is the design boundary of virtually every AI evaluation in healthcare.

We test the output. We never test the system.

The most important part of the system — where a human being receives an AI-generated recommendation and decides what to do with it — is not measured, not captured, and not fed back into anything.

The Layer Nobody Is Watching

Now apply this to the AI systems already running inside hospitals.

Epic's Deterioration Index fires alerts across thousands of hospitals every day. A nurse sees the alert. She assesses the patient. Based on her clinical judgment, her knowledge of that specific patient, and her workload at that moment — three other alerts fired in the last ten minutes and she is triaging her own attention — she makes a decision. She acts, or she dismisses, or she escalates, or she ignores.

That decision is the most consequential moment in the entire system. It is not captured in any structured way. The reason is not recorded. The outcome is not linked back to the alert. The model does not learn whether its prediction was useful, whether the clinician trusted it, or whether the patient was actually deteriorating.

Ask any hospital to show you which alerts were ignored last month and which of those patients subsequently deteriorated. Most cannot. Not because they don't care — because the data does not exist.

The models fail at the extremes, just as the Mount Sinai study documents. And the systems around them fail to detect and correct those failures, because there is no feedback loop connecting the alert to the decision to the outcome.

Alert fatigue is treated as a behavioral problem — clinicians ignoring too many alerts. It is actually a system design failure. The alerts do not improve because there is no feedback. Clinicians do not trust them because trust requires evidence, and the evidence is not being collected. Models degrade silently because the thousands of human judgments made against their outputs every day vanish the moment they happen.

This is already happening. It is not monitored. And most hospital leadership cannot explain it, because the infrastructure to make it visible was never built.

Why This Study Matters Beyond ChatGPT Health

The four failure modes the study documents — central tendency bias, weak trajectory reasoning, context sensitivity, guardrail unreliability — are not unique to a consumer chatbot. They are structural properties of large language models applied to clinical decisions. They will appear in any system that generates a recommendation at the boundary between AI output and human action.

The study also proves that aggregate accuracy is the wrong metric. ChatGPT Health's overall performance looks reasonable. It collapses at the extremes, exactly where clinical AI faces its highest-stakes decisions. A deterioration model with strong sensitivity on average but poor sensitivity in the patients who are actually dying has the same failure profile. If you are only measuring aggregate accuracy, you are measuring the wrong thing.
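
A toy example of how that can hide in the numbers. The counts below are invented for illustration, not taken from the study or from any deployed model: aggregate sensitivity looks passable while sensitivity in the critical stratum collapses.

```python
# Invented counts for illustration only: true deteriorations, split by acuity,
# with how many of them the model actually flagged.
true_positives = {"moderate": 180, "critical": 8}    # flagged deteriorations
false_negatives = {"moderate": 20, "critical": 32}   # missed deteriorations

def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn)

overall = sensitivity(sum(true_positives.values()), sum(false_negatives.values()))
print(f"aggregate sensitivity: {overall:.2f}")        # 0.78 -- looks acceptable
for stratum in true_positives:
    s = sensitivity(true_positives[stratum], false_negatives[stratum])
    print(f"{stratum:>9} sensitivity: {s:.2f}")       # critical collapses to 0.20
```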

The consumer version of this gap has a Nature Medicine paper. The clinical version has every nurse in every ICU in every hospital running predictive AI — and no paper, because nobody is capturing the data.

What Is Missing

The infrastructure gap is not conceptually complex. What does not exist — and needs to — is a system that captures what clinicians actually do when an AI alert fires: whether they acted, dismissed, or escalated; why; and what happened to the patient afterward. A system that links those decisions to outcomes and surfaces the patterns — which alerts generate noise, which get overridden and shouldn't, where the model is failing by unit, by patient population, by condition. A system that feeds what it learns back to the models and back to the clinical teams, so that both can improve.
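
As a sketch of what that capture could look like in practice, here is one possible record structure: one row per alert, linking the model output, the clinician's decision, and the eventual outcome. The field names and enum values are assumptions for illustration, not an existing standard or vendor schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class ClinicianAction(str, Enum):
    ACTED = "acted"            # intervened on the basis of the alert
    ESCALATED = "escalated"    # called rapid response, notified attending, etc.
    DISMISSED = "dismissed"    # reviewed and judged not actionable
    NO_RESPONSE = "no_response"

@dataclass
class AlertDecisionRecord:
    """One row per alert: what the model said, what the clinician did, what happened."""
    alert_id: str
    model_name: str
    model_version: str
    patient_id: str
    unit: str
    risk_score: float
    fired_at: datetime
    action: ClinicianAction
    action_reason: Optional[str] = None           # coded or free-text reason
    outcome_window_hours: int = 48
    patient_deteriorated: Optional[bool] = None   # linked later from outcome data
    recorded_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```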

This is not speculative. The technical building blocks — FHIR-based event architectures, structured decision capture, outcome linkage — exist. What does not exist is the connective layer that turns alert-level AI outputs into a learning system. Without it, every clinical AI deployment is flying open-loop: generating predictions, influencing decisions, producing outcomes, and learning from none of it.
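
Continuing the sketch above, the connective layer is, at its simplest, an aggregation over those records once the outcome linkage exists. The function below is illustrative and deliberately leaves out the FHIR plumbing; it only shows the kind of question a closed loop can answer that an open one cannot: on which units dismissed alerts were followed by deterioration anyway.

```python
from collections import defaultdict

def dismissed_but_deteriorated(records: list[AlertDecisionRecord]) -> dict[str, float]:
    """Per unit: share of dismissed alerts where the patient deteriorated anyway.

    This signal only exists if the dismissal and the outcome are linked;
    without that linkage the question cannot even be asked.
    """
    dismissed = defaultdict(int)
    deteriorated = defaultdict(int)
    for r in records:
        if r.action is ClinicianAction.DISMISSED and r.patient_deteriorated is not None:
            dismissed[r.unit] += 1
            deteriorated[r.unit] += int(r.patient_deteriorated)
    return {unit: deteriorated[unit] / dismissed[unit] for unit in dismissed}
```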

The Compounding Cost

Inside hospitals right now, clinical AI systems generate alerts that directly influence clinician decisions on patients who are actively receiving care. Those decisions are not captured. Those outcomes are not linked. Those models do not improve from the thousands of human judgments made against their outputs every day.

Every day this gap persists, errors compound. Clinicians lose trust in systems that never demonstrate they are learning. Models degrade because they are never corrected. Patient safety risk increases in a way that is invisible to every dashboard, metric, and quality report in the hospital — because the data that would make it visible was never collected.

The question is not whether AI is accurate enough to be trusted with clinical decisions.

The question is whether anyone is tracking what happens when it's wrong.

Right now, in most hospitals, no one is.


Source: Icahn School of Medicine at Mount Sinai — "Research Identifies Blind Spots in AI Medical Triage", published in Nature Medicine, February 2026.


LumenHealth helps healthcare organizations build AI governance frameworks that match their risk, scale, and mission. Take the assessment to see where you stand.

Assess your organization's AI governance readiness

37 questions across five domains. Free facilitated debrief with your leadership team.

Take the Readiness Assessment →