Open framework · Free to adopt
How to evaluate healthcare AI vendors. Our framework, yours to use.
The governance, safety, implementation, and auditability rubrics we use when we advise safety-net health systems on AI procurement. Adopt them. Adapt them. Borrow whatever is useful.
01 · The problem
AI vendor decisions are board-level. Most frameworks aren’t.
A clinical AI contract is multi-year, multi-million-dollar, and carries regulatory and liability consequences most hospital purchasing processes aren’t built for. The frameworks we see committees using are either vendor-authored scorecards or spreadsheet templates stitched together the week of the decision.
This is the framework we use. Four rubrics, explicit weights, disqualifying thresholds, and a memo format designed to survive legal and regulatory review. Published here in full, including a worked example.
02 · The framework
Four rubrics. Explicit weights. Disqualifying thresholds.
01
Governance
Regulatory status, contractual terms, vendor stability, policy alignment, liability, and privacy. Failures surface at contract renewal, during an FDA inspection, or when the vendor is acquired.
6 criteria · 2 with disqualifying floors
02
Safety
Clinical evidence quality, bias and equity testing, failure modes, post-market surveillance, alert burden. A product can be FDA cleared and still fail this rubric.
5 criteria · 3 with disqualifying floors
03
Implementation
EHR/PACS integration, workflow fit across sites, training, IT burden, migration and rollback, TCO realism. Most failed AI deployments are implementation failures, not technology failures.
6 criteria · 2 with disqualifying floors
04
Auditability
Can the customer reconstruct what the model saw, what it output, when, to whom, on what version, and what the clinician did with it? Auditability is where governance commitments either hold up or collapse.
5 criteria · 3 with disqualifying floors
01 · Governance
Regulatory status, contractual terms, vendor stability, policy alignment, liability, and privacy. Failures surface at contract renewal, during an FDA inspection, or when the vendor is acquired.
Regulatory status & indications for use
25% · FDA authorization aligned with the intended deployment; PCCP for algorithm updates.
Disqualifying floor · score below 3 removes the vendor
Contractual terms & data portability
22% · SLAs tied to clinical performance, audit rights, termination-grade remedies.
Disqualifying floor · score below 2 removes the vendor
Vendor stability & financial viability
18% · Funding runway, customer concentration, continuity plan if the vendor fails or is acquired.
Policy & standards alignment
15% · CHAI, NIST AI RMF, ONC HTI-1 DSI conformance — documented, not gestured at.
Liability allocation & indemnification
10% · Caps reflect real clinical harm exposure; indemnity covers bias-related claims.
Privacy, PHI handling, & secondary data use
10% · BAA terms, subprocessor disclosure, limits on retraining with customer PHI.
02 · Safety
Clinical evidence quality, bias and equity testing, failure modes, post-market surveillance, alert burden. A product can be FDA cleared and still fail this rubric.
Clinical evidence quality & generalizability
28% · Multi-center prospective evidence representative of the deployment population.
Disqualifying floor · score below 3 removes the vendor
Bias & equity testing
22% · Disaggregated performance across race, ethnicity, age, sex, care setting.
Disqualifying floor · score below 2 removes the vendor
Failure mode characterization
18% · Documented conditions under which sensitivity or specificity degrades.
Post-market surveillance
17% · Drift detection, aggregate safety signals, customer notification commitments.
Disqualifying floor · score below 2 removes the vendor
Alert burden & output actionability
15% · Alert fatigue is a safety hazard, not a usability concern.
03 · Implementation
EHR/PACS integration, workflow fit across sites, training, IT burden, migration and rollback, TCO realism. Most failed AI deployments are implementation failures, not technology failures.
EHR & PACS integration architecture
22% · Standards-conformant FHIR / DICOMweb / HL7; multi-PACS without per-site custom work.
Disqualifying floor · score below 2 removes the vendor
Clinical workflow fit across all sites
20% · Rural transfer, non-fellowship readers, ED overreads treated as first-class.
Disqualifying floor · score below 3 removes the vendor
Training & change management
18% · Role-specific, site-specific, measured past 30/60/90 days.
IT burden & infrastructure fit
15% · Realistic FTE, calendar, and network requirements — especially at rural sites.
Migration & rollback path
13% · Tested rollback, parallel run with the incumbent, contractual off-ramp.
Total cost of ownership realism
12% · Five-year interrogable TCO the CFO can audit, including decommissioning costs.
04 · Auditability
Can the customer reconstruct what the model saw, what it output, when, to whom, on what version, and what the clinician did with it? Auditability is where governance commitments either hold up or collapse.
Model transparency & documentation
25% · Model card fit for clinical and compliance audiences, not just data scientists.
Disqualifying floor · score below 3 removes the vendor
Logging & event capture
22% · Immutable, queryable, retained per customer policy (often 7+ years).
Disqualifying floor · score below 3 removes the vendor
Production monitoring & drift detection
20% · Site-, subgroup-, and workflow-level drift, surfaced before patients encounter it.
Disqualifying floor · score below 2 removes the vendor
Incident response readiness
18% · AI-specific playbooks, MDR filing experience, customer-side integration.
Data lineage, retention, portability
15% · Every output traces to inputs and model version; full export at termination.
03 · Worked examples
Two cases. Two different lessons.
Ridgeline teaches how a single disqualifying floor overrides a favorable weighted average. Cascade teaches what to do when both vendors clear every floor — the decision stops being a pick and becomes a structured counteroffer.
Case 1 · Ridgeline Health — AI for stroke LVO detection.
Ridgeline is a 9-hospital safety-net system evaluating two vendors — NeuralStroke and OmniRad — for AI-assisted detection of large-vessel occlusion (LVO) stroke. Epic system-wide; Sectra PACS at the flagship, Fuji Synapse at the community sites. What follows is the actual evaluation memo, not a template.
Executive summary
Recommendation: phased selection of NeuralStroke, contingent on three deliverables within 90 days. NeuralStroke has the stronger clinical evidence base for stroke LVO detection, more mature post-market surveillance, and a better-characterized rural transfer workflow than OmniRad. OmniRad offers consolidation economics and a broader module footprint, but its stroke-specific evidence and community-hospital operational track record are thinner and cannot be reconciled within the decision timeline.
The committee should not sign an unconditional contract: two material gaps remain — disaggregated demographic performance for Native American and rural Medicare subgroups, and a tested multi-PACS rollback at the Fuji Synapse community sites. One disqualifying finding stands against OmniRad on auditability (logging and event capture), driven by a default retention period that falls short of Ridgeline's clinical record policy.
01 · Governance

| Criterion | NS | OR | Note |
|---|---|---|---|
| Regulatory status & indications for use · floor 3 | 4 | 3 | NS has PCCP submitted; OR narrower indication for community sites. |
| Contractual terms & data portability · floor 2 | 3 | 3 | Both: uptime SLA + audit rights; neither contracts to a clinical performance floor. |
| Vendor stability & financial viability | 3 | 4 | OR has a broader customer base and a documented continuity plan. |
| Policy & standards alignment | 4 | 3 | NS publishes an HTI-1-conformant model card; OR has one in preparation. |
| Liability allocation & indemnification | 3 | 3 | Both cap at 2x annual fees; neither includes bias indemnity by default. |
| Privacy, PHI handling, & secondary data use | 4 | 3 | NS requires explicit opt-in for secondary use; OR defaults to opt-out. |
| Weighted score | 3.50 | 3.18 | |
| Qualified | Yes | Yes | |
02 · Safety

| Criterion | NS | OR | Note |
|---|---|---|---|
| Clinical evidence quality & generalizability · floor 3 | 4 | 3 | NS has three multi-center studies; OR has one prospective plus two retrospective. |
| Bias & equity testing · floor 2 | 3 | 2 | Neither has Native American subgroup data; OR will not commit to deliver it. |
| Failure mode characterization | 4 | 3 | NS publishes a quantitative failure mode catalog; OR offers narrative only. |
| Post-market surveillance · floor 2 | 4 | 3 | NS provides site-level monitoring data; OR aggregate only. |
| Alert burden & output actionability | 3 | 2 | OR override rates trend high in reference checks; concerning given the Aidoc history. |
| Weighted score | 3.63 | 2.63 | |
| Qualified | Yes | Yes | |
03 · Implementation

| Criterion | NS | OR | Note |
|---|---|---|---|
| EHR & PACS integration architecture · floor 2 | 4 | 4 | Both have production multi-PACS Epic deployments. |
| Clinical workflow fit across all sites · floor 3 | 4 | 3 | NS has published operational results at comparable multi-tier systems. |
| Training & change management | 3 | 4 | OR provides an embedded clinical transformation specialist. |
| IT burden & infrastructure fit | 4 | 3 | OR's rural bandwidth requirements are not validated at Ridgeline's sites. |
| Migration & rollback path | 3 | 3 | Neither supports a tested per-site parallel run with Aidoc. |
| Total cost of ownership realism | 3 | 4 | OR's TCO is more interrogable; NS omits incumbent decommissioning. |
| Weighted score | 3.57 | 3.52 | |
| Qualified | Yes | Yes | |
04 · Auditability

| Criterion | NS | OR | Note |
|---|---|---|---|
| Model transparency & documentation · floor 3 | 4 | 3 | NS model card is clinical-grade; OR's is data-science-grade. |
| Logging & event capture · floor 3 | 4 | 2 (fail) | OR default retention is 18 months; Ridgeline policy requires 7 years. |
| Production monitoring & drift detection · floor 2 | 3 | 3 | Neither offers a contractual notification-latency bound. |
| Incident response readiness | 3 | 2 | OR has not supported a customer MDR filing; NS has one on record. |
| Data lineage, retention, portability | 4 | 3 | NS provides end-to-end lineage with model-version tie-in. |
| Weighted score | 3.62 | 2.60 | |
| Qualified | Yes | No | |
Column NS refers to NeuralStroke. Column OR refers to OmniRad.
Interactive · weight panel
Reweight the rubrics for your institution.
Sliders shift the weighted total. Disqualifying floors do not move. Any criterion below its floor removes the vendor from consideration regardless of how the weights are set.
Recommendation under current weights
NeuralStroke
3.58 / 5.00
Qualified
OmniRad
2.98 / 5.00
Disqualified · locked
Disqualifying floors · locked
These are not overridden by the weighting above. A score below the floor on any single criterion disqualifies the vendor.
NeuralStroke
No floor failures. Passes all disqualifying criteria.
OmniRad
- Failed · Auditability › Logging & event capture · score 2 · floor 3
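For committees rebuilding this panel in a spreadsheet or script, here is a minimal sketch of the reweighting logic in Python, assuming (as the panel does by default) an equal-weight average of the four per-rubric scores, with floor failures checked separately and never reweighted. The data structures are ours, not the site's actual code.

```python
# Hypothetical sketch of the weight panel's logic; structures are illustrative.
# Per-rubric weighted scores are taken from the Ridgeline memo tables above.
RUBRIC_SCORES = {
    "NeuralStroke": {"governance": 3.50, "safety": 3.63,
                     "implementation": 3.57, "auditability": 3.62},
    "OmniRad": {"governance": 3.18, "safety": 2.63,
                "implementation": 3.52, "auditability": 2.60},
}

# (criterion, vendor score, required floor); only OmniRad misses a floor.
FLOOR_FAILURES = {
    "NeuralStroke": [],
    "OmniRad": [("auditability / logging & event capture", 2, 3)],
}

def panel_total(vendor, rubric_weights):
    """Return (weighted total, qualified). Sliders change rubric_weights;
    floors are locked and override the total no matter the weights."""
    total_weight = sum(rubric_weights.values())
    total = sum(RUBRIC_SCORES[vendor][r] * w
                for r, w in rubric_weights.items()) / total_weight
    return round(total, 2), not FLOOR_FAILURES[vendor]

equal = {"governance": 1, "safety": 1, "implementation": 1, "auditability": 1}
print(panel_total("NeuralStroke", equal))  # (3.58, True)
print(panel_total("OmniRad", equal))       # (2.98, False): floor failure locks the result
```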
Synthesis
Averaged weighted scores favor NeuralStroke on every rubric, but the decision is not an average. OmniRad’s disqualification on auditability is the dominant signal: an AI tool that cannot be reconstructed during an adverse event review creates unbounded liability for the operator. The strongest counter-argument is consolidation — OmniRad brings a module roadmap NeuralStroke cannot match. That tradeoff is real but not dispositive in Q2 2026.
Conditions on signing (all required)
- NeuralStroke delivers disaggregated demographic performance for Native American and rural Medicare subgroups within 90 days; if not, Ridgeline exercises termination without penalty.
- NeuralStroke provides the approved FDA PCCP letter and a written mapping of algorithm update types to customer notification commitments within 60 days.
- NeuralStroke agrees to a contractual notification-latency bound for drift detection (7 days or less) and site-level monitoring access for Ridgeline’s compliance team.
- Community-site go-live gated on Aidoc PE module performance review and stroke coordinator workflow validation at Wyoming and Kansas hospitals.
- Contractual off-ramp permits revert to an Aidoc-only configuration at any community site during a 12-month stabilization window.
Tradeoffs accepted
- Single-use stroke platform rather than a consolidated multi-module radiology AI footprint.
- Smaller vendor customer base than OmniRad; elevated acquihire and continuity risk managed via source code escrow.
- Loss of OmniRad’s embedded clinical transformation specialist; Ridgeline must fund in-house change management.
Case 2 · Cascade Community Health — ambient clinical documentation.
Cascade is an 8-clinic FQHC with 60 clinicians and an integrated behavioral health model (18 BH providers embedded alongside primary care). Medicaid-majority patient panel. athenahealth system-wide. They are evaluating two ambient documentation vendors — Verba Health and Luma Notes — for a 60-seat, three-year rollout.
Executive summary
Recommendation: conditional counteroffer, vendor selection deferred 30 days. Both vendors clear every disqualifying floor. Weighted scores favor Verba on three of four rubrics; Luma leads on implementation by 0.25. The governance gap (0.53) is concentrated in a single line item — default handling of de-identified transcript reuse — which is negotiable, not structural.
The committee should not pick between the vendors today. Issue three written amendments to both and let the procurement turn on which vendor will commit in writing. For a 68% Medicaid patient panel, an opt-out default on de-identified transcript reuse is the line item that matters most, and the one negotiation should focus on.
01 · Governance

| Criterion | V | L | Note |
|---|---|---|---|
| Regulatory status & indications for use · floor 3 | 3 | 3 | Ambient scribes are not regulated as SaMD; both have an appropriate enterprise SaaS posture. |
| Contractual terms & data portability · floor 2 | 3 | 3 | Uptime SLAs and audit rights; neither contracts to a note-accuracy performance floor. |
| Vendor stability & financial viability | 4 | 3 | Verba: 400+ health-system customers, Series D. Luma: ~80 customers, Series B, 14-month runway. |
| Policy & standards alignment | 4 | 3 | Verba publishes an HTI-1-leaning model card. Luma has one in draft, not customer-accessible. |
| Liability allocation & indemnification | 3 | 3 | Both cap at 2x annual fees. Neither includes bias indemnity or clinical-error coverage. |
| Privacy, PHI handling, & secondary data use | 4 | 2 | Verba requires explicit opt-in for de-identified transcript use. Luma defaults to opt-out — a material gap for a Medicaid-majority org. |
| Weighted score | 3.43 | 2.90 | |
| Qualified | Yes | Yes | |
02 · Safety

| Criterion | V | L | Note |
|---|---|---|---|
| Clinical evidence quality & generalizability · floor 3 | 3 | 3 | Both have vendor-sponsored time-saving studies. Neither has RCT evidence on note accuracy or downstream decision quality. |
| Bias & equity testing · floor 2 | 2 | 2 | Both clear the floor at 2. Neither has published accent/dialect parity results for Spanish-speaking or AAVE-speaking patients. |
| Failure mode characterization | 4 | 3 | Verba publishes a hallucination-rate catalog by note section. Luma tracks internally; not customer-accessible. |
| Post-market surveillance · floor 2 | 3 | 3 | Both run basic drift monitoring. Neither offers a contractually bounded notification window. |
| Alert burden & output actionability | 3 | 4 | On note quality and physician rework rates, Luma trends lower in reference checks at community-health sites. |
| Weighted score | 2.96 | 2.93 | |
| Qualified | Yes | Yes | |
03 · Implementation

| Criterion | V | L | Note |
|---|---|---|---|
| EHR & PACS integration architecture · floor 2 | 4 | 4 | Both are athenahealth-certified with documented FHIR patterns. |
| Clinical workflow fit across all sites · floor 3 | 4 | 3 | Verba has a documented integrated-BH workflow. Luma treats BH encounters identically to primary care — a gap for 18 of 60 providers. |
| Training & change management | 3 | 4 | Luma has a simpler rollout; physicians are productive in under 2 weeks vs. 4-6 for Verba. |
| IT burden & infrastructure fit | 3 | 4 | Luma: browser-only, minimal IT burden. Verba: requires EHR sidecar agents. |
| Migration & rollback path | 3 | 3 | Both support contract cancellation; neither blocks the other from running in parallel. |
| Total cost of ownership realism | 3 | 4 | Luma $89/seat/month vs. Verba $149. Five-year TCO gap ~$220K at 60 seats, even accounting for rework. |
| Weighted score | 3.42 | 3.67 | |
| Qualified | Yes | Yes | |
04 · Auditability

| Criterion | V | L | Note |
|---|---|---|---|
| Model transparency & documentation · floor 3 | 4 | 3 | Verba model card is written for clinical/compliance audiences. Luma's is data-science-grade only. |
| Logging & event capture · floor 3 | 4 | 3 | Verba default retention 7 years, configurable longer. Luma default 5 years, configurable to 7 but not documented for Medicaid audit posture. |
| Production monitoring & drift detection · floor 2 | 3 | 3 | Neither offers customer-side dashboards; both publish aggregate drift reports. |
| Incident response readiness | 3 | 3 | Both have an incident playbook; neither has MDR-equivalent experience (not applicable to scribes). |
| Data lineage, retention, portability | 4 | 3 | Verba: transcript-to-note lineage with model version on every output. Luma: model version in metadata only. |
| Weighted score | 3.62 | 3.00 | |
| Qualified | Yes | Yes | |
Column V refers to Verba Health. Column L refers to Luma Notes.
Synthesis
Neither vendor is disqualified. Neither vendor dominates. Verba carries a cleaner governance and auditability posture — published model card, opt-in default for de-identified reuse, longer documented retention. Luma reads materially better on implementation: lower per-seat price, simpler IT footprint, and faster time-to-productive for clinicians. On safety, the two are indistinguishable within the measurement resolution of this rubric. The single weighted-score gap that matters is governance, and within governance the single line item that drives it is privacy_and_data_use (Verba 4, Luma 2). That line item is a default, not a floor — and defaults are negotiable at a 60-seat scale.
Counteroffer issued to both vendors (all three required to advance)
- De-identified transcript reuse defaults to opt-in at the organizational level, contractually. Any reuse for model training requires a written amendment Cascade signs separately. 90-day deliverable.
- Publish a customer-accessible model card written for clinical and compliance audiences (not data-science grade): input/output scope, training data provenance, known failure modes by note section, and update cadence. 60-day deliverable.
- Deliver a written integrated-BH workflow for the 18 embedded behavioral health providers, with a 30-day pilot at one BH-integrated clinic before system-wide rollout. 45-day deliverable; pilot gate on go-live.
Tradeoffs accepted
- Neither vendor has RCT-grade evidence on note accuracy or downstream decision quality. Cascade runs a 90-day internal note-review sampling protocol on whichever vendor is selected.
- Reference customers for 60-seat FQHCs with integrated BH are thin. Diligence requires at least two operationally similar references per vendor before signing.
- If Verba is selected, Cascade accepts a ~$43K/year uplift over Luma’s lowest bid in exchange for the stronger governance posture. If Luma is selected, Cascade accepts a model-card and lineage gap that must be closed contractually within the first year.
04 · The method
How a case becomes a memo.
01
Case inputs
Structured case facts, incumbents, constraints, committee priorities.
02
Rubric scoring
Four rubrics applied per vendor. Anchored 1-5 scoring with evidence.
03
Synthesis
Floors enforced first. Weighted scores second. Tradeoffs made explicit.
04
Recommendation memo
Conditions, risks, open questions. Format built to survive legal review.
Human judgment enters at two points: the structured case inputs (which facts matter, which do not) and the synthesis (which tradeoffs are acceptable for this institution). The rubric scoring is where LLM assistance pays off — once the anchors are written, scoring a vendor response against a criterion is a consistent, legible task that can be accelerated without losing rigor.
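As an illustration of what that task looks like, here is a hypothetical sketch of an anchored-scoring prompt. The function, field names, and anchor text are ours, not drawn from the published rubrics; a human reviewer still owns the final score.

```python
# Illustrative only: framing one anchored criterion for LLM-assisted scoring.
def anchored_scoring_prompt(criterion: str, anchors: dict, evidence: str) -> str:
    # Render the 1-5 anchors in score order so the model scores against them.
    anchor_lines = "\n".join(f"{s}: {text}" for s, text in sorted(anchors.items()))
    return (
        f"Criterion: {criterion}\n"
        f"Scoring anchors (1-5):\n{anchor_lines}\n\n"
        f"Vendor evidence:\n{evidence}\n\n"
        "Assign the single anchor score (1-5) that best matches the evidence, "
        "and cite the specific passages that justify it."
    )

prompt = anchored_scoring_prompt(
    "Logging & event capture",
    {1: "No audit log.",
     3: "Queryable log, retention configurable.",
     5: "Immutable, queryable log retained per customer policy."},
    "Vendor SOC 2 report and API docs describe an append-only event log...",
)
```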
Disqualifying floors override weighted scores. Any criterion with a documented floor removes a vendor from consideration if that floor is missed, regardless of how well they score on the remaining weighted criteria. This is the part that breaks the “weighted average wins” instinct most purchasing frameworks inherit.
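In code, that rule is a two-step gate. A minimal sketch, assuming a rubric structure along the lines rubrics.yaml is described as having; the field names below are illustrative, not the published schema.

```python
# Floors first, weighted average second. Rubric structure is an assumption
# modeled on the description of rubrics.yaml; check the real file's schema.
import yaml  # PyYAML

RUBRIC = yaml.safe_load("""
rubric: auditability
criteria:
  - {name: model_transparency,    weight: 0.25, floor: 3}
  - {name: logging_event_capture, weight: 0.22, floor: 3}
  - {name: monitoring_drift,      weight: 0.20, floor: 2}
  - {name: incident_response,     weight: 0.18, floor: null}
  - {name: data_lineage,          weight: 0.15, floor: null}
""")

def evaluate(rubric, scores):
    """Any floor miss disqualifies, regardless of the weighted total."""
    failures = [c["name"] for c in rubric["criteria"]
                if c["floor"] is not None and scores[c["name"]] < c["floor"]]
    weighted = sum(c["weight"] * scores[c["name"]] for c in rubric["criteria"])
    return {"weighted": round(weighted, 2),
            "qualified": not failures,
            "floor_failures": failures}

# OmniRad's auditability scores from Case 1: weighted 2.60, but the logging
# floor miss alone removes the vendor; the weighted number never gets a vote.
print(evaluate(RUBRIC, {"model_transparency": 3, "logging_event_capture": 2,
                        "monitoring_drift": 3, "incident_response": 2,
                        "data_lineage": 3}))
```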
What this framework is not. It does not verify FDA clearance on your behalf, negotiate contractual terms, or replace clinical safety governance. It produces a defensible decision record — the written artifact a compliance officer, board member, or regulator can read and understand later. The facilitation work, contract negotiation, and implementation oversight still happen in the room with the committee.
05 · Adopt it
Take this framework for your institution.
YAML
Rubrics
All four rubrics with criteria, weights, scoring anchors, disqualifying floors, and committee probes. Machine-readable, human-editable.
Download rubrics.yaml →
PDF · Case 1
Ridgeline — stroke LVO
The disqualifying-floor case. How a single auditability gap overrides a favorable weighted average. Board-ready memo format.
Download ridgeline-stroke-lvo.pdf →
PDF · Case 2
Cascade — ambient scribes
The conditional-counteroffer case. What to do when no one fails a floor and the decision becomes a written amendment, not a pick.
Download cascade-ambient-scribes.pdf →
Published under MIT license. Attribution appreciated, not required. Adapt to your institutional context. If you use it in a live procurement, we would love to hear how it held up — see the contact note at the bottom of the page.
06 · Who built this
Lumen advises safety-net health systems on AI governance.
We work with FQHCs, rural clinics, tribal health programs, and community hospitals navigating AI procurement, governance setup, and regulatory readiness. This framework came out of real engagements — not a whiteboard.
If you want us to facilitate your next vendor evaluation or run a governance review with your board, write directly. No form. No intake queue. You’ll talk to a person.
Open framework · MIT license · Revised 2026-04-19
Vendor names (NeuralStroke, OmniRad, Verba Health, Luma Notes) are illustrative; cases are composites based on engagement patterns.