Open framework · Free to adopt

How to evaluate healthcare AI vendors. Our framework, yours to use.

The governance, safety, implementation, and auditability rubrics we use when we advise safety-net health systems on AI procurement. Adopt them. Adapt them. Borrow whatever is useful.

01 · The problem

AI vendor decisions are board-level. Most frameworks aren’t.

A clinical AI contract is multi-year, multi-million-dollar, and carries regulatory and liability consequences most hospital purchasing processes aren’t built for. The frameworks we see committees using are either vendor-authored scorecards or spreadsheet templates stitched together the week of the decision.

This is the framework we use. Four rubrics, explicit weights, disqualifying thresholds, and a memo format designed to survive legal and regulatory review. Published here in full, including a worked example.

02 · The framework

Four rubrics. Explicit weights. Disqualifying thresholds.

01 · Governance

Regulatory status, contractual terms, vendor stability, policy alignment, liability, and privacy. Failures surface at contract renewal, during an FDA inspection, or when the vendor is acquired.

  • Regulatory status & indications for use

    25%

    FDA authorization aligned with the intended deployment; PCCP for algorithm updates.

    Disqualifying floor · score below 3 removes the vendor

  • Contractual terms & data portability

    22%

    SLAs tied to clinical performance, audit rights, termination-grade remedies.

    Disqualifying floor · score below 2 removes the vendor

  • Vendor stability & financial viability

    18%

    Funding runway, customer concentration, continuity plan if vendor fails or is acquired.

  • Policy & standards alignment

    15%

    CHAI, NIST AI RMF, ONC HTI-1 DSI conformance — documented, not gestured at.

  • Liability allocation & indemnification

    10%

    Caps reflect real clinical harm exposure; indemnity covers bias-related claims.

  • Privacy, PHI handling, & secondary data use

    10%

    BAA terms, subprocessor disclosure, limits on retraining with customer PHI.

02 · Safety

Clinical evidence quality, bias and equity testing, failure modes, post-market surveillance, alert burden. A product can be FDA cleared and still fail this rubric.

  • Clinical evidence quality & generalizability

    28%

    Multi-center prospective evidence representative of the deployment population.

    Disqualifying floor · score below 3 removes the vendor

  • Bias & equity testing

    22%

    Disaggregated performance across race, ethnicity, age, sex, care setting.

    Disqualifying floor · score below 2 removes the vendor

  • Failure mode characterization

    18%

    Documented conditions under which sensitivity or specificity degrades.

  • Post-market surveillance

    17%

    Drift detection, aggregate safety signals, customer notification commitments.

    Disqualifying floor · score below 2 removes the vendor

  • Alert burden & output actionability

    15%

    Alert fatigue is a safety hazard, not a usability concern.

03 · Implementation

EHR/PACS integration, workflow fit across sites, training, IT burden, migration and rollback, TCO realism. Most failed AI deployments are implementation failures, not technology failures.

  • EHR & PACS integration architecture

    22%

    Standards-conformant FHIR / DICOMweb / HL7; multi-PACS without per-site custom work.

    Disqualifying floor · score below 2 removes the vendor

  • Clinical workflow fit across all sites

    20%

    Rural transfer, non-fellowship readers, ED overreads treated as first-class.

    Disqualifying floor · score below 3 removes the vendor

  • Training & change management

    18%

    Role-specific, site-specific, measured past 30/60/90 days.

  • IT burden & infrastructure fit

    15%

    Realistic FTE, calendar, and network requirements — especially at rural sites.

  • Migration & rollback path

    13%

    Tested rollback, parallel-run with incumbent, contractual off-ramp.

  • Total cost of ownership realism

    12%

    Five-year interrogable TCO the CFO can audit, including decommissioning costs.

04 · Auditability

Can the customer reconstruct what the model saw, what it output, when, to whom, on what version, and what the clinician did with it? Auditability is where governance commitments either hold up or collapse.

  • Model transparency & documentation

    25%

    Model card fit for clinical and compliance audiences, not just data scientists.

    Disqualifying floor · score below 3 removes the vendor

  • Logging & event capture

    22%

    Immutable, queryable, retained per customer policy (often 7+ years).

    Disqualifying floor · score below 3 removes the vendor

  • Production monitoring & drift detection

    20%

    Site-, subgroup-, and workflow-level drift, surfaced before patients encounter it.

    Disqualifying floor · score below 2 removes the vendor

  • Incident response readiness

    18%

    AI-specific playbooks, MDR filing experience, customer-side integration.

  • Data lineage, retention, portability

    15%

    Every output traces to inputs and model version; full export at termination.
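In practice each rubric above reduces to plain data: a criterion name, a weight, and an optional disqualifying floor. A minimal sketch in Python using the Governance rubric (the encoding is ours for illustration, not a published schema):

```python
# Governance rubric: (criterion, weight, disqualifying floor or None).
# A floor of 3 means a score below 3 removes the vendor outright.
GOVERNANCE = [
    ("Regulatory status & indications for use",     0.25, 3),
    ("Contractual terms & data portability",        0.22, 2),
    ("Vendor stability & financial viability",      0.18, None),
    ("Policy & standards alignment",                0.15, None),
    ("Liability allocation & indemnification",      0.10, None),
    ("Privacy, PHI handling, & secondary data use", 0.10, None),
]

# Weights within a rubric always sum to 1.0.
assert abs(sum(w for _, w, _ in GOVERNANCE) - 1.0) < 1e-9
```

The same shape holds for the other three rubrics; only the names, weights, and floors change.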

03 · Worked examples

Two cases. Two different lessons.

Ridgeline teaches how a single disqualifying floor overrides a favorable weighted average. Cascade teaches what to do when both vendors clear every floor — the decision stops being a pick and becomes a structured counteroffer.

Case 1 · Ridgeline Health — AI for stroke LVO detection.

Ridgeline is a 9-hospital safety-net system evaluating two vendors — NeuralStroke and OmniRad — for AI-assisted stroke detection (LVO: large-vessel occlusion). Epic system-wide; Sectra PACS at the flagship, Fuji Synapse at community sites. What follows is the actual evaluation memo, not a template.

Executive summary

Recommendation: phased selection of NeuralStroke, contingent on three deliverables within 90 days. NeuralStroke has the stronger clinical evidence base for stroke LVO detection, more mature post-market surveillance, and a better-characterized rural transfer workflow than OmniRad. OmniRad offers consolidation economics and a broader module footprint, but its stroke-specific evidence and community-hospital operational track record are thinner and cannot be reconciled within the decision timeline.

The committee should not sign an unconditional contract: two material gaps remain — disaggregated demographic performance for Native American and rural Medicare subgroups, and a tested multi-PACS rollback at the Fuji Synapse community sites. One disqualifying finding stands against OmniRad on auditability (logging and event capture), driven by a default retention period that falls short of Ridgeline's clinical-record policy.

01 · Governance

  • Regulatory status & indications for use · floor 3 · NS 4, OR 3. NS has PCCP submitted; OR narrower indication for community sites.
  • Contractual terms & data portability · floor 2 · NS 3, OR 3. Both: uptime SLA + audit rights; neither contracts to a clinical performance floor.
  • Vendor stability & financial viability · NS 3, OR 4. OR has a broader customer base and a documented continuity plan.
  • Policy & standards alignment · NS 4, OR 3. NS publishes an HTI-1-conformant model card; OR has it in preparation.
  • Liability allocation & indemnification · NS 3, OR 3. Both cap at 2x annual fees; neither includes bias indemnity by default.
  • Privacy, PHI handling, & secondary data use · NS 4, OR 3. NS requires explicit opt-in for secondary use; OR defaults to opt-out.

Weighted score: NS 3.57 · OR 3.20
Qualified: NS yes · OR yes

02 · Safety

  • Clinical evidence quality & generalizability · floor 3 · NS 4, OR 3. NS has three multi-center studies; OR has one prospective plus two retrospective.
  • Bias & equity testing · floor 2 · NS 3, OR 2. Neither has Native American subgroup data; OR will not commit to deliver.
  • Failure mode characterization · NS 4, OR 3. NS publishes a quantitative failure-mode catalog; OR offers narrative only.
  • Post-market surveillance · floor 2 · NS 4, OR 3. NS provides site-level monitoring data; OR aggregate only.
  • Alert burden & output actionability · NS 3, OR 2. OR override rates in reference checks trend high; concerning given Aidoc history.

Weighted score: NS 3.58 · OR 2.62
Qualified: NS yes · OR yes

03 · Implementation

  • EHR & PACS integration architecture · floor 2 · NS 4, OR 4. Both have production multi-PACS Epic deployments.
  • Clinical workflow fit across all sites · floor 3 · NS 4, OR 3. NS has published operational results at comparable multi-tier systems.
  • Training & change management · NS 3, OR 4. OR provides an embedded clinical transformation specialist.
  • IT burden & infrastructure fit · NS 4, OR 3. OR's rural bandwidth requirements are not validated at Ridgeline's sites.
  • Migration & rollback path · NS 3, OR 3. Neither supports tested per-site parallel-run with Aidoc.
  • Total cost of ownership realism · NS 3, OR 4. OR's TCO is more interrogable; NS omits incumbent decommissioning.

Weighted score: NS 3.55 · OR 3.48
Qualified: NS yes · OR yes

04 · Auditability

  • Model transparency & documentation · floor 3 · NS 4, OR 3. NS model card is clinical-grade; OR is data-science-grade.
  • Logging & event capture · floor 3 · NS 4, OR 2 (below floor). OR default retention 18 months; Ridgeline policy requires 7 years.
  • Production monitoring & drift detection · floor 2 · NS 3, OR 3. Neither offers a contractual notification-latency bound.
  • Incident response readiness · NS 3, OR 2. OR has not supported a customer MDR filing; NS has one on record.
  • Data lineage, retention, portability · NS 4, OR 3. NS provides end-to-end lineage with model-version tie-in.

Weighted score: NS 3.64 · OR 2.58
Qualified: NS yes · OR no

NS refers to NeuralStroke; OR refers to OmniRad.

Interactive · weight panel

Reweight the rubrics for your institution.

Sliders shift the weighted total. Disqualifying floors do not move. Any criterion below its floor removes the vendor from consideration regardless of how the weights are set.

Recommendation under current weights (25% per rubric)

  • NeuralStroke · weighted total 3.58 / 5.00 · qualifies on all disqualifying floors
  • OmniRad · weighted total 2.97 / 5.00 · disqualified

Disqualifying floors · locked

These are not overridden by the weighting above. A score below the floor on any single criterion disqualifies the vendor.

  • NeuralStroke: no floor failures; passes all disqualifying criteria.
  • OmniRad: failed Auditability · Logging & event capture · score 2, floor 3.
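The locked-floor behavior the panel describes is easy to state in code: floors are checked before any weighting, and a single miss ends the evaluation. A sketch (a hypothetical helper, not the firm's actual tooling; criterion keys are shortened for readability):

```python
def evaluate(scores, weights, floors):
    """Floors first, weights second: one criterion below its floor
    disqualifies the vendor regardless of the weighted average."""
    for criterion, score in scores.items():
        floor = floors.get(criterion)
        if floor is not None and score < floor:
            return False, criterion        # disqualified, and on what
    weighted = sum(weights[c] * s for c, s in scores.items())
    return True, round(weighted, 2)

# OmniRad's Auditability scores from the memo above:
scores  = {"transparency": 3, "logging": 2, "monitoring": 3,
           "incident": 2, "lineage": 3}
weights = {"transparency": 0.25, "logging": 0.22, "monitoring": 0.20,
           "incident": 0.18, "lineage": 0.15}
floors  = {"transparency": 3, "logging": 3, "monitoring": 2}

qualified, detail = evaluate(scores, weights, floors)
# qualified is False: the logging floor fails before weights ever apply
```

Reweighting only changes the `weights` dict; the early return on a floor miss is why no slider setting can requalify a disqualified vendor.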

Synthesis

Averaged weighted scores favor NeuralStroke on every rubric, but the decision is not an average. OmniRad’s disqualification on auditability is the dominant signal: an AI tool that cannot be reconstructed during an adverse event review creates unbounded liability for the operator. The strongest counter-argument is consolidation — OmniRad brings a module roadmap NeuralStroke cannot match. That tradeoff is real but not dispositive in Q2 2026.

Conditions on signing (all required)

  1. NeuralStroke delivers disaggregated demographic performance for Native American and rural Medicare subgroups within 90 days; if not, Ridgeline exercises termination without penalty.
  2. NeuralStroke provides the approved FDA PCCP letter and a written mapping of algorithm update types to customer notification commitments within 60 days.
  3. NeuralStroke agrees to a contractual notification-latency bound for drift detection (7 days or less) and site-level monitoring access for Ridgeline’s compliance team.
  4. Community-site go-live gated on Aidoc PE module performance review and stroke coordinator workflow validation at Wyoming and Kansas hospitals.
  5. Contractual off-ramp permits revert to an Aidoc-only configuration at any community site during a 12-month stabilization window.

Tradeoffs accepted

  • Single-use stroke platform rather than a consolidated multi-module radiology AI footprint.
  • Smaller vendor customer base than OmniRad; elevated acquihire and continuity risk managed via source code escrow.
  • Loss of OmniRad’s embedded clinical transformation specialist; Ridgeline must fund in-house change management.

Case 2 · Cascade Community Health — ambient clinical documentation.

Cascade is an 8-clinic FQHC with 60 clinicians and an integrated behavioral health model (18 BH providers embedded alongside primary care). Medicaid-majority patient panel. athenahealth system-wide. They are evaluating two ambient documentation vendors — Verba Health and Luma Notes — for a 60-seat, three-year rollout.

Executive summary

Recommendation: conditional counteroffer, vendor selection deferred 30 days. Both vendors clear every disqualifying floor. Weighted scores favor Verba on three of four rubrics; Luma leads on implementation by 0.25. The governance gap (0.53) is concentrated in a single line item — default handling of de-identified transcript reuse — which is negotiable, not structural.

The committee should not pick between the vendors today. Issue three written amendments to both and let the procurement turn on which vendor will commit in writing. For a 68% Medicaid patient panel, an opt-out default on de-identified transcript reuse is the line item that matters most, and the one negotiation should focus on.

01 · Governance

  • Regulatory status & indications for use · floor 3 · V 3, L 3. Ambient scribes are not regulated as SaMD; both have appropriate enterprise SaaS posture.
  • Contractual terms & data portability · floor 2 · V 3, L 3. Uptime SLAs and audit rights; neither contracts to a note-accuracy performance floor.
  • Vendor stability & financial viability · V 4, L 3. Verba: 400+ health-system customers, Series D. Luma: ~80 customers, Series B, 14-month runway.
  • Policy & standards alignment · V 4, L 3. Verba publishes an HTI-1-leaning model card. Luma has one in draft, not customer-accessible.
  • Liability allocation & indemnification · V 3, L 3. Both cap at 2x annual fees. Neither includes bias indemnity or clinical-error coverage.
  • Privacy, PHI handling, & secondary data use · V 4, L 2. Verba requires explicit opt-in for de-identified transcript use. Luma defaults to opt-out — a material gap for a Medicaid-majority org.

Weighted score: V 3.43 · L 2.90
Qualified: V yes · L yes

02 · Safety

  • Clinical evidence quality & generalizability · floor 3 · V 3, L 3. Both have vendor-sponsored time-saving studies. Neither has RCT evidence on note accuracy or downstream decision quality.
  • Bias & equity testing · floor 2 · V 2, L 2. Both clear the floor at 2. Neither has published accent/dialect parity results for Spanish-speaking or AAVE-speaking patients.
  • Failure mode characterization · V 4, L 3. Verba publishes a hallucination-rate catalog by note section. Luma tracks internally; not customer-accessible.
  • Post-market surveillance · floor 2 · V 3, L 3. Both run basic drift monitoring. Neither offers a contractually bounded notification window.
  • Alert burden & output actionability · V 3, L 4. Note quality / physician rework rates: Luma trends lower in reference checks at community-health sites.

Weighted score: V 2.96 · L 2.93
Qualified: V yes · L yes

03 · Implementation

  • EHR & PACS integration architecture · floor 2 · V 4, L 4. Both are athenahealth-certified with documented FHIR patterns.
  • Clinical workflow fit across all sites · floor 3 · V 4, L 3. Verba has a documented integrated-BH workflow. Luma treats BH encounters identically to primary care — a gap for 18 of 60 providers.
  • Training & change management · V 3, L 4. Luma has a simpler rollout; physicians productive in under 2 weeks vs. 4-6 for Verba.
  • IT burden & infrastructure fit · V 3, L 4. Luma: browser-only, minimal IT burden. Verba: requires EHR sidecar agents.
  • Migration & rollback path · V 3, L 3. Both support contract cancellation; neither blocks the other from running in parallel.
  • Total cost of ownership realism · V 3, L 4. Luma $89/seat/month vs. Verba $149. 5yr TCO gap ~$220K at 60 seats even accounting for rework.

Weighted score: V 3.42 · L 3.67
Qualified: V yes · L yes

04 · Auditability

  • Model transparency & documentation · floor 3 · V 4, L 3. Verba model card written for clinical/compliance audiences. Luma is data-science-grade only.
  • Logging & event capture · floor 3 · V 4, L 3. Verba default retention 7yr, configurable longer. Luma default 5yr, configurable to 7yr but not documented for Medicaid audit posture.
  • Production monitoring & drift detection · floor 2 · V 3, L 3. Neither offers customer-side dashboards; both publish aggregate drift reports.
  • Incident response readiness · V 3, L 3. Both have an incident playbook; neither has MDR-equivalent experience (not applicable to scribes).
  • Data lineage, retention, portability · V 4, L 3. Verba: transcript-to-note lineage with model version on every output. Luma: model version in metadata only.

Weighted score: V 3.62 · L 3.00
Qualified: V yes · L yes

V refers to Verba Health; L refers to Luma Notes.

Synthesis

Neither vendor is disqualified. Neither vendor dominates. Verba carries a cleaner governance and auditability posture — published model card, opt-in default for de-identified reuse, longer documented retention. Luma reads materially better on implementation: lower per-seat price, simpler IT footprint, and faster time-to-productive for clinicians. On safety, the two are indistinguishable within the measurement resolution of this rubric. The single weighted-score gap that matters is governance, and within governance the single line item that drives it is privacy, PHI handling, & secondary data use (Verba 4, Luma 2). That line item is a default, not a floor — and defaults are negotiable at a 60-seat scale.
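That claim checks out arithmetically. Decomposing the 0.53 governance gap criterion by criterion, with the scores and weights from the governance table above (a scratch calculation, not part of the published framework):

```python
# Governance weights and the Verba / Luma scores from the table above.
weights = [0.25, 0.22, 0.18, 0.15, 0.10, 0.10]
verba   = [3, 3, 4, 4, 3, 4]
luma    = [3, 3, 3, 3, 3, 2]

# Per-criterion contribution to the gap (Verba minus Luma, weighted).
gap_by_item = [w * (v - l) for w, v, l in zip(weights, verba, luma)]

total_gap = round(sum(gap_by_item), 2)   # 0.53, matching the memo
largest   = max(gap_by_item)             # 0.20: the privacy line item,
                                         # the final criterion in the list
```

Stability and policy alignment contribute 0.18 and 0.15 respectively; the privacy default is the single largest contributor at 0.20.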

Counteroffer issued to both vendors (all three required to advance)

  1. De-identified transcript reuse defaults to opt-in at the organizational level, contractually. Any reuse for model training requires a written amendment Cascade signs separately. 90-day deliverable.
  2. Publish a customer-accessible model card written for clinical and compliance audiences (not data-science grade): input/output scope, training data provenance, known failure modes by note section, and update cadence. 60-day deliverable.
  3. Deliver a written integrated-BH workflow for the 18 embedded behavioral health providers, with a 30-day pilot at one BH-integrated clinic before system-wide rollout. 45-day deliverable; pilot gate on go-live.

Tradeoffs accepted

  • Neither vendor has RCT-grade evidence on note accuracy or downstream decision quality. Cascade runs a 90-day internal note-review sampling protocol on whichever vendor is selected.
  • Reference customers for 60-seat FQHCs with integrated BH are thin. Diligence requires at least two operationally similar references per vendor before signing.
  • If Verba is selected, Cascade accepts a ~$43K/year uplift over Luma’s lowest bid in exchange for the stronger governance posture. If Luma is selected, Cascade accepts a model-card and lineage gap that must be closed contractually within the first year.

04 · The method

How a case becomes a memo.

01

Case inputs

Structured case facts, incumbents, constraints, committee priorities.

02

Rubric scoring

Four rubrics applied per vendor. Anchored 1-5 scoring with evidence.

03

Synthesis

Floors enforced first. Weighted scores second. Tradeoffs made explicit.

04

Recommendation memo

Conditions, risks, open questions. Format built to survive legal review.

Human judgment enters at two points: the structured case inputs (which facts matter, which do not) and the synthesis (which tradeoffs are acceptable for this institution). The rubric scoring is where LLM assistance pays off — once the anchors are written, scoring a vendor response against a criterion is a consistent, legible task that can be accelerated without losing rigor.
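Anchored scoring is what makes that task legible: each criterion carries a written description of what a 1 through 5 looks like, so scoring becomes matching evidence against text rather than a gut call. A hypothetical anchor set for the logging criterion — the wording is our illustration, not the firm's published anchors, though level 4 mirrors the rubric's stated expectation:

```python
# Hypothetical 1-5 anchors for "Logging & event capture" (floor: 3).
LOGGING_ANCHORS = {
    1: "No event capture beyond application error logs.",
    2: "Outputs logged, but mutable and missing model-version stamps.",
    3: "Immutable logs with model version; retention below customer policy.",
    4: "Immutable, queryable logs retained per customer policy (often 7+ years).",
    5: "All of level 4, plus customer-side query access and export on demand.",
}

FLOOR = 3  # a vendor matching only anchors 1-2 is removed outright

assert sorted(LOGGING_ANCHORS) == [1, 2, 3, 4, 5]
```

Once anchors like these exist in writing, scoring a vendor response is exactly the consistent, legible task described above.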

Disqualifying floors override weighted scores. Any criterion with a documented floor removes a vendor from consideration if that floor is missed, regardless of how well they score on the remaining weighted criteria. This is the part that breaks the “weighted average wins” instinct most purchasing frameworks inherit.

What this framework is not. It does not verify FDA clearance on your behalf, negotiate contractual terms, or replace clinical safety governance. It produces a defensible decision record — the written artifact a compliance officer, board member, or regulator can read and understand later. The facilitation work, contract negotiation, and implementation oversight still happen in the room with the committee.

05 · Who built this

Lumen advises safety-net health systems on AI governance.

We work with FQHCs, rural clinics, tribal health programs, and community hospitals navigating AI procurement, governance setup, and regulatory readiness. This framework came out of real engagements — not a whiteboard.

If you want us to facilitate your next vendor evaluation or run a governance review with your board, write directly. No form. No intake queue. You’ll talk to a person.

Open framework · MIT license · Revised 2026-04-19

Vendor names (NeuralStroke, OmniRad, Verba Health, Luma Notes) are illustrative; cases are composites based on engagement patterns.