# Lumen Healthcare AI Vendor Evaluation — Rubrics
# Published under MIT license. Adapt freely.
# Source: https://lumenhealth.ai/vendor-evaluation
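
# This file does not prescribe how criterion scores combine. The sketch below is
# one reasonable reading of the schema used throughout, assuming each criterion
# receives an anchor score of 1-5, the rubric score is the weight-normalized
# sum, and any criterion scoring below its minimum_threshold fails the rubric
# regardless of the weighted total. Function and variable names are illustrative.
#
#   def score_rubric(rubric, scores):
#       """rubric: one parsed document from this file; scores: {criterion_id: 1-5}."""
#       failed = [c["id"] for c in rubric["criteria"]
#                 if scores[c["id"]] < c.get("minimum_threshold", 1)]
#       weighted = sum(c["weight"] * scores[c["id"]] for c in rubric["criteria"])
#       return {"weighted_score": round(weighted, 2),
#               "passed": not failed,
#               "failed_thresholds": failed}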

# ============================================================
# Governance
# ============================================================

id: governance
name: Governance
version: "0.1"
description: >
  Evaluates whether a vendor's regulatory, contractual, and corporate posture is
  compatible with a health system's obligations as the legal operator of an AI
  tool used in patient care. Governance failures are rarely discovered at the
  demo; they surface at contract renewal, during an FDA inspection, after a
  near-miss, or when a vendor is acquired. This rubric is intended to expose
  those failures before the contract is signed.

criteria:
  - id: regulatory_status
    name: Regulatory status and indications for use
    weight: 0.25
    description: >
      Whether the product has appropriate FDA authorization (510(k), De Novo, or
      PMA), whether the cleared indications for use match the intended clinical
      deployment, and whether the vendor has a Predetermined Change Control
      Plan (PCCP) on file for algorithm updates.
    scoring_anchors:
      "1": >
        No FDA clearance for the intended use; vendor markets the product for
        indications not covered by any clearance; or clearance was withdrawn.
        No PCCP and no commitment to submit one.
      "2": >
        Cleared for a narrower population or modality than the intended
        deployment (e.g., cleared for adults but marketed for use in adolescent
        stroke patients). Vendor acknowledges the gap but has no remediation
        plan.
      "3": >
        Cleared for the intended modality and adult use. Indications language
        is compatible with deployment in the flagship setting but does not
        explicitly cover all planned deployment sites (e.g., rural transfer
        workflow). No PCCP, but the vendor has committed to submitting one.
      "4": >
        Cleared for the full intended use including the deployment sites and
        patient populations. PCCP submitted or approved; vendor can describe
        which algorithm changes require a new submission vs. which are
        covered under the PCCP.
      "5": >
        Cleared for the full intended use with language that explicitly covers
        the deployment workflow. Approved PCCP in place with customer-facing
        change notification commitments. Vendor provides the clearance letter,
        Summary of Safety and Effectiveness, and post-market surveillance
        obligations on request.
    minimum_threshold: 3
    probes:
      - "Is the product FDA 510(k) cleared, and are the cleared indications for use aligned with the intended deployment modality, population, and care setting?"
      - "Does the vendor have an approved or submitted Predetermined Change Control Plan, and which types of algorithm updates are covered under it?"
      - "Does the cleared indication cover use at community or rural sites, or is it limited to academic/tertiary settings?"
      - "Are the vendor's marketing claims (e.g., outcomes, time-to-treatment reductions) consistent with the cleared indications, or do they imply off-label use?"

  - id: contractual_terms
    name: Contractual terms, performance commitments, and data portability
    weight: 0.22
    description: >
      Whether the contract includes SLAs tied to clinical performance (not just
      uptime), rights to audit model performance and demographic subgroup
      reports, data portability at termination, and remedies that are usable in
      practice rather than limited to fee refunds.
    scoring_anchors:
      "1": >
        Standard SaaS contract. No performance SLAs beyond uptime. Termination
        fees punitive. No data portability clause. Remedies limited to pro-rata
        refund.
      "2": >
        Uptime SLA only. Vendor refuses to contract to clinical performance
        thresholds or audit rights. Data return at termination is available but
        in a proprietary format.
      "3": >
        Uptime SLA plus a right-to-audit clause. Vendor will commit to a
        clinical performance reporting cadence but not to remediation if
        thresholds are missed. Data portability in a documented open format.
      "4": >
        SLAs cover uptime, notification latency, and a floor for model
        sensitivity in production. Customer has audit rights including access
        to disaggregated performance reports. Remedies include credits and an
        exit path.
      "5": >
        Performance SLAs with clinically meaningful thresholds and remedies
        that include termination rights without penalty if thresholds are
        chronically breached. Full audit rights. Data portability and model
        output history returned in standard formats at termination with a
        contractually defined transition period.
    minimum_threshold: 2
    probes:
      - "Does the contract include SLAs for notification latency and for the minimum clinical performance threshold the model must maintain in production?"
      - "What are the customer's audit rights over model performance, including disaggregated subgroup reports?"
      - "At termination, what data does the customer receive back, in what format, and over what transition period?"
      - "Are remedies for SLA breaches limited to fee credits, or do they include termination-for-cause rights?"

  - id: vendor_stability
    name: Vendor stability and financial viability
    weight: 0.18
    description: >
      Whether the vendor is likely to still be in business in five years -
      funding runway, revenue concentration, customer count - and what happens
      to the deployment if the vendor is acquired, sunsets the product, or
      fails.
    scoring_anchors:
      "1": >
        Pre-revenue or near cash-out. Two or three customers represent the
        majority of revenue. No source code escrow or continuity provisions.
        Credible risk of insolvency within the contract term.
      "2": >
        Funded but burning cash. Narrow customer base. No source escrow. Would
        likely be acquihired rather than survive standalone.
      "3": >
        Series B or later with 18+ months runway. Diversified customer base.
        Basic continuity commitments (advance notice of sunset, data export on
        shutdown) but no source escrow.
      "4": >
        Profitable or well-capitalized with 24+ months runway. Broad customer
        base. Source code escrow and a documented business continuity plan.
      "5": >
        Public company or equivalent financial transparency, or a strategic
        operator with multi-year profitability. Source code escrow with a
        technical trustee, tested continuity procedures, and assignment
        protections in the contract.
    probes:
      - "What is the vendor's funding stage, cash runway, and customer concentration (share of revenue from the top 3 customers)?"
      - "Is there a source code escrow arrangement, and under what trigger conditions does the customer gain access?"
      - "In an acquisition, what change-of-control protections does the customer have, including pricing and product continuity?"
      - "What is the documented plan if the vendor sunsets the product — notification period, data export, and parallel-run support?"

  - id: policy_alignment
    name: Policy and standards alignment
    weight: 0.15
    description: >
      Whether the vendor's published governance posture aligns with CHAI
      assurance standards, NIST AI RMF, ONC HTI-1 DSI transparency
      requirements, and the customer's internal AI governance policies.
    scoring_anchors:
      "1": >
        No public governance documentation. Unable or unwilling to map their
        practices to any recognized framework.
      "2": >
        References frameworks in marketing material but cannot demonstrate how
        internal practices map to them. No HTI-1 DSI attributes prepared.
      "3": >
        Publishes a basic model card. Has partially implemented NIST AI RMF
        Map/Measure functions. Can produce HTI-1 source attributes on request.
      "4": >
        Publishes a model card conformant with HTI-1 predictive DSI source
        attribute requirements. Has implemented CHAI-aligned assurance
        processes and provides documentation on request.
      "5": >
        Actively conforms to the CHAI Assurance Standards Guide, NIST AI RMF, and
        HTI-1 DSI source attribute disclosure. Provides customer-facing
        documentation packs that a compliance officer can use without
        translation.
    probes:
      - "Has the vendor published a model card covering training data, intended use, performance, limitations, and known failure modes?"
      - "Can the vendor produce the HTI-1 predictive DSI source attributes (if the product is embedded in certified health IT) without a custom engagement?"
      - "How does the vendor's internal governance map to NIST AI RMF Map, Measure, Manage, and Govern functions?"
      - "Does the vendor participate in CHAI or an equivalent assurance program, and can they share recent assurance reports?"

  - id: liability_allocation
    name: Liability allocation and indemnification
    weight: 0.10
    description: >
      How liability for model error, bias-driven harm, regulatory noncompliance,
      and IP infringement is allocated, and whether the caps and exclusions
      reflect a serious commitment to stand behind the product.
    scoring_anchors:
      "1": >
        Liability capped at one month of fees. Broad disclaimers of clinical
        liability. No indemnification for IP infringement, regulatory
        noncompliance, or bias-related claims.
      "2": >
        Liability capped at contract value for the year. IP indemnity only.
        Clinical errors explicitly excluded.
      "3": >
        Liability capped at 1-2x annual contract value. IP indemnity plus
        regulatory indemnity for defects in the product's cleared indications.
      "4": >
        Liability capped at 2-3x annual contract value or a fixed floor,
        whichever is higher. Indemnity covers IP, regulatory, and data breach
        claims. Bias claims not excluded by default.
      "5": >
        Liability caps reflect realistic clinical harm exposure, not a fee
        multiple. Indemnity includes IP, regulatory, data breach, and
        bias-related claims. Vendor carries cyber and product liability
        insurance at stated limits and names the customer as additional
        insured.
    probes:
      - "What is the liability cap, and is it expressed as a fee multiple or an absolute dollar amount?"
      - "Which claims are covered by indemnification (IP, regulatory, data breach, bias-related, clinical error)?"
      - "What product liability and cyber insurance does the vendor carry, at what limits, and is the customer an additional insured?"
      - "Are there exclusions that would leave the customer exposed for the most likely harms (e.g., subgroup underperformance leading to disparate impact claims)?"

  - id: privacy_and_data_use
    name: Privacy, PHI handling, and secondary data use
    weight: 0.10
    description: >
      Whether the vendor's PHI handling, BAA terms, subprocessor disclosures,
      and rights to use customer data for model retraining or commercial
      purposes are acceptable and fully disclosed.
    scoring_anchors:
      "1": >
        No BAA, or BAA with aggressive secondary-use rights permitting
        retraining on customer PHI without consent. Subprocessors undisclosed.
      "2": >
        BAA present but grants vendor broad rights to use de-identified data
        for any purpose. Subprocessor list available only under NDA.
      "3": >
        BAA conforming to HIPAA. De-identified data use restricted to product
        improvement. Subprocessor list disclosed at contract signing.
      "4": >
        BAA plus customer opt-in required for any data use beyond model
        inference. Subprocessor list maintained and customer notified of
        changes. Data residency documented.
      "5": >
        BAA plus fully transparent data flow documentation. No secondary use
        without explicit customer opt-in. Subprocessor changes require advance
        notice with opt-out rights. Data residency contractually guaranteed.
    probes:
      - "Does the BAA grant the vendor any rights to use customer PHI or de-identified data for retraining, benchmarking, or commercial purposes?"
      - "Is the subprocessor list disclosed at signing, and are customers notified of changes with an opt-out?"
      - "Where is customer data stored, processed, and backed up, and are those locations contractually bound?"
      - "If the customer opts out of secondary data use, does product functionality degrade in any documented way?"

---

# ============================================================
# Safety
# ============================================================

id: safety
name: Safety
version: "0.1"
description: >
  Evaluates whether the vendor's product is safe to deploy in the customer's
  specific clinical environment. Safety is evaluated against the quality of the
  clinical evidence, the rigor of bias and equity testing, the characterization
  of failure modes, the adequacy of post-market surveillance, and the burden
  the tool places on clinicians. A product can be FDA cleared and still fail
  this rubric.

criteria:
  - id: clinical_evidence_quality
    name: Clinical evidence quality and generalizability
    weight: 0.28
    description: >
      Whether the peer-reviewed evidence supports the intended deployment -
      study design rigor, population representativeness, endpoint relevance,
      and the size of the gap between the validation cohort and the customer's
      patient population and care setting.
    scoring_anchors:
      "1": >
        No peer-reviewed studies, or only single-site retrospective studies
        with small samples. Endpoints are technical (AUC on a held-out set)
        rather than clinically meaningful. Population bears little resemblance
        to the intended deployment.
      "2": >
        One or two retrospective studies, single-center or narrow multi-center.
        Population is academic-tertiary, not community or rural. Endpoints do
        not include time-to-treatment or patient outcomes.
      "3": >
        Multi-center retrospective evidence with at least one prospective
        study. Population includes some community settings but with limited
        demographic diversity. Endpoints include operational metrics
        (notification latency, reader agreement) and at least one clinical
        process measure.
      "4": >
        Multi-center prospective evidence with demographic diversity
        approaching the customer's population. Endpoints include clinical
        process measures and at least preliminary outcome data. Performance
        is consistent across the subgroups that are reported.
      "5": >
        Multi-center prospective evidence that includes community and rural
        sites, demographically representative of the customer's population.
        Endpoints include patient outcomes. Performance is reported
        disaggregated and is stable across subgroups. Evidence comes from
        independent investigators, not only vendor-sponsored work.
    minimum_threshold: 3
    probes:
      - "What is the highest-quality evidence available (prospective multi-center with patient outcomes is the standard; retrospective single-site with AUC is the floor)?"
      - "Does the validation cohort include the customer's specific populations - for Ridgeline, that means rural Medicare patients, Hispanic/Latino patients, and Native American patients from the Wind River catchment?"
      - "Do the reported endpoints include time-to-treatment, treatment rates, or clinical outcomes, or are they limited to model performance metrics?"
      - "Is there independent evidence, or is the literature dominated by vendor-affiliated investigators?"
      - "What is the gap between published performance and the expected performance at community hospital sites with non-fellowship-trained readers?"

  - id: bias_and_equity_testing
    name: Bias and equity testing
    weight: 0.22
    description: >
      Whether the vendor has conducted and disclosed fairness testing with
      disaggregated performance across race, ethnicity, age, sex, and care
      setting, and whether performance is acceptable across the customer's
      demographic mix.
    scoring_anchors:
      "1": >
        No demographic performance data. Vendor states that the model "works
        for everyone" without evidence, or declines to share disaggregated
        performance.
      "2": >
        Aggregate performance only. Vendor acknowledges subgroup data exists
        internally but will not disclose it, or has not performed fairness
        testing.
      "3": >
        Sensitivity and specificity reported by age and sex but not race or
        ethnicity. Performance differences noted but not quantified. No plan
        for subgroup monitoring in production.
      "4": >
        Disaggregated performance reported for age, sex, race, and ethnicity.
        Differences quantified. Vendor has a documented fairness testing
        methodology and commits to recurring evaluation.
      "5": >
        Full disaggregated performance including small subgroups relevant to
        the customer (e.g., Native American populations). Fairness testing
        methodology is published. Subgroup performance is monitored in
        production with customer-accessible reports. Known disparities are
        disclosed with mitigation plans.
    minimum_threshold: 2
    probes:
      - "Is sensitivity and specificity reported disaggregated by race, ethnicity, age, sex, and care setting (academic vs. community vs. rural)?"
      - "What does the vendor's training data demographic composition look like, and does it reflect the customer's patient population?"
      - "Has the vendor tested performance on the smaller subgroups relevant to the customer (e.g., Native American, Hispanic/Latino, rural elderly)?"
      - "What is the vendor's plan for monitoring subgroup performance in production, and will disaggregated reports be provided to the customer?"
      - "If subgroup performance is not acceptable, what mitigations does the vendor commit to (additional training data, threshold adjustments, flagged exclusions)?"

  - id: failure_mode_characterization
    name: Failure mode characterization
    weight: 0.18
    description: >
      Whether the vendor can describe the conditions under which the model
      fails - image quality thresholds, anatomical variants, population
      distribution shift, adversarial inputs - and whether those failure modes
      are documented in a way clinicians can act on.
    scoring_anchors:
      "1": >
        Vendor claims the model "handles all cases." Cannot produce a failure
        mode analysis. Known false positive and false negative patterns are
        undisclosed.
      "2": >
        Vendor acknowledges failure modes exist but treats them as proprietary.
        Customer cannot build a safety policy around them.
      "3": >
        Vendor provides a general list of conditions where performance degrades
        (e.g., motion artifact, non-contrast studies) but without quantitative
        thresholds. Clinicians can be trained at a high level.
      "4": >
        Vendor provides a documented failure mode catalog with quantitative
        thresholds where available (e.g., quantified performance degradation
        once image noise exceeds a specified level). Failure modes are linked
        to clinical workflow decisions.
      "5": >
        Vendor publishes a failure mode and effects analysis suitable for
        clinical use. Known high-risk scenarios are flagged at inference time
        in the UI. The catalog is updated as new failure modes are identified
        in production, and customers are notified.
    probes:
      - "What is the vendor's failure mode catalog - conditions under which sensitivity or specificity drops below an acceptable threshold?"
      - "How does the product behave when inputs fall outside the intended distribution (wrong modality, pediatric patient, non-diagnostic study)?"
      - "Are known failure modes flagged at inference time so the clinician can weight the output accordingly?"
      - "How are newly discovered failure modes communicated to customers, and is there a timeline commitment?"

  - id: post_market_surveillance
    name: Post-market surveillance and real-world performance monitoring
    weight: 0.17
    description: >
      Whether the vendor operates a post-market surveillance program that
      detects performance drift, aggregates safety signals across customers,
      and communicates findings to customers in time to act.
    scoring_anchors:
      "1": >
        No post-market surveillance beyond individual customer bug reports. No
        aggregate drift monitoring. The customer finds out about problems only
        when they surface in patient care.
      "2": >
        Vendor collects telemetry but does not share aggregate findings with
        customers. No drift monitoring commitment.
      "3": >
        Vendor runs basic drift monitoring and publishes periodic performance
        updates to customers. Safety signals are handled reactively.
      "4": >
        Vendor operates a structured post-market surveillance program including
        drift detection, subgroup monitoring, and proactive customer
        notification of safety signals with target timelines.
      "5": >
        Post-market surveillance is comparable to FDA expectations for
        Class II software devices. Aggregate performance reports are published
        quarterly. Safety signals trigger customer notifications within a
        contractually defined window. Customers receive access to their own
        site-level monitoring data.
    minimum_threshold: 2
    probes:
      - "What is the vendor's post-market surveillance program, and how does it detect performance drift over time?"
      - "How and when does the vendor notify customers of safety signals detected across the install base?"
      - "Does the customer receive site-specific monitoring data, or only vendor-aggregated reports?"
      - "What is the track record of the vendor's surveillance program (recent drift events detected, time-to-notification, remediation)?"

  - id: alert_and_output_burden
    name: Alert burden and output actionability
    weight: 0.15
    description: >
      Whether the product's alerts, notifications, and outputs support clinical
      decision-making without generating fatigue that causes clinicians to
      ignore the signal. Alert fatigue is a safety hazard, not a usability
      concern.
    scoring_anchors:
      "1": >
        High false-positive rate in practice. Alerts cannot be tuned by site
        or user role. No data on alert override rates or adoption in the
        install base.
      "2": >
        Alerts tunable only at the global level. Vendor cannot produce override
        rate data from comparable deployments.
      "3": >
        Alerts tunable by site and role. Override rate data available on
        request but not benchmarked. Documented false positive rate roughly
        matches published specificity.
      "4": >
        Alerts tunable by site, role, and time-of-day. Override rates tracked
        and benchmarked across the install base. False positive rate in
        production is within a stated tolerance of published specificity.
      "5": >
        Alerts are designed with clinical workflow input. Override rates,
        acknowledgement rates, and time-to-action are monitored in production
        and shared with customers. Thresholds are adjustable with documented
        consequences. False positive rate in production is contractually
        bounded.
    probes:
      - "What is the false positive rate in production across the vendor's install base, and how does it compare to the published specificity from validation studies?"
      - "Can alert thresholds be tuned by site, by clinician role, and by time-of-day or shift?"
      - "What is the median alert override rate in the install base, and how does that track with clinical adoption?"
      - "For the customer's specific workflow (e.g., rural community hospital with ED physicians reading overnight), what alert cadence and override rate should be expected?"

---

# ============================================================
# Implementation
# ============================================================

id: implementation
name: Implementation
version: "0.1"
description: >
  Evaluates whether the vendor's product can actually be deployed in the
  customer's environment at the scope and pace required. Most failed AI
  deployments are not technology failures - they are implementation failures:
  integration underestimated, workflow redesigned poorly, training shipped
  late, rollback unplanned. This rubric forces the evaluator to estimate real
  implementation cost and surface the constraints that determine whether the
  deployment succeeds at every site, not just the flagship.

criteria:
  - id: ehr_integration_architecture
    name: EHR and PACS integration architecture
    weight: 0.22
    description: >
      Whether the integration with the customer's EHR, PACS, and imaging
      infrastructure is based on supported, standards-conformant interfaces and
      whether it scales to all deployment sites without per-site custom work.
    scoring_anchors:
      "1": >
        Integration relies on screen-scraping, unsupported APIs, or
        vendor-specific connectors that require per-site customization. No
        FHIR, no DICOMweb, no HL7 v2 conformance statement.
      "2": >
        HL7 v2 interfaces only. No FHIR R4 support. PACS integration requires
        vendor-side customization per PACS product. Epic App Orchard /
        Showroom listing absent.
      "3": >
        HL7 v2 plus partial FHIR R4 support. DICOMweb supported. Listed in
        Epic's App Orchard / Showroom but with caveats. Single-PACS
        environments straightforward; multi-PACS requires custom work.
      "4": >
        FHIR R4 and DICOMweb fully supported. Listed as an approved Epic
        integration with documented connection patterns. Supports multi-PACS
        environments with standard configuration.
      "5": >
        Production-proven Epic integration at systems of comparable complexity
        and scale. FHIR R4 / DICOMweb / HL7 v2 all supported. Multi-PACS
        deployments (e.g., Sectra + Fuji Synapse) are documented reference
        architectures, not one-off builds.
    minimum_threshold: 2
    probes:
      - "Which specific Epic integration pattern does the vendor use (App Orchard / Showroom, Bridges, direct HL7/FHIR), and is the customer's Epic version supported?"
      - "Does the vendor support both PACS in the customer's environment (e.g., Sectra at the flagship, Fuji Synapse at community hospitals) without per-PACS custom work?"
      - "Are FHIR R4 and DICOMweb conformance statements available, and which resources/services are supported?"
      - "What reference deployments exist at health systems of comparable size and architectural complexity?"

  - id: workflow_fit
    name: Clinical workflow fit across all sites
    weight: 0.20
    description: >
      Whether the vendor's workflow assumptions match the customer's
      flagship AND community / rural site realities, including the mix of
      fellowship-trained vs. general radiologists, nursing-driven stroke code
      workflows, and overnight coverage patterns.
    scoring_anchors:
      "1": >
        Workflow designed for academic tertiary setting only. Product assumes
        fellowship-trained readers, in-house neurointerventional coverage, and
        continuous radiology staffing. No workflow guidance for community or
        rural sites.
      "2": >
        Generic workflow documentation. Vendor has no specific configuration
        for community or rural deployments. Training materials assume one user
        profile.
      "3": >
        Workflow configurable for community settings but not rural.
        Documentation acknowledges reader diversity but configurations are
        limited. Nursing workflows are supported but not co-designed.
      "4": >
        Documented configurations for flagship, community, and rural settings.
        Reader profiles (neuroradiology, general radiology, ED physician
        overread) are supported. Nursing workflow designs include stroke
        coordinator dashboards and mobile notification patterns.
      "5": >
        Workflow has been deployed in comparable multi-site, multi-setting
        systems with published operational results. Rural transfer workflows,
        community-to-tertiary communication patterns, and non-fellowship
        reader overreads are first-class design considerations. Vendor
        provides a workflow playbook calibrated to the customer's
        configuration.
    minimum_threshold: 3
    probes:
      - "Has the vendor deployed at health systems with a mix of tertiary, community, and rural sites, and what did the workflow look like at each tier?"
      - "How does the product support non-fellowship-trained radiologists and ED physician overreads in community hospitals?"
      - "What is the mobile notification and stroke coordinator workflow, and how does it behave when the on-call neurointerventionalist is off-network or off-hours?"
      - "How does the workflow handle cases originating at rural sites that require transfer - does the notification propagate to the receiving tertiary team?"

  - id: training_and_change_management
    name: Training, change management, and clinical adoption support
    weight: 0.18
    description: >
      Whether the vendor provides role-specific training for all user
      populations, ongoing adoption support, and change management resources
      sufficient to move beyond initial go-live into sustained use.
    scoring_anchors:
      "1": >
        Training consists of a one-hour webinar and a PDF. No role-specific
        materials. No post-go-live support beyond a ticket queue.
      "2": >
        Role-specific training for radiology only. Nursing and ED physician
        training is generic. Adoption support is passive.
      "3": >
        Role-specific training for radiology, nursing, and ED physicians.
        Go-live support on-site for the flagship. Community-site training is
        remote-only.
      "4": >
        Training tailored to each user role and site tier. On-site go-live
        support at flagship and community sites. Documented adoption
        measurement with checkpoints at 30/60/90 days.
      "5": >
        Embedded clinical transformation specialist assigned to the customer.
        Training curricula for every user role, refreshed with product
        updates. Adoption dashboards shared with the customer. Can reference
        comparable deployments where adoption was measured and sustained.
    probes:
      - "What training is provided for each user role (neuroradiologist, general radiologist, ED physician, stroke coordinator, nursing, IT operations)?"
      - "What on-site support is provided at go-live for flagship vs. community vs. rural sites?"
      - "How is adoption measured after go-live, and what remediation happens if a site is not adopting?"
      - "Can the vendor produce adoption curves from comparable deployments, segmented by site tier and user role?"

  - id: it_burden_and_infra_fit
    name: IT burden, infrastructure fit, and network constraints
    weight: 0.15
    description: >
      Whether the deployment effort (FTEs, calendar time, infrastructure
      requirements) is realistic given the customer's IT capacity, including
      network constraints at rural sites.
    scoring_anchors:
      "1": >
        Requires on-premise GPU infrastructure that the customer does not
        have. Bandwidth requirements exceed rural site capacity. Implementation
        estimate is undefined or clearly understated.
      "2": >
        Cloud-hosted but bandwidth-hungry. Rural sites would require network
        upgrades not scoped in the proposal. Implementation estimate is
        vendor-only and does not account for customer FTEs.
      "3": >
        Cloud-hosted with documented bandwidth requirements that are within
        the customer's rural site capacity. Implementation estimate includes
        customer FTE needs but only at a total level.
      "4": >
        Cloud-hosted with tested bandwidth requirements at comparable rural
        sites. Implementation plan is per-site with customer FTE allocations
        and calendar time. Failover behavior for network degradation is
        documented.
      "5": >
        Deployment architecture has been validated at rural sites with
        bandwidth profiles comparable to or worse than the customer's. Clear
        degraded-mode behavior when the network is impaired. Implementation plan
        is per-site with FTE and calendar commitments the vendor stands
        behind.
    probes:
      - "What are the network bandwidth and latency requirements at each site tier, and have they been validated at rural deployments comparable to the customer's Wyoming and Kansas sites?"
      - "What is the per-site implementation estimate in customer IT FTEs and calendar weeks, broken out by flagship, community, and rural tiers?"
      - "What infrastructure (on-prem or cloud) is required, and what is the degraded-mode behavior when the network is impaired?"
      - "Is the customer expected to procure or upgrade any infrastructure (GPU, VPN, WAN capacity), and is that scoped in the proposal?"

  - id: migration_and_rollback
    name: Migration and rollback path
    weight: 0.13
    description: >
      Whether there is a tested path to roll back the deployment if the tool
      underperforms, and whether migration from an incumbent product can be
      executed without gaps in clinical coverage.
    scoring_anchors:
      "1": >
        No rollback plan. Decommissioning an incumbent requires a hard cutover
        with no parallel-run. Customer loses incumbent functionality if rollout
        fails.
      "2": >
        Rollback documented at a conceptual level only. No tested procedure.
        Parallel-run with an incumbent product is not supported.
      "3": >
        Documented rollback procedure. Parallel-run with incumbent supported
        for a limited period. Site-level rollback possible.
      "4": >
        Tested rollback procedure with documented duration and clinical
        coverage guarantees. Parallel-run supported for a customer-defined
        window. Per-site and per-module rollback supported.
      "5": >
        Rollback procedure validated at comparable deployments. Parallel-run
        with incumbent is standard practice. Contractually, the customer can
        revert without penalty during a defined stabilization window with
        vendor support throughout.
    probes:
      - "What is the documented rollback procedure, and how long does it take to execute per site?"
      - "Can the customer run the new tool in parallel with an incumbent (e.g., Aidoc) during stabilization, and for how long?"
      - "If the customer decommissions an incumbent to deploy this product, what is the migration plan including training, workflow rebuild, and coverage continuity?"
      - "Is there a contractual off-ramp if the product does not meet defined thresholds during a stabilization window?"

  - id: tco_realism
    name: Total cost of ownership realism
    weight: 0.12
    description: >
      Whether the vendor's cost proposal reflects true total cost of ownership
      over the contract term, including implementation, training, workflow
      redesign, network upgrades, and incumbent decommissioning costs.
    scoring_anchors:
      "1": >
        Subscription price only. Implementation, training, and network costs
        are omitted or described as "included" without itemization.
      "2": >
        Subscription plus one-time implementation fee. No itemization of
        customer-side costs. No escalator transparency.
      "3": >
        Subscription, implementation, and training itemized. Escalator
        disclosed. Customer-side labor is acknowledged at a total level.
      "4": >
        Five-year TCO model with itemized subscription (and escalator),
        implementation, training, network upgrades, and customer-side FTEs.
        Sensitivity to plausible scenarios (added sites, renewal escalation)
        is modeled.
      "5": >
        Five-year TCO model that the customer's CFO can interrogate and
        audit, including
        incumbent decommissioning costs where applicable, sensitivity analysis
        on system expansion and renewal price increases, and benchmarking
        against comparable deployments.
    probes:
      - "What is the five-year TCO including subscription, implementation, training, network upgrades, and customer-side FTEs?"
      - "What is the renewal escalator, and is it capped?"
      - "If the customer adds or removes sites during the contract term, how does the price adjust?"
      - "For replacement scenarios, what are the estimated costs of decommissioning the incumbent product (contract exit, migration, retraining)?"

---

# ============================================================
# Auditability
# ============================================================

id: auditability
name: Auditability
version: "0.1"
description: >
  Evaluates whether the customer will be able to answer, during an adverse
  event investigation, regulatory inspection, or internal audit: what did the
  model see, what did it output, when, to whom, on what version, and what did
  the clinician do with it. Auditability is where AI governance commitments
  either hold up or collapse. A product that cannot reconstruct its own
  decisions creates unbounded liability for the operator.

criteria:
  - id: model_transparency
    name: Model transparency and documentation
    weight: 0.25
    description: >
      Whether the vendor provides a model card, training data documentation,
      versioning information, and clinical interpretation guidance sufficient
      for the customer to explain the tool to a regulator or a patient's
      family.
    scoring_anchors:
      "1": >
        No model card. Training data undocumented. Version numbers are not
        visible to the customer. The customer cannot explain how the model
        works beyond marketing language.
      "2": >
        Basic marketing-style description. Version tracked internally by the
        vendor but not surfaced in outputs. Training data described at a high
        level only.
      "3": >
        Model card with intended use, performance, and known limitations.
        Training data described in aggregate (size, time window, site types).
        Versions visible in outputs.
      "4": >
        Model card conformant with HTI-1 predictive DSI source attributes.
        Training data documented including demographic composition. Version
        history published and surfaced on every output.
      "5": >
        Comprehensive model card including intended use, training/validation
        populations, known failure modes, fairness testing results, and
        clinical interpretation guidance. Full version history, change
        rationale, and model-output provenance accessible to the customer.
        Documentation is written for clinical and compliance audiences, not
        just data scientists.
    minimum_threshold: 3
    probes:
      - "Is a model card available that covers intended use, training/validation populations, performance, limitations, and fairness testing?"
      - "Is the training data demographic composition documented in sufficient detail to assess applicability to the customer's population?"
      - "Are model versions surfaced on every output, and is the full version history accessible to the customer?"
      - "Can the vendor map their documentation to HTI-1 predictive DSI source attributes without a custom engagement?"

  - id: logging_and_event_capture
    name: Logging and event capture
    weight: 0.22
    description: >
      Whether the product logs every model inference with inputs, outputs,
      version, timestamp, user context, and downstream clinician action, and
      whether those logs are immutable, queryable, and retained for the
      customer's required period.
    scoring_anchors:
      "1": >
        Inference logs retained by the vendor only, access subject to vendor
        discretion. No customer-accessible query interface. Logs may be
        overwritten.
      "2": >
        Logs accessible to the customer via support ticket. Limited detail
        (inference-level only, no user context or downstream action).
        Retention period under one year.
      "3": >
        Logs accessible to the customer via an interface. Capture inputs,
        outputs, version, and timestamp. Retention 12-24 months. Immutability
        claimed but not demonstrated.
      "4": >
        Customer-accessible logs capturing inputs (or cryptographic hashes),
        outputs, version, user, timestamp, and clinician action (acknowledged,
        overridden, escalated). Retention configurable up to 7 years.
        Immutability documented.
      "5": >
        Logs are comprehensive, immutable (append-only with integrity
        verification), exportable in standard formats, and retained per
        customer policy up to the longest applicable regulatory retention
        period. Capture extends to integration touchpoints and clinician
        workflow actions. Logs are queryable by the customer's audit team
        directly.
    minimum_threshold: 3
    probes:
      - "What is logged per inference - inputs, outputs, model version, user, timestamp, downstream clinician action?"
      - "How are logs protected from tampering, and can integrity be demonstrated?"
      - "What is the default retention period, and is it configurable to meet the customer's policy (state statutes often require 7+ years for clinical records)?"
      - "Can the customer's audit or compliance team query logs directly, and in what format are they exportable?"

  - id: monitoring_and_drift_detection
    name: Production monitoring and drift detection
    weight: 0.20
    description: >
      Whether the product continuously monitors performance in production,
      detects distributional drift and subgroup underperformance, and surfaces
      actionable signals to the customer before clinicians encounter them in
      patient care.
    scoring_anchors:
      "1": >
        No production monitoring beyond uptime. No drift detection. Customer
        discovers performance issues through adverse events.
      "2": >
        Vendor-side monitoring only. Aggregate metrics published infrequently.
        Customer cannot see site-specific performance.
      "3": >
        Site-level monitoring available on request. Drift detection
        implemented for aggregate performance. No subgroup drift detection.
      "4": >
        Customer-accessible monitoring dashboards with site, user, and subgroup
        drift detection. Alert thresholds configurable. Signals include
        proposed remediation.
      "5": >
        Continuous monitoring with site-level, subgroup-level, and
        workflow-level drift detection. Contractual commitment to notify the
        customer within a defined window when drift exceeds thresholds.
        Customer can configure monitoring to feed their internal AI
        governance reporting.
    minimum_threshold: 2
    probes:
      - "What performance metrics are monitored continuously in production, and at what granularity (site, user, subgroup)?"
      - "How is distributional drift detected, and what thresholds trigger a customer notification?"
      - "Are monitoring dashboards directly accessible to the customer, or mediated through vendor reports?"
      - "What is the contractual commitment on notification latency when drift is detected?"

  - id: incident_response_readiness
    name: Incident response readiness
    weight: 0.18
    description: >
      Whether the vendor has a tested incident response process for
      AI-specific events (model misbehavior, safety signals, demographic
      underperformance, data incidents) and can support the customer through
      an investigation, regulatory inquiry, or MDR filing.
    scoring_anchors:
      "1": >
        Standard SaaS incident response only. No AI-specific playbooks. Vendor
        has not supported a customer through a regulatory inquiry or MDR
        filing.
      "2": >
        General incident response playbook includes AI events but no tested
        procedures. Vendor has limited experience with clinical AI incident
        investigations.
      "3": >
        Documented AI incident response with customer notification SLAs. Root
        cause analysis supported. Some MDR / regulatory inquiry experience.
      "4": >
        Tested AI incident response with defined roles, customer notification
        SLAs, and root cause analysis deliverables. Vendor has supported MDR
        filings and regulatory inquiries and can produce evidence of that
        support.
      "5": >
        Mature AI incident response program with tabletop exercises, defined
        customer-side integration (incident commander roles, clinical safety
        officer interfaces), documented root cause analysis quality, and a
        track record of supporting regulatory inquiries and MDR filings for
        customers. Incident transparency reports published.
    probes:
      - "What is the AI-specific incident response playbook - what qualifies as an AI incident, and what are the notification SLAs to the customer?"
      - "Has the vendor supported a customer through an MDR filing or FDA inquiry, and can they produce evidence of that support?"
      - "What root cause analysis deliverable does the customer receive after an incident, and in what timeframe?"
      - "Does the vendor participate in customer-side incident tabletop exercises, and will they commit to that in the contract?"

  - id: data_lineage_and_retention
    name: Data lineage, retention, and portability
    weight: 0.15
    description: >
      Whether the customer can trace every model output back to the source
      data, retain outputs per the customer's legal and clinical record
      policies, and export all data (inputs, outputs, logs, model versions)
      at termination.
    scoring_anchors:
      "1": >
        No data lineage. Outputs cannot be traced back to source studies with
        certainty. Retention at vendor discretion. No export at termination
        or export is in a proprietary format.
      "2": >
        Basic lineage (study ID to output) but no model-version tie-in.
        Retention configurable but capped below customer policy. Export
        available but limited.
      "3": >
        Lineage covers study ID, model version, and output. Retention
        configurable to meet customer policy. Export at termination is
        available in documented formats.
      "4": >
        End-to-end lineage including input data, preprocessing, model version,
        output, and downstream workflow action. Retention configurable to the
        longest applicable regulatory period. Export includes logs and
        history.
      "5": >
        End-to-end lineage with cryptographic integrity. Retention
        configurable and contractually guaranteed. At termination, full
        export including inputs (or hashes), outputs, logs, model versions,
        and monitoring history in standard formats with a defined transition
        period.
    probes:
      - "Can every output be traced back to the specific input study and the specific model version, and is that lineage available to the customer's audit team?"
      - "What retention periods are supported, and can they be configured to meet the customer's clinical record policy (often 7-10 years or longer)?"
      - "At termination, what data does the customer receive back - inputs or hashes, outputs, logs, model versions, monitoring history - and in what formats?"
      - "Is there a defined transition period during which the vendor supports customer access to historical data after termination?"

---
