
Metrology for AI in Medical Decision-Making

1 September 2023

Arvind Rao, University of Michigan Ann Arbor

    There has been a renaissance in the use of artificial intelligence for radiology and pathology applications. AI systems in these complex applications are multi-component pipelines comprising data-processing, model-learning, and inferential modules/layers, with potential feedback/control loops for efficient optimization of the learning task. In addition, there is almost always an element of AI-human chaining or collaboration, involving the transfer of an AI-generated decision to humans for downstream decision-making.

    To ensure the reliability of such pipelines and their stable operating characteristics, the pipelines and their individual modules need to be audited. In contrast to traditional QA/QC assessment procedures for large-scale industrial or software systems, AI-based systems present a unique challenge: They evolve continuously as a function of incoming data. Quality assessment of these AI pipelines is further challenged by asymmetric costs and high-stakes failure modes that, if encountered, can lead to incorrect decisions with catastrophic consequences in the health care setting.


    In radiology, AI is based on imaging data drawn from radiographic modalities such as X-rays, mammograms, computed tomography, and magnetic resonance imaging, each with its own physics of data acquisition. Batch effects arising from instrument variation, vendor/manufacturer, site-to-site differences in acquisition protocols, and device physics have a nonlinear impact on signal fidelity and signal-to-noise ratio, with downstream consequences for image interpretation. For automated image interpretation, intensity normalization is an essential pre-processing step prior to model training. With these aspects in mind, the suggestion of ‘datasheets for imaging data’ was explored, following the popular paradigm of “Datasheets for Datasets.” This rubric describes image data provenance, alongside aspects of image acquisition and data interpretation: the purpose for which the data set was created (to potentially prevent ‘off-label’ use); its composition (types of instances: continuous, binary, or categorical; missingness structure; sources of error; etc.); collection aspects, including consent; pre-processing procedures; distribution licensing; and data update/maintenance procedures. The last element is even more relevant in the current era of generative AI (e.g., large language models built on huge corpora of multi-modal data). Tools like ChatGPT seem capable of demonstrating competence on exams like the United States Medical Licensing Examination; however, their successful translation into safety-critical enterprises like health and wellness will require deep, real-time auditing of their constituent ingredients (be it their training data or the models themselves), as well as of their associated uncertainty characteristics, to ensure fruitful translation into the hands of the clinicians who treat patients.
The use of generative AI models to generate synthetic data for augmenting training scenarios presents a double-edged sword in this regard, raising challenges at the privacy, security, noise-injection, and model-overfitting levels. All these elements require deep involvement of creators and consumers to ensure that model deployments using these data sets are calibrated against continuously varying distributions of features and labels. Thus, standards for principled data credentialing are essential if one is to ascribe any notion of confidence to inferences from AI models built on these (evolving) data sets.
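To make the data-credentialing idea concrete, here is a minimal, hypothetical sketch of a ‘datasheet for imaging data’ record in the spirit of the “Datasheets for Datasets” paradigm. The field names are illustrative assumptions, not a published standard:

```python
# A hypothetical, minimal "datasheet for imaging data" record.
# Field names are illustrative assumptions, not a published schema.
from dataclasses import dataclass, field

@dataclass
class ImagingDatasheet:
    purpose: str                   # intended use, to discourage 'off-label' use
    modality: str                  # e.g., "MRI", "CT", "mammography"
    acquisition_protocol: str      # site/vendor/protocol notes (batch effects)
    composition: dict              # instance types, missingness, error sources
    consent_obtained: bool         # collection/consent provenance
    preprocessing: list = field(default_factory=list)
    license: str = "unspecified"   # distribution licensing
    last_updated: str = "unknown"  # update/maintenance procedures

sheet = ImagingDatasheet(
    purpose="tumor segmentation research only",
    modality="MRI",
    acquisition_protocol="3T, T1-weighted, site A",
    composition={"labels": "binary masks", "missingness": "5% slices absent"},
    consent_obtained=True,
    preprocessing=["z-score intensity normalization"],
)
```

A structured record of this kind could be version-controlled alongside the data set itself, so that the update/maintenance trail survives as the data evolve.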


    The advent of feature-based ‘radiomic’ models, and more recently a slew of deep learning models that predict tumor regions or molecular alterations from an image, has led to high-parameter characterization of relationships between image-derived intensities and outcomes. To ensure models are used in the right setting (correct domain, population matched to the training set, matched training-test distributions), model scorecards have been proposed. These scorecards describe not just the model class used to fit relationships, but also specific information about where the model is likely to perform ‘in scope.’ The wide variation in parametrization space for each model parameter, coupled with the large number of parameters in these models (in the millions!), can present a significant challenge for describing model uncertainty. Scorecards aim to describe the taxonomy of systematic biases that could be encountered during the deployment phase, as well as to provide a way to systematically stress-test model credibility under a variety of settings. They also capture critical information about the training data, evaluation data, associated confounding factors, cost functions for model optimization, and evaluation metrics. If well designed, these have the potential to reveal the operating regimes where a model is or is not suitable. Both traditional metrics and ethical perspectives can be enmeshed in these evaluations. This set of specifications can not only illustrate model risks vs. benefits, but also offer valuable insights into model transparency.
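One operational check a scorecard might specify is whether deployment data match the training distribution. A toy sketch of such an ‘in-scope’ test, using a two-sample Kolmogorov-Smirnov statistic on a single feature (the 0.2 threshold is an illustrative assumption, not a recommended value):

```python
# Toy "in-scope" check: compare a deployment feature sample against the
# training sample via the two-sample Kolmogorov-Smirnov statistic.
# The 0.2 drift threshold is an illustrative assumption.

def ks_statistic(a, b):
    """Maximum gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    grid = sorted(set(a) | set(b))
    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in grid)

def in_scope(train_sample, deploy_sample, threshold=0.2):
    """Flag whether the deployment sample is close enough to training."""
    return ks_statistic(train_sample, deploy_sample) <= threshold

train = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
shifted = [0.7, 0.8, 0.9, 1.0, 1.1, 1.2]  # fully disjoint support
flag_same = in_scope(train, train)        # True: no drift
flag_shift = in_scope(train, shifted)     # False: out of scope
```

Real scorecards would of course use multivariate drift measures and calibrated thresholds, but the principle, declaring and testing the operating regime rather than assuming it, is the same.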


    Inferential assessments in model use can involve conversations around calibrated inference and prediction intervals accompanying AI-generated predictions. Risk-asymmetric settings like health care, where catastrophic false-positive or false-negative predictions carry high costs, call for cautious optimism in the use of AI engines for clinical prediction. Statistical tools like conformal inference and uncertainty quantification will help ensure that AI-generated predictions are communicated to the human decision maker in a way that supports appropriate decisions. These have important consequences for ‘trust in AI’ considerations as well. An additional avenue, termed “adversarial robustness,” assesses drift in predictions due to noise injected into the input data, to understand the reliability of predictions as a function of change in the input signal. This is a field of active study, with implications for causal inference in the context of complex, high-dimensional models.
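The conformal-inference idea can be illustrated with a minimal split-conformal sketch: hold out a calibration set, and use its residual quantile to wrap any point predictor in a finite-sample prediction interval. The doubling predictor and the calibration numbers below are toy assumptions:

```python
# Minimal split-conformal prediction interval: calibration residuals
# determine the half-width of a (1 - alpha) interval around any predictor.
import math

def conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Return a (1 - alpha) prediction interval for predict(x_new)."""
    residuals = sorted(abs(y - predict(x)) for x, y in zip(calib_x, calib_y))
    n = len(residuals)
    # conservative finite-sample quantile index (0-based)
    k = min(n - 1, math.ceil((n + 1) * (1 - alpha)) - 1)
    q = residuals[k]
    y_hat = predict(x_new)
    return y_hat - q, y_hat + q

# Toy predictor and calibration data (illustrative assumptions).
predict = lambda x: 2.0 * x
calib_x = [1.0, 2.0, 3.0, 4.0, 5.0]
calib_y = [2.1, 3.8, 6.3, 8.1, 9.7]
lo, hi = conformal_interval(predict, calib_x, calib_y, 6.0)
```

The appeal for risk-asymmetric settings is that the coverage guarantee holds without distributional assumptions on the predictor, so the interval itself can be communicated to the clinician alongside the point prediction.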

    In the specific case study presented in the JSM session, we studied AI for tumor segmentation problems using these elements—covering data, model and inference aspects. We found that evaluation metrics can vary significantly by data set quality.

    Models trained on cleaned/curated training data can have different performance characteristics in noisy or uncurated evaluation sets. Similarly, uncertainty quantification accompanying predictions from models trained on curated data can be quite different in interpretation compared to models trained in the wild. Noise can manifest in the data space or in the quality of the labels used for model training, and curation can involve management of any or all of these aspects. This issue becomes even more prominent in the realm of federated or distributed learning in radiology-AI, since privacy (e.g., HIPAA) considerations preclude model training via data aggregation at centralized sites, relying instead on distributed training protocols in which the training data at each site may have different noise characteristics.

    Model performance and generalization are affected by training data parameters, batch effects, and noise characteristics. Models based on uncurated data exhibit lower sensitivity to variations in image quality, as well as superior robustness to adversarial perturbation, compared with models trained on curated data. On the other hand, models based on curated/cleaned data exhibit superior training metrics, better agreement with ground truth in (curated) evaluation sets, and tighter uncertainty intervals. This again demonstrates the ‘no free lunch’ idea: curated and uncurated models have their relative advantages and disadvantages, requiring the collaboration of the model developer, data set creator, and domain expert to determine the appropriate use cases for any specific model deployment. These findings have implications for ‘algorithm change protocols’ and update strategies, which can lead to performance changes (i.e., gains or losses) in almost real time. They also affect the stability of the operating regimes of these AI engines and need to be factored into standardization efforts.
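The robustness probes discussed above can be sketched in miniature: perturb the input with small noise and measure how much a segmentation drifts, via the Dice coefficient. The thresholding ‘model’ here is an illustrative stand-in for a real segmentation network, and the noise scale is an assumption:

```python
# Noise-injection robustness probe for a toy segmentation "model":
# measure Dice overlap between the segmentation of an image and the
# segmentation of a slightly perturbed copy. A Dice near 1.0 suggests
# the model's output is stable under that perturbation level.
import random

def dice(a, b):
    """Dice overlap between two binary masks (flat lists of 0/1)."""
    inter = sum(x * y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2 * inter / total if total else 1.0

def segment(image, threshold=0.5):
    """Stand-in segmenter: thresholds intensities into a binary mask."""
    return [1 if p > threshold else 0 for p in image]

random.seed(0)
image = [random.random() for _ in range(1000)]          # toy intensities
noisy = [p + random.gauss(0, 0.01) for p in image]      # injected noise

drift = dice(segment(image), segment(noisy))  # close to 1.0 => robust
```

The same probe applied at increasing noise levels traces out a robustness curve, which is one concrete way to compare the curated-data and uncurated-data models discussed above.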

    Human-AI Communication and Feedback

    The final element is the link from AI-derived predictions to the human for downstream decision-making. The rubric articulated in “‘Hello AI’: Uncovering the Onboarding Needs of Medical Practitioners for Human-AI Collaborative Decision-Making” describes the design objectives of the AI system, its capabilities and limitations, and the interpretation of thresholds used for decision-making. The rubric also describes the interpretation of metrics like accuracy and true and false positive rates, including the costs incurred in the event of classification error. Diagnostics for performance aspects of the ML model (efficiency, consistency, diversity, generalizability, uncertainty quantification, reporting of prediction intervals) also ought to be described, along with adoption considerations like regulatory approval, peer-reviewed publications supporting the model’s construction, legal liability, and impact on clinician workflows. Finally, aspects informing human collaboration with AI assistants, like differential human attention to areas of AI weakness, conflict resolution, trust and interpretability, factoring in the AI’s subjectivity to disease upgrading or downgrading, and collaboration modes attuned to the specific objectives the AI is optimized for, are essential ingredients for facilitating collaborative decision-making in these hybrid (i.e., human-AI) workflows. Another key aspect of system design is the consideration of ‘feedback loops’ from intermediate points along the AI pipeline, for refinement/tuning of data updating, model training, or inferential augmentation. These feedback loops can inadvertently inject noise into these systems, even leading to deterioration in performance (e.g., the ChatGPT system reportedly getting worse over time) due to a variety of factors, including irrelevant/erroneous human feedback, misinformation, or the introduction of ‘hallucinated’ data updates.
These considerations are essential to ensure stable operating characteristics, along with keystone principles like reliability and reproducibility. They present important layers for quantifying the AI-human collaboration cycle, requiring significant input from experts in human-computer interaction and human-AI teaming.

    Consequences for Design and Directions Forward

    Given the widespread commoditization of AI, coupled with the availability of AI widgets that enable one to rapidly spawn AI instances for different use cases (radiology, pathology, etc.) at the enterprise level, there is a case for cautious optimism in mission-critical, risk-asymmetric scenarios like medical decision-making. Many professional societies (like the American College of Radiology and the Association for Computing Machinery), public institutions such as the FDA, the National Academies, and NIST, and international policy bodies (e.g., EU think tanks) have created AI policy guidance for the safe and responsible deployment of AI in different scenarios. Development of taxonomies of failure modes in different use scenarios, credentialing rubrics for data/model/inference/feedback components, and the study of cascading failures across these elements will be key components of ‘failure-safe’ design procedures, as opposed to ‘success-based’ design thinking. Drifts in data distributions, model parametrizations, inference credibility, etc., have the potential to create catastrophic breakdowns, not just at the decision-theoretic level, but at the policy level, with one bad case study creating a ‘cautionary tale’ or misperceptions that can take decades to address. The potential for data hallucination and misinformation at the content-creation level, and the cascading of failures in multi-component pipelines, are additional considerations. As large language models become more mainstream, hallucination effects have the potential to corrupt data and alter uncertainty quantification on an almost real-time, evolving basis, creating a need for careful model and inferential credentialing. The need for standards cannot be overstated. Perhaps parallels from rubrics in safe drug design and drug ‘adverse labels’ or nutrition labels can help formulate these considerations more systematically.
This needs multi-stakeholder engagement to responsibly audit these AI instruments—drawing from quality engineers, economists, ethicists, cybersecurity experts, privacy experts, data creators, consumers, AI modelers, regulatory law experts, health practitioners, statisticians and data scientists, to name a few. While there are several ongoing efforts by various consortia on this topic, the discussion of those frameworks is outside the scope of this article.

    Emerging Role for Statisticians and Data Scientists

    It is hard to imagine the success of these endeavors without statisticians and data scientists: the multifactorial approach described above provides opportunities for statisticians in every discipline, from quality and reliability applications to uncertainty quantification, to AI modeling and human-machine interaction. As AI-based offerings evolve into utilities that can be accessed for general use, the time is ripe for fostering collaborative endeavors that engage domain scientists with statisticians to address these challenges in a principled manner.

    Further Reading

    NIST Artificial Intelligence Risk Management Framework

    Blueprint for an AI Bill of Rights: Making Automated Systems Work for the American People

    “Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)–Based Software as a Medical Device (SaMD)”
