
AI–Enabled Medical Devices and Diagnostics: Statistical Challenges and Opportunities from a Regulatory Perspective

1 September 2023

Gene Pennello and Frank Samuelson

Gene Pennello is a statistician and Frank Samuelson a physicist in the US Food and Drug Administration's Division of Imaging, Diagnostics, and Software Reliability. The division conducts research on methods for evaluating the performance of medical devices, including medical imaging systems, diagnostic tests, and medical devices enabled with artificial intelligence/machine learning algorithms. Technical disciplines of the staff include physics, electrical and biomedical engineering, mathematics and statistics, computer science, and medical radiology.

Artificial intelligence is poised to deliver important contributions to medical device application areas, including image acquisition and processing; earlier disease detection; more accurate diagnosis, prognosis, and risk assessment; identification of new observations or patterns of human physiology; development of personalized diagnostics and therapeutics; and treatment response monitoring, to name a few. However, the complexity of medical device AI algorithms, the data-driven manner in which they are trained, their rapid application to many medical areas, and the unique nature of clinical medical data (e.g., low prevalence of disease and the lack of, or difficulty in obtaining, truth data) create challenges in developing robust evaluation methods for AI devices. These challenges span clinical and nonclinical testing, as well as understanding the impact of these devices in the real world. In addition, some medical devices may employ AI algorithms designed to learn as data accumulate, which presents unique evaluation challenges.

According to the US Food and Drug Administration's public listing, at least 522 AI-enabled medical devices had been marketed in the United States as of October 5, 2022. These data, together with a 2020 review by Stan Benjamens, Pranavsingh Dhunnoo, and Bertalan Meskó in their NPJ Digital Medicine article, “The State of Artificial Intelligence–Based FDA-Approved Medical Devices and Algorithms: An Online Database,” indicate marketed AI-based medical devices are predominantly in the field of radiology, followed by cardiology and internal medicine/general practice.

In pathology, the first marketed AI device is Paige Prostate, a software device applied to digital histopathology images of prostate needle biopsies that uses a neural network to classify an image as suspicious or not for prostate cancer. When suspicious, Paige Prostate provides a single coordinate (X,Y) of the location with the highest probability of cancer for further review by a pathologist.

In radiology, medical devices incorporating AI have expanded from assisting radiologists in segmentation or detection (CADe) to quantitative imaging, computer-assisted diagnosis (CADx), triage, and multi-class classification. One example of a quantitative imaging device is the Caption Interpretation Automated Ejection Fraction software, which applies machine learning algorithms to process echocardiography images and estimate left ventricular ejection fraction.

QuantX is an early example of an AI CADx device that assists in the characterization and diagnosis of breast abnormalities. The device automatically registers, segments, and analyzes user-selected regions of interest in magnetic resonance images of the breast to extract morphological features (e.g., lesion area, sphericity, homogeneity, volume, and contrast) and radiomic features. These are then analyzed by an AI algorithm to obtain a QI score, which conveys the relative likelihood of malignancy.

Other examples are CADx devices that combine region-of-interest detection with diagnosis functions, such as OsteoDetect and FractureDetect, which use ML to analyze adult radiographs of various anatomic areas to identify and highlight potential fractures while providing additional diagnostic information.

Computer-assisted triage and notification (CADt) devices create an active notification to providers for cases identified as likely containing a time-sensitive finding, giving them the option to move such cases to the top of the reading queue. Cases unflagged by CADt devices are read without priority according to standard of care. By aiding radiologists in identifying cases that should be given high read priority, CADt devices may provide benefit when early detection of the target condition is crucial for effective intervention.

Three examples of AI CADt devices are ContaCT, BriefCase, and Viz ICH, which are applied to computed tomography angiograms of the brain or head. ContaCT notifies neurovascular specialists of a potential large vessel occlusion stroke, while BriefCase notifies hospital networks and trained radiologists of potential large vessel occlusion stroke. Viz ICH notifies hospital networks and trained clinicians of potential intracranial hemorrhage stroke.

CADt devices are evaluated not just for accuracy in detecting the target condition but also for the time saved by interpreting flagged cases earlier, which matters most for patients likely to benefit from an earlier image interpretation. Yee Lam Elim Thompson, Gary Levine, Weijie Chen, Berkman Sahiner, Qin Li, Nicholas Petrick, Jana Delfino, Miguel Lago, Qian Cao, and Frank Samuelson applied queueing theory to develop estimators of the mean time saved by CADt devices among cases with the target condition in “Evaluation of Wait Time Saving Effectiveness of Triage Algorithms.”
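The idea behind such time-saved measures can be illustrated with a toy discrete-event simulation. The assumptions here are ours, not the paper's estimators: Poisson arrivals, a single reader with exponential read times, and perfect CADt flagging of diseased cases.

```python
import heapq
import random

def mean_diseased_wait(n_cases=20000, arrival_rate=1.0, read_rate=1.2,
                       prevalence=0.1, priority=False, seed=0):
    """Single-reader queue: cases arrive (Poisson process) and are read
    one at a time (exponential read times). With priority=True, diseased
    cases -- assumed perfectly flagged by the CADt device, an idealization --
    are read before any waiting unflagged case (non-preemptive priority)."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    for _ in range(n_cases):
        t += rng.expovariate(arrival_rate)
        arrivals.append((t, rng.random() < prevalence))  # (time, diseased?)
    queue, waits, clock, i = [], [], 0.0, 0
    while i < len(arrivals) or queue:
        # admit every case that has arrived by the current clock
        while i < len(arrivals) and arrivals[i][0] <= clock:
            at, diseased = arrivals[i]
            rank = 0 if (priority and diseased) else 1  # flagged cases first
            heapq.heappush(queue, (rank, at, diseased))
            i += 1
        if not queue:                 # reader idle: jump to next arrival
            clock = arrivals[i][0]
            continue
        rank, at, diseased = heapq.heappop(queue)
        if diseased:
            waits.append(clock - at)  # wait from arrival to start of read
        clock += rng.expovariate(read_rate)
    return sum(waits) / len(waits)

fifo = mean_diseased_wait(priority=False)
prio = mean_diseased_wait(priority=True)
print(f"mean wait without CADt: {fifo:.2f}, with CADt priority: {prio:.2f}, "
      f"mean time saved for diseased cases: {fifo - prio:.2f}")
```

With the reader nearly saturated (utilization about 0.83 here), flagged cases skip a long queue, so the simulated time saved is substantial; analytic estimators like those in the paper avoid simulation entirely.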

Most marketed classification devices distinguish between two states of health: presence or absence of a target condition. Multi-class classification devices distinguish between more than two states of health. qER is a CADt device that applies classical ML and a deep convolutional neural network to voxels on brain CT to detect intracranial hemorrhage, mass effect, midline shift, and/or cranial fracture. Additionally, assays are being developed to screen for multiple cancers by detecting circulating tumor DNA in plasma samples. When a cancer is detected, ML is employed to classify its origin and focus appropriate diagnostic work-up.

Unfortunately, many developers have not yet taken full advantage of ML for multi-class classification, opting instead to train binary classifiers for each condition separately and bundle them into a single device. Challenges with evaluating multi-class classification devices include designing an efficient study for evaluating device clinical accuracy for multiple conditions—especially when some have low prevalence—evaluating the benefit-to-harm trade-offs of true and false test positives and true and false test negatives for each condition, and developing an appropriate statistical analysis plan for multiple hypothesis testing of each condition and possible combination of conditions.
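The low-prevalence difficulty can be made concrete with a small sketch: per-condition sensitivity with Wilson score intervals, Bonferroni-adjusted across conditions. The counts below are hypothetical; the point is that a rare condition contributes few cases, so its adjusted interval is far wider.

```python
import math
import statistics

def wilson_ci(k, n, alpha):
    """Wilson score confidence interval for a binomial proportion k/n."""
    z = statistics.NormalDist().inv_cdf(1 - alpha / 2)
    p = k / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# hypothetical per-condition results: (true positives, diseased cases)
results = {"hemorrhage": (180, 200), "mass effect": (90, 100),
           "rare condition": (9, 10)}
m = len(results)  # Bonferroni: split the overall alpha across conditions
cis = {name: wilson_ci(tp, n, alpha=0.05 / m)
       for name, (tp, n) in results.items()}
for name, (lo, hi) in cis.items():
    tp, n = results[name]
    print(f"{name}: sensitivity {tp/n:.2f}, adjusted CI ({lo:.2f}, {hi:.2f})")
```

All three conditions have the same observed sensitivity of 0.90, yet the rare condition's adjusted interval is several times wider, illustrating why a single study powered for a common condition may be uninformative for a rare one.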

AI is also being used to develop medical devices that quantify the risk of developing a target condition by a future time. Risk prediction models are evaluated for the quality of their predictions. Calibration refers to how well the number of events predicted agrees with the number of events observed in a prospectively sampled cohort. Risk predictions should be well calibrated; otherwise, they may lead to inappropriate clinical management. In “Calibration of Prognostic Risk Scores,” published in Wiley StatsRef, Ben Van Calster and Ewout Steyerberg define mean, weak, moderate, and strong risk calibration. An open question is which definition of risk calibration should be considered when developing acceptance criteria for validating risk predictions as well calibrated.
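A minimal calibration check might look like the following sketch, which is our illustration rather than the graded definitions of Van Calster and Steyerberg: compare total predicted versus observed events (mean calibration), and the mean predicted risk versus the observed event rate within risk quantile bins.

```python
import random

def calibration_summary(risks, outcomes, n_bins=5):
    """Mean calibration (total expected vs. observed events) plus a
    grouped check: mean predicted risk vs. observed event rate within
    each risk quantile bin."""
    pairs = sorted(zip(risks, outcomes))  # sort by predicted risk
    n = len(pairs)
    expected, observed = sum(risks), sum(outcomes)
    bins = []
    for b in range(n_bins):
        chunk = pairs[b * n // n_bins:(b + 1) * n // n_bins]
        pred = sum(r for r, _ in chunk) / len(chunk)
        obs = sum(y for _, y in chunk) / len(chunk)
        bins.append((pred, obs))
    return expected, observed, bins

# simulate a cohort whose outcomes follow the predicted risks exactly,
# so the model should appear well calibrated
rng = random.Random(1)
risks = [rng.uniform(0.01, 0.5) for _ in range(10000)]
outcomes = [1 if rng.random() < r else 0 for r in risks]
expected, observed, bins = calibration_summary(risks, outcomes)
print(f"expected events: {expected:.0f}, observed: {observed}")
for pred, obs in bins:
    print(f"mean predicted risk {pred:.3f} vs. observed rate {obs:.3f}")
```

A miscalibrated model would show a systematic gap between predicted and observed rates across bins; acceptance criteria must decide how large a gap, and at which granularity, is tolerable.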

Currently, most devices employ fixed algorithms that provide the same output each time the same input is provided. Soon, however, AI devices—particularly those employing ML—may be designed to be updated periodically or continuously as they learn from accumulating data.

How to evaluate the performance of a learning medical device is an open question. The FDA Center for Devices and Radiological Health issued the 2019 discussion paper, “Proposed Regulatory Framework for Modifications to Artificial Intelligence/Machine Learning (AI/ML)–Based Software as a Medical Device,” in which CDRH describes a potential approach for how updates to AI/ML software as a medical device, including those based on periodic or continuous learning, could be regulated. CDRH envisions using a predetermined change control plan that pre-specifies the scope of the anticipated device modifications and an algorithm change protocol specifying procedures and testing to be followed after each modification is made so the device remains safe and effective.

Post-market monitoring may be important for mitigating the risk of AI device performance deterioration. For example, performance may deteriorate as data drifts for devices with fixed ML algorithms, which may occur as populations or medical practices change over time. For continuously learning devices, an unrepresentative data stream could bias the learning process, with concomitant performance deterioration in device updates.

Jean Feng, Scott Emerson, and Noah Simon proposed statistical procedures for monitoring a sequence of device updates in “Approval Policies for Modifications to Machine Learning–Based Software as a Medical Device: A Study of Bio-Creep,” published in Biometrics. In “Monitoring Machine Learning (ML)–Based Risk Prediction Algorithms in the Presence of Confounding Medical Interventions,” Feng and Alexej Gossmann, Gene Pennello, Nicholas Petrick, Berkman Sahiner, and Romain Pirracchio proposed procedures to monitor devices for changepoints in real-world clinical accuracy and utility after adjusting for confounding medical interventions.
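As a much-simplified stand-in for such monitoring procedures (not the methods of either cited paper, and ignoring confounding medical interventions), a one-sided CUSUM on a stream of per-case error indicators can flag a changepoint in real-world error rate:

```python
def cusum_alarm(errors, target_rate, slack=0.05, threshold=5.0):
    """One-sided CUSUM on a stream of 0/1 error indicators: accumulate
    error in excess of (target_rate + slack) and alarm when the running
    sum crosses threshold. Returns the index of the first alarm, else None."""
    s = 0.0
    for i, e in enumerate(errors):
        s = max(0.0, s + e - target_rate - slack)
        if s > threshold:
            return i
    return None

# deterministic toy stream so the behavior is reproducible:
# 10% error rate for 1,000 cases (in control), then drift to 25%
stream = [1 if i % 10 == 0 else 0 for i in range(1000)]
stream += [1 if i % 4 == 0 else 0 for i in range(1000)]
alarm = cusum_alarm(stream, target_rate=0.10)
print("first alarm at case index:", alarm)
```

The slack term keeps the statistic near zero while the device performs at its validated error rate, so the alarm fires only after the sustained post-drift excess accumulates; tuning slack and threshold trades false alarms against detection delay.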

While performance deterioration is possible with any medical device, it is worth special consideration for AI devices, given the data-driven manner in which they are trained.

Many issues with training and validating AI models for health care have been discussed in the literature. Some of these issues include the following:

1. Uninterpretability of “black box” ML models, especially for making high-stakes decisions

2. High uncertainty of predictions from unregularized models fitted in a high-dimensional prediction space

3. Reproducibility of AI model results (i.e., variation across repeated measures of the input data)

4. Confounding, an AI model’s reliance on a spurious association with outcome

5. Nonrepresentative training data leading to bias, poor overall out-of-sample performance, and/or lack of generalizability of performance across subgroups (i.e., fairness), the lack of which may lead to health care disparities

6. Nonstandard data for training or validation:

  • Synthetic data generated by a generative adversarial network or other generative model trained on private data to enable data sharing with differential privacy guarantee
  • Data reuse, validating an updated model on the same data set on which the original model was validated

7. Imperfect reference standard

  • Misclassification of ground truth of the target condition in some subjects (e.g., natural language processing may be imperfect for deriving phenotype ground truth)
  • Weakly supervised learning from training data with some reference ground truth labels missing, incorrect, or coarsened

8. Performance deterioration of a fixed AI model because of data drift, for example, or of a learning AI model because of highly unrepresentative or adversarial cases appearing in the data stream, for example

Many medical device AI algorithms have so many parameters that they are essentially uninterpretable, hence the term “black box.” For example, in radiology, a deep neural network employed for a medical imaging task may involve a huge number of parameters consisting of weights and biases embedded in multiple hidden layers. Evaluating whether an uninterpretable AI algorithm provides clinically significant results in validation data may be problematic because clinical understanding of the parameter space, much less the algorithm itself, is lacking. In particular, the more complex an algorithm is, the more likely it has complex failure modes that are hard to identify, especially in the smaller studies commonly used for validation.

In contrast, an assay for a target condition based on measuring a single biomarker in a well-understood specimen matrix (material or medium within which the biomarker is measured such as serum, tissue, or fluid) using stable in vitro diagnostic products (e.g., reagents, calibrators, and controls) can be relatively easy to understand clinically. The single biomarker simplifies the evaluation of whether the assay provides clinically significant results, increasing the likelihood it will be adopted into clinical practice.

Attempts have been made to explain uninterpretable AI output post hoc by applying perturbation- or gradient-based methods (e.g., LIME, SHAP, or GradCAM) that quantify which input variables contributed most to the output.
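The perturbation idea can be sketched with permutation importance, a simple model-agnostic relative of LIME and SHAP: shuffle one input column and measure the resulting drop in accuracy. The "black box" below is our own toy construction, not any marketed device's algorithm.

```python
import random

def model(x):
    """Toy 'black box': depends strongly on x[0], weakly on x[1], not x[2]."""
    return 1 if (3.0 * x[0] + 0.5 * x[1]) > 1.75 else 0

def permutation_importance(predict, X, y, feature, rng):
    """Post-hoc, model-agnostic importance: the accuracy drop after
    shuffling a single input column across cases."""
    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)
    base = accuracy(X)
    col = [row[feature] for row in X]
    rng.shuffle(col)
    X_perm = [row[:feature] + [v] + row[feature + 1:]
              for row, v in zip(X, col)]
    return base - accuracy(X_perm)

rng = random.Random(3)
X = [[rng.random(), rng.random(), rng.random()] for _ in range(5000)]
y = [model(x) for x in X]  # labels generated by the model itself
drops = [permutation_importance(model, X, y, f, rng) for f in range(3)]
print("accuracy drop per permuted feature:", [round(d, 3) for d in drops])
```

Because the labels here come from the model itself, the importance ordering is known in advance (x[0] large, x[1] small, x[2] zero); real explanation methods face the harder problem of attributing behavior no one fully understands, which is where the disagreement problem arises.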

However, Satyapriya Krishna, Tessa Han, Alex Gu, Javin Pombra, Shahin Jabbari, Steven Wu, and Himabindu Lakkaraju indicate in “The Disagreement Problem in Explainable Machine Learning: A Practitioner’s Perspective” that current methods often disagree about which input variables were most important to an output. Whether disagreement among explainability methods indicates deficiencies in these methods or a lack of robustness of the particular AI model appears to be an open question. Similarly, Amirata Ghorbani, Abubakar Abid, and James Zou, in “Interpretation of Neural Networks Is Fragile,” published in the Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, showed that feature importance maps explaining which features in an image led a neural network to a particular classification can be unstable.

Absent clinical understanding of a complex AI device algorithm, statistics, then, may need to provide the foundation for performance evaluation. Fortunately, many of the same statistical principles employed for the design, conduct, and analysis of performance evaluation studies of medical products in general may be applied to AI devices in particular.

AI algorithms will play an increasingly important role in medical devices for the foreseeable future. However, enthusiasm for medical device AI could be tempered if challenges in development (e.g., high uncertainty of some AI outputs) and evaluation (e.g., of learning devices) are not addressed. Clinical uses of AI in medical devices are anticipated to grow and may be limited only by the imagination of developers and the data available. With new clinical uses will come new challenges in the design, conduct, and analysis of performance evaluation studies, creating opportunities for statisticians to develop novel designs and evaluation methodologies for each new clinical use. Many opportunities exist for statisticians to play a role in AI algorithm development and evaluation.

Further Reading

“Improving Case Definition of Crohn’s Disease and Ulcerative Colitis in Electronic Medical Records Using Natural Language Processing: A Novel Informatics Approach,” Inflammatory Bowel Diseases

“Screening: A Risk-Based Framework to Decide Who Benefits from Screening,” Nature Reviews Clinical Oncology

“Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments,” arXiv

“Multiparametric Quantitative Imaging Biomarkers for Phenotype Classification: A Framework for Development and Validation,” Academic Radiology

“Development and Evaluation of Safety and Effectiveness of Novel Cancer Screening Tests for Routine Clinical Use with Applications to Multicancer Detection Technologies,” Cancer

“Statistics in the Big Data Era: Failures of the Machine,” Statistics & Probability Letters

“Is There a Role for Statistics in Artificial Intelligence?” Advances in Data Analysis and Classification

“Test Data Reuse for the Evaluation of Continuously Evolving Classification Algorithms Using the Area Under the Receiver Operating Characteristic Curve,” SIAM Journal on Mathematics of Data Science

“Accounting for Misclassification in Electronic Health Records–Derived Exposures Using Generalized Linear Finite Mixture Models,” Health Services and Outcomes Research Methodology

International Medical Device Regulators Forum Software as a Medical Device (SaMD) Working Group Guidelines

“A Review of Adversarial Attack and Defense for Classification Methods,” The American Statistician

“A Unified Approach to Interpreting Model Predictions,” arXiv

“Predictably Unequal: Understanding and Addressing Concerns That Algorithmic Clinical Prediction May Increase Health Disparities,” NPJ Digital Medicine

“Discussion on ‘Approval Policies for Modifications to Machine Learning–Based Software as a Medical Device: A Study of Bio-Creep,'” Biometrics

“Stop Explaining Black Box Machine Learning Models for High-Stakes Decisions and Use Interpretable Models Instead,” Nature Machine Intelligence

“Explanations of Machine Learning Models in Repeated Nested Cross-Validation: An Application in Age Prediction Using Brain Complexity Features,” Applied Sciences

Towards a Standard for Identifying and Managing Bias in Artificial Intelligence, NIST Special Publication

“A Quality Assessment Tool for Artificial Intelligence–Centered Diagnostic Test Accuracy Studies: QUADAS-AI,” Nature Medicine

“Differentially Private Synthetic Mixed-Type Data Generation for Unsupervised Learning,” Intelligent Decision Technologies

“AI for Medical Imaging – Now? The ‘Doctor’ Will See You Now …,” Towards Data Science

“Real-World Evidence and Clinical Utility of KidneyIntelX on Patients with Early-Stage Diabetic Kidney Disease: Interim Results on Decision Impact and Outcomes,” Journal of Primary Care & Community Health

Summary for BriefCase, US Food and Drug Administration

Summary for FractureDetect (FX), US Food and Drug Administration

Summary for Viz ICH, US Food and Drug Administration

Summary for qER, US Food and Drug Administration

Summary for Caption Interpretation Automated Ejection Fraction, US Food and Drug Administration

Summary for QuantX, US Food and Drug Administration

Summary for ContaCT, US Food and Drug Administration

Summary for OsteoDetect, US Food and Drug Administration

Public Listing of FDA-Approved or Cleared Artificial Intelligence and Machine Learning (AI/ML)–Enabled Medical Devices, US Food and Drug Administration

Software as a Medical Device (SAMD): Clinical Evaluation, US Food and Drug Administration

Press Announcement: “FDA Permits Marketing Clinical Decision Support Software Alerting Providers of a Potential Stroke in Patients,” US Food and Drug Administration

“Artificial Intelligence and Machine Learning (AI/ML) Software as a Medical Device Action Plan: Discussion Paper,” US Food and Drug Administration

Clinical Decision Support Software: Guidance for Industry and Food and Drug Administration Staff, US Food and Drug Administration

Marketing Submission Recommendations for a Predetermined Change Control Plan for Artificial Intelligence/Machine Learning (AI/ML)–Enabled Device Software, US Food and Drug Administration

“Variable Generalization Performance of a Deep Learning Model to Detect Pneumonia in Chest Radiographs: A Cross-Sectional Study,” PLoS Med

“A Brief Introduction to Weakly Supervised Learning,” National Science Review
