August 1, 2025

Editor’s Note: Shuai Fu, Floris Fauchet, Artak Khachatryan, Ananth Kadambi, Matt Zierhut and Anna Largajolli also contributed to this article.

Machine learning in healthcare shows great potential as a tool to transform how clinical and real-world patient data are analyzed to inform treatment decisions and improve patient outcomes. This blog highlights three Certara-led examples of machine learning (ML) models in healthcare: metastatic relapse prediction in breast cancer, covariate selection using the new mlcov R package, and automation of model-based meta-analysis (MBMA). Together, they demonstrate how ML models can enhance data interpretation, accelerate insights, and support drug development. As the use of machine learning in healthcare continues to expand, ensuring model transparency, high-quality data, and expert-guided implementation remains essential for reliable and ethical decision-making.

Powerful tools to analyze clinical and real-world patient data

In recent years, health-related studies have been fueled by an explosion of patient data generated from diverse sources such as clinical trials, electronic health records (EHRs), patient registries, social media, and wearable devices. This rich and growing body of data can provide valuable insights into patient characteristics, treatment patterns, and outcomes across diverse patient populations, and it has the potential to significantly improve decision-making in healthcare. However, the breadth and complexity of the data also present methodological challenges: analyzing the data efficiently and deriving reliable insights from them is not straightforward.

Statistical methods such as regression models, survival analysis, and propensity score matching have long been essential tools for analyzing patient data to identify patterns, estimate treatment effects, and predict patient outcomes. More recently, machine learning (ML) models have emerged as powerful alternatives to traditional statistical methods for analyzing large, complex patient datasets. In this blog post, we’ll describe three use cases led by Certara scientists that apply different ML methods to patient data.

Machine learning (ML) models in healthcare are powerful tools to analyze patient data.

Using machine learning to predict patient outcomes: metastatic relapse in early-stage breast cancer

ML models have been used successfully in several studies to predict patient outcomes and to forecast the probability of health events such as disease progression, adverse events, hospitalization, and mortality. One example is the random forest, an ensemble learning method that builds multiple decision trees and merges their predictions.1

Figure 1: Workflow for covariate selection and mechanistic modeling to predict metastatic relapse in early-stage breast cancer

The first ML project we’ll discuss used a random forest algorithm to enhance the identification of key covariates for inclusion in mechanistic models.1 The aim of this study was to predict metastatic relapse in patients with early-stage breast cancer. Using a dataset of 642 patients and 21 clinical-pathological features, the authors conducted a random survival forest analysis to preselect a minimal set of the most predictive covariates. The random survival forest algorithm enabled the assessment of each covariate’s predictiveness by means of the forest-average minimal depth, a metric that quantifies a covariate’s importance based on its position within the trees of the forest: covariates that split closer to the root of the trees are considered more predictive. The preselected covariates were then evaluated in the mechanistic model using a backward selection procedure, which reduced computational time and the risk of overfitting compared with performing backward elimination on the full set of available covariates. The resulting mechanistic model was shown to fit the data accurately, with a predictive performance comparable to Cox regression. The authors proposed a personalized prediction tool for the routine management of patients with breast cancer, providing informative estimates of the invisible metastatic burden at the time of diagnosis and forward simulations of metastatic growth.
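To make the minimal depth metric more concrete, here is a small illustrative sketch in Python. It is not the authors’ code: the three toy trees, the covariate names (tumor_size, grade, age), and the rule of assigning a tree’s full depth to a covariate that never splits in that tree are all assumptions made for illustration. In practice the trees would come from a fitted random survival forest.

```python
# Illustrative sketch only: forest-average minimal depth on hand-written toy trees.
# A tree node is a dict {"var": name, "left": node, "right": node}; a leaf is None.

def minimal_depths(tree, depth=0, found=None):
    """Return {covariate: depth of its split closest to the root} for one tree."""
    if found is None:
        found = {}
    if tree is None:  # reached a leaf
        return found
    var = tree["var"]
    found[var] = min(found.get(var, depth), depth)
    minimal_depths(tree["left"], depth + 1, found)
    minimal_depths(tree["right"], depth + 1, found)
    return found

def tree_depth(tree):
    """Number of split levels on the longest root-to-leaf path."""
    if tree is None:
        return 0
    return 1 + max(tree_depth(tree["left"]), tree_depth(tree["right"]))

def forest_average_minimal_depth(forest, covariates):
    """Average each covariate's minimal depth over all trees (lower = more predictive)."""
    per_covariate = {c: [] for c in covariates}
    for tree in forest:
        depths = minimal_depths(tree)
        penalty = tree_depth(tree)  # assumed convention for covariates never used in this tree
        for c in covariates:
            per_covariate[c].append(depths.get(c, penalty))
    return {c: sum(v) / len(v) for c, v in per_covariate.items()}

def leaf_split(var):
    """Helper: a split whose two children are leaves."""
    return {"var": var, "left": None, "right": None}

# Hypothetical forest of three small trees.
toy_forest = [
    {"var": "tumor_size", "left": leaf_split("grade"), "right": None},
    {"var": "grade", "left": leaf_split("tumor_size"), "right": leaf_split("age")},
    {"var": "tumor_size", "left": None,
     "right": {"var": "grade", "left": leaf_split("age"), "right": None}},
]
covariates = ["tumor_size", "grade", "age"]

avg_depth = forest_average_minimal_depth(toy_forest, covariates)
ranking = sorted(covariates, key=lambda c: avg_depth[c])
print(avg_depth)
print("Most to least predictive:", ranking)
```

A preselection step could then keep only the covariates whose average minimal depth falls below a chosen threshold before passing them to the mechanistic model.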

Developing a new machine learning-based R package “mlcov” for covariate selection

Rebai et al.2 at Certara developed the mlcov R package for covariate selection. First, the dataset, consisting of empirical Bayesian estimates of individual parameters and covariate sets, is randomly split into five folds. Second, covariate selection is performed by applying the Lasso algorithm3 to remove irrelevant or redundant covariates, followed by the Boruta algorithm4 to identify the relevant ones. Third, a voting mechanism across folds determines the final selected covariates based on their robustness. Finally, residual plots are used to evaluate the covariate-parameter relationships: an XGBoost model is trained on the selected covariates, and the residuals are plotted against the unselected covariates to detect any remaining trends or relationships that additional covariates could capture.
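The fold-and-vote idea can be sketched in a few lines of Python. This is not the mlcov implementation: the per-fold selector below is a simple absolute-correlation threshold standing in for the Lasso and Boruta steps, the majority rule (at least 3 of 5 votes) is an assumed voting threshold, running the selector on the data left after holding out each fold is an assumed design, and the dataset is synthetic with a hypothetical NOISE covariate.

```python
# Illustrative sketch only: fold-wise covariate selection followed by voting.
import random
from collections import Counter

def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def select_in_fold(rows, target, covariates, threshold=0.5):
    """Stand-in selector (NOT Lasso/Boruta): keep covariates with |r| >= threshold."""
    y = [r[target] for r in rows]
    return {c for c in covariates
            if abs(pearson([r[c] for r in rows], y)) >= threshold}

def vote_across_folds(rows, target, covariates, n_folds=5, min_votes=3, seed=0):
    """Shuffle, split into folds, select on each hold-one-fold-out subset, then vote."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    folds = [idx[k::n_folds] for k in range(n_folds)]
    votes = Counter()
    for fold in folds:
        held_out = set(fold)
        subset = [rows[i] for i in idx if i not in held_out]
        votes.update(select_in_fold(subset, target, covariates))
    selected = sorted(c for c in covariates if votes[c] >= min_votes)
    return selected, dict(votes)

# Synthetic data: individual clearance estimates depend on weight (WGT) only.
rows = []
for i in range(300):
    wgt = 50 + (i * 7) % 50
    noise_cov = 1 if i % 2 == 0 else -1          # hypothetical irrelevant covariate
    wobble = 0.5 * ((i * 13) % 5 - 2)            # deterministic unexplained variability
    rows.append({"WGT": wgt, "NOISE": noise_cov, "CL": 2.0 + 0.1 * wgt + wobble})

selected, votes = vote_across_folds(rows, "CL", ["WGT", "NOISE"])
print("Selected covariates:", selected)
print("Votes per covariate:", votes)
```

Requiring agreement across folds is what makes the final selection robust: a covariate that looks relevant in only one subset of the data does not survive the vote.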

This framework was evaluated using a few patient data examples and compared with the traditional stepwise covariate modeling (SCM) approach. In one example, covariate effects were tested on clearance (CL/F), volume of distribution (V/F), and absorption rate constant (Ka), and the same set of covariates was tested with both SCM and mlcov:

  • CL/F: weight (WGT), albumin (ALB), creatinine clearance (CRCL), sex, race, and ethnicity;
  • V/F: WGT, ALB, sex, race, and ethnicity;
  • Ka: age, formulation (FORM), and device.

SCM identified ethnicity and sex as covariates on CL/F, whereas mlcov did not, likely because they are correlated with race and body weight, respectively. For Ka, SCM identified device, whereas mlcov identified no covariate and the residual plots showed no trends. Notably, the covariate “device” identified by SCM did not demonstrate any significant impact on the extent of absorption in the bioequivalence study.

The potential of ML methods to expand implementation of model-based meta-analysis (MBMA)

Model-based meta-analysis (MBMA) is a quantitative approach that uses published summary-level clinical data along with sponsors’ internal data. As a standard tool in the model-informed drug development (MIDD) framework, it can be applied to inform key drug development decisions. MBMA overcomes many limitations of traditional meta-analytic methods by providing a quantitative framework that leverages pharmacological information (dose, time, population characteristics, endpoint correlations, etc.) to explain observed variability in trial outcomes and to allow appropriate comparisons of treatment options within a disease area. A standard MBMA includes data from randomized clinical trials, but there is growing interest in applying it to real-world data.5 A hierarchical structure can be used to account for the heterogeneity between treatments, studies, and study designs.6
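As a rough illustration of how dose information can explain differences between trial arms, the sketch below fits an Emax dose-response curve to arm-level summary data by weighted least squares. Everything in it is an assumption made for illustration: the Emax model form, the inverse-variance weights, the grid search over ED50, and the synthetic, noise-free arm data. A real MBMA would typically use nonlinear mixed-effects models with study-level random effects rather than this simplified fit.

```python
# Illustrative sketch only: weighted least-squares Emax fit to summary-level arm data.

def emax_effect(dose, emax, ed50):
    """Emax dose-response: effect rises with dose and saturates at emax."""
    return emax * dose / (ed50 + dose)

def fit_emax_wls(arms, ed50_grid):
    """Fit emax and ed50 to arms given as (dose, mean_effect, standard_error).

    For each candidate ed50 the model is linear in emax, so emax has a
    closed-form weighted least-squares solution; the best ed50 is picked by grid search.
    """
    best = None
    for ed50 in ed50_grid:
        x = [dose / (ed50 + dose) for dose, _, _ in arms]
        y = [effect for _, effect, _ in arms]
        w = [1.0 / se ** 2 for _, _, se in arms]  # inverse-variance weights
        emax = (sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
                / sum(wi * xi * xi for wi, xi in zip(w, x)))
        wsse = sum(wi * (yi - emax * xi) ** 2 for wi, xi, yi in zip(w, x, y))
        if best is None or wsse < best[0]:
            best = (wsse, emax, ed50)
    return best[1], best[2]

# Synthetic placebo-adjusted arm means generated from emax=10, ed50=20 (no noise),
# with made-up standard errors standing in for differently sized trial arms.
doses = [5, 10, 20, 40, 80]
standard_errors = [0.8, 0.5, 0.5, 0.6, 0.9]
arms = [(d, emax_effect(d, 10.0, 20.0), se) for d, se in zip(doses, standard_errors)]

emax_hat, ed50_hat = fit_emax_wls(arms, ed50_grid=range(1, 101))
print("Estimated Emax:", round(emax_hat, 3), "Estimated ED50:", ed50_hat)
```

Because the toy data contain no noise, the fit recovers the generating parameters; with real published data, arms with smaller standard errors would pull the curve more strongly than imprecise ones.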

Constructing databases is a major impediment to conducting MBMA. ML methods have great potential to further expand the implementation of MBMA in drug development by, for example, automating many steps of the systematic review process, including search, screening, data extraction and augmentation. These methods have been utilized in the AI-powered CODEX platform7 that creates and maintains highly curated, indication-specific clinical outcomes databases.

The promise of machine learning models in healthcare to improve decision-making and patient care

The three applications referenced here illustrate the potential ML models possess as powerful tools to extract, analyze and interpret patient data.

However, patient data are inherently complex, and the performance of ML models is directly influenced by the quality of the data used for training. Missing data, data entry errors, and biases in the dataset can all impair a model’s ability to generate reliable, generalizable predictions. MBMA results can also be significantly affected by bias in published summary-level data, including poor study design, the lack of publications from studies with negative outcomes, and errors during data extraction. The AI-powered CODEX platform noted above offers one potential solution for ensuring the quality of the database.

Another challenge is that many ML models, particularly deep learning algorithms, are more difficult to interpret than traditional statistical models such as regression models. A lack of transparency makes it challenging to understand how decisions based on these models are reached, which can hinder their use in guiding clinical decisions or informing public health policy. In healthcare, where decisions can have serious or even life-threatening consequences, it is essential to ensure that ML models are both reliable and explainable. Accountability and transparency of ML algorithms are therefore critical for evaluating model performance and building trust among researchers, clinicians, policymakers, and patients. This includes understanding not only whether the input data can be generalized to the wider population but also which ML model features could influence study results. While a few existing tools such as “InterpretML”8 and “AI Explainability 360”9 can help interpret ML model results, human intervention remains essential for the careful design of ML models and for the thoughtful interpretation and communication of analysis results, so that ML is applied successfully and responsibly in healthcare studies.

In conclusion, ML models provide powerful tools to analyze and interpret patient data collected in both clinical trials and real-world settings. When used appropriately, these models can support clinical decision-making and enhance patient care. However, the successful application of ML models in healthcare studies relies on access to high-quality, representative data, careful model selection, and the ability to generate interpretable and accountable analysis results.

Certara’s experts in Real-World Evidence and Modeling can help assess whether an ML model is the right solution for your project and provide guidance on design and implementation.


Frequently Asked Questions (FAQs)

How can machine learning be used in healthcare?

Machine learning in healthcare can help researchers and clinicians draw faster, more reliable insights from large datasets, ultimately supporting better patient outcomes and more efficient drug development.

What role does machine learning play in healthcare data analysis?

Machine learning can be used in healthcare to analyze complex clinical and real-world patient data, enabling earlier and more accurate predictions, treatment personalization, and improved decision-making.

Why use machine learning in healthcare?

Machine learning (ML) in healthcare offers a powerful way to analyze complex, large-scale clinical and real-world patient data, enabling more accurate predictions, personalized treatments, and informed decision-making. ML models can complement traditional statistical methods in identifying key patient covariates, predicting clinical outcomes (such as metastatic relapse in early-stage breast cancer), and automating data-intensive processes like systematic literature reviews for model-based meta-analyses. These capabilities can help healthcare researchers and developers uncover insights efficiently and improve the reliability and scalability of evidence generation, all of which are critical for advancing patient care and drug development.

References

1 Nicolò C, Périer C, Prague M, Bellera C, MacGrogan G, Saut O, et al. Machine Learning and Mechanistic Modeling for Prediction of Metastatic Relapse in Early-Stage Breast Cancer. JCO Clin Cancer Inform. 2020;4:259-74.

2 Ibtissem Rebai VD, Ayman Akil, James Craig, Mike Talley, Anna Largajolli*, Floris Fauchet*. mlcov: New Machine Learning Based R package for Covariate Selection. PAGE2024; 2024.

3 Muthukrishnan R, Rohini R, editors. LASSO: A feature selection technique in predictive modeling for machine learning. 2016 IEEE International Conference on Advances in Computer Applications (ICACA); 2016: IEEE.

4 Kursa MB, Rudnicki WR. Feature selection with the Boruta package. Journal of Statistical Software. 2010;36:1-13.

5 Chen W, Li L, Ji S, Song X, Lu W, Zhou T. Longitudinal model–based meta-analysis for survival probabilities in patients with castration-resistant prostate cancer. European Journal of Clinical Pharmacology. 2020;76:589-601.

6 Gonzalez RDLF, Cabra A, Liu D, Gueco M, Naslazi E, Fu S, et al. Comparative Safety of Ultrasound Enhancing Agents: A Systematic Review and Bayesian Network Meta-Analysis: Comparative Safety Evaluation of Optison. The American Journal of Cardiology. 2024.

7 Certara. CODEX [cited 2025 Feb 27]. Available from: https://codex.certara.com/codex/.

8 RAJESH M, REDDY MS, D MADHU KHR, SAI K, KUMAR KS, SINGH DBR. InterpretML: A unified framework for machine learning interpretability. 2019.

9 Arya V, Bellamy RK, Chen P-Y, Dhurandhar A, Hind M, Hoffman SC, et al., editors. AI Explainability 360 toolkit. Proceedings of the 3rd ACM India Joint International Conference on Data Science & Management of Data (8th ACM IKDD CODS & 26th COMAD); 2021.

Nina Shigesi

Epidemiology Technical Consultant

Nina Shigesi is a technical consultant at Certara for the Real-World Evidence Solutions practice. She holds a PhD in Epidemiology from the University of Oxford. Nina has worked on a diverse range of projects analyzing real-world data, including data from electronic health records, claims and patient registries. She is an author and co-author of several scientific publications on real-world evidence studies.

Chiara Nicolò

Scientist, Pharmacometrics

Chiara Nicolò is a consulting scientist in pharmacometrics at Certara Drug Development Solutions. She holds a degree in Mathematics from the University of Trento and earned a PhD in Applied Mathematics from the University of Bordeaux, where she focused on modeling in oncology.

Ibtissem Rebai

Scientist, Pharmacometrics

Ibtissem Rebai is a consulting scientist in pharmacometrics at Certara Drug Development Solutions. She has expertise in mathematics and modeling, with a master’s degree in Mathematical Engineering and Biostatistics from Université Paris Cité. Ibtissem also earned a master’s degree in Data Science from Université Paris-Saclay. At Certara, she has worked on a range of projects analyzing pharmacokinetic and patient data with statistical and machine learning models.
