Leveraging Automated Machine Learning for the Analysis of Global Public Health Data: A Case Study in Malaria

approach to

weeks preceding the visit, reported use of repellent, and whether the visit occurred during the rainy season. There were five outcomes of interest (malaria detected by PCR, malaria detected by microscopy, and submicroscopic, symptomatic, or asymptomatic malaria). Analyses were stratified by site in India (Chennai, Nadiad, Rourkela) and by Plasmodium species (falciparum, vivax, any), for a total of 45 analyses (5 outcomes × 3 sites × 3 species). We analyzed the same stratified combinations of targets and features using TPOT.
As an illustration, we describe our TPOT results for one of these combinations, namely the analysis of individuals from Nadiad where the (binary) target is whether the individual had malaria (any species) detected by microscopy (58 cases, 793 controls). None of the features listed above was identified as significantly associated with the outcome in the regression analyses reported in [14]. We used TPOT to explore three increasingly complex types of pipelines: (i) a single regularized logistic regression (LR) step; (ii) a combination of feature selection, feature transformation, and LR steps; and (iii) a combination of feature selection, feature transformation, and classification steps.
For each of these three types, we ran TPOT 50 times with different random splits of the input data into training (75%) and hold-out testing (25%) portions. Moreover, to mitigate the effect of the high imbalance between the numbers of cases and controls, in each run we randomly undersampled the controls to equal the number of cases prior to the random split. Figure 1 summarizes the results, where each point represents one of the 50 runs of TPOT exploring pipelines of the type indicated by its color. The y-axis shows the testing-set accuracy of the TPOT-optimized pipeline for that run. The Kruskal-Wallis test did not detect a significant difference among the three distributions.
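The per-run resampling described above (undersample the controls to match the cases, then split 75%/25%) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the study: the function name `undersample_and_split`, the use of the run index as the random seed, and the unstratified split are all assumptions, and the labels are a toy vector that only reproduces the reported case/control counts.

```python
import random


def undersample_and_split(labels, test_fraction=0.25, seed=0):
    """Balance controls to cases, then split indices into train/test.

    Hypothetical helper for illustration; assumes binary labels
    (1 = case, 0 = control) and at least as many controls as cases.
    """
    rng = random.Random(seed)
    cases = [i for i, y in enumerate(labels) if y == 1]
    controls = [i for i, y in enumerate(labels) if y == 0]

    # Randomly keep as many controls as there are cases.
    kept_controls = rng.sample(controls, len(cases))

    # Shuffle the balanced index set and carve off the hold-out portion.
    # Assumption: the split is not stratified by class.
    balanced = cases + kept_controls
    rng.shuffle(balanced)
    n_test = round(len(balanced) * test_fraction)
    return balanced[n_test:], balanced[:n_test]


# Toy labels with the case/control counts reported for Nadiad.
labels = [1] * 58 + [0] * 793

# One (train, test) index pair per run; 50 runs, each with its own seed.
splits = [undersample_and_split(labels, seed=run) for run in range(50)]
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))
```

With 58 cases and 58 retained controls, each run works on 116 individuals, so the hold-out set has 29 individuals and the training set 87. Each index pair would then be handed to the AutoML tool for pipeline optimization and evaluation.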
For this dataset, simple pipelines of type (i) achieve good average accuracy in predicting microscopy-detected malaria (mean 0.68) when the regularized LR hyperparameters are tuned. Models of type (i) generalize those used in [14] and are easily interpretable. The best accuracy across these 50 TPOT runs is 0.86. On the other hand, the best accuracies across type (ii) and (iii) runs are approximately 0.90. There is thus a tradeoff between interpretability and complexity. Figure 2 depicts the architecture of the best pipeline of type (iii). This pipeline has a higher accuracy than the best LR pipeline but is quite complex, and it would likely not have been discovered without an AutoML approach. The most relevant features in this pipeline, in terms of driving the predictive model, were gender and antimalarial use, based on permutation importance, a standard ML method to aid model interpretability, described at https://eli5.readthedocs.io/en/latest/blackbox/permutation_importance.html#eli5-permutation-importance.
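To make the idea behind permutation importance concrete, the sketch below shuffles one feature column at a time and records the mean drop in accuracy. It is a hand-rolled, standard-library illustration of the general technique, not the eli5 implementation and not the analysis reported here: the toy data, the stand-in `predict` function, and the names `accuracy` and `permutation_importance` are assumptions made for the example.

```python
import random


def accuracy(predict, X, y):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving all other columns intact.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


# Toy data (hypothetical): feature 0 determines the label, feature 1 is noise.
data_rng = random.Random(1)
X = [[i % 2, data_rng.randint(0, 1)] for i in range(40)]
y = [row[0] for row in X]


def predict(row):
    """Stand-in for a fitted model; it uses only feature 0."""
    return row[0]


importances = permutation_importance(predict, X, y)
print(importances)
```

In this toy setting the noise feature receives an importance of exactly zero because the stand-in model never reads it, while shuffling the informative feature produces a clear drop in accuracy. In practice the same procedure would be applied to the fitted pipeline on held-out data.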
This example underscores the utility of AutoML approaches in epidemiology, especially those that let non-expert users specify the type of pipelines to explore, from very simple to very complex, while leaving the heavy lifting to the AutoML. We add that the most recent extension of TPOT enables covariate adjustment [10], which is crucial in some epidemiological settings. Embedding AutoML tools within epidemiology platforms like ClinEpiDB would empower users to directly perform sophisticated analyses, accelerating the benefits derived from these public health resources.

DATA AVAILABILITY STATEMENT
Publicly available datasets were analyzed in this study. These data can be found here: https://clinepidb.org/ce/app/record/dataset/DS_a5c969d5fa.

AUTHOR CONTRIBUTIONS
JM conceived and supervised the project. EM analyzed the data and drafted the manuscript. Both authors reviewed and edited the manuscript. Both authors read and approved the final manuscript.

FUNDING
This work was supported by National Institutes of Health grant AI116794. The funding body had no role in the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.