Summary
Machine learning and deep learning, core components of artificial intelligence, train computers to learn from diverse datasets and make informed decisions. Most recent advances in artificial intelligence come from deep learning, which has proved transformative in domains ranging from computer vision to the health sciences. Its impact on medicine has been particularly significant, reshaping traditional approaches to clinical applications. Although some medical subfields, such as pediatrics, were initially slow to take advantage of deep learning, a substantial body of research on pediatric applications has now accumulated.
This paper reviews recently developed machine learning and deep learning solutions for neonatology. Following the PRISMA 2020 guidelines, it systematically examines the roles of classical machine learning and deep learning in neonatology, explains the methodologies, including algorithmic advances, and outlines the remaining challenges in evaluating neonatal diseases. The main areas of AI application in neonatology are survival analysis, neuroimaging, analysis of vital parameters and biosignals, and the diagnosis of retinopathy of prematurity.
Drawing on 106 research articles published between 1996 and 2022, this systematic review categorizes the studies and discusses their respective merits and drawbacks. Its goal is to give a comprehensive picture of the field, explore potential directions for novel AI models, and envision the future of neonatology as AI continues to extend its influence. The review concludes by proposing roadmaps for integrating AI into neonatal intensive care units.
Introduction
The ongoing surge of artificial intelligence (AI) is reshaping diverse sectors, healthcare included, in an ever-evolving landscape. The dynamic nature of AI applications makes it challenging to keep pace with the constant innovations. Despite the profound impact of AI on daily life, many healthcare practitioners, especially in less-explored domains like neonatology, may not fully grasp the extent of AI integration into contemporary healthcare systems. This review aims to bridge this awareness gap, specifically targeting physicians navigating the intricate intersection of AI and neonatology.
The roots of AI, particularly within machine learning (ML), can be traced back to the 1950s, when Alan Turing conceptualized the “learning machine” alongside early military applications of rudimentary AI. During this era, computers were colossal, and the costs associated with expanding storage were exorbitant, limiting their capabilities. Over the ensuing decades, incremental advancements in both theoretical frameworks and technological infrastructure progressively enhanced the potency and adaptability of ML.
Understanding the mechanics of ML and its subset, deep learning (DL), is fundamental. ML, as a subset of AI, garnered attention due to its adeptness in handling data. ML algorithms and models possess the ability to learn from data, scrutinize, assess, and formulate predictions or decisions based on acquired insights. DL, a specialized form of ML, draws inspiration from the human brain’s neural networks, replicating their functionality through artificial neurons in computer neural networks. The distinctive feature of DL lies in its hierarchical architecture, facilitating the autonomous extraction of features from data, a departure from the reliance on human-engineered features in conventional ML.
Distinguishing ML from DL hinges on the complexity of models and the scale of datasets they can manage. ML algorithms prove effective across a spectrum of tasks, exhibiting simplicity in training and deployment. Conversely, DL algorithms necessitate larger datasets and intricate models but excel in tasks involving high-dimensional, intricate data, automatically identifying significant aspects without predefined elements of interest. The non-linear activation functions within DL architectures, such as artificial neural networks (ANN), contribute to the learning of complex features representative of provided data samples.
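The role of the non-linear activation can be illustrated with a very small sketch. The network below is hypothetical: its weights are hand-picked for illustration rather than learned, and the function names are not from the reviewed studies. With a ReLU activation, two hidden units are enough to reproduce the XOR pattern, which no purely linear model can represent; with the activation removed, the same weights collapse to a single linear function of the inputs.

```python
def relu(z):
    """Rectified linear unit: passes positive values, zeroes out negatives."""
    return max(0.0, z)


def tiny_network(x1, x2):
    """Two hidden ReLU units followed by a linear output unit.

    The weights are hand-picked for illustration, not learned from data.
    """
    h1 = relu(x1 + x2)        # active when at least one input is on
    h2 = relu(x1 + x2 - 1.0)  # active only when both inputs are on
    return h1 - 2.0 * h2      # linear read-out of the hidden units


def tiny_network_without_activation(x1, x2):
    """Same weights with the non-linearity removed: reduces to 2 - (x1 + x2)."""
    h1 = x1 + x2
    h2 = x1 + x2 - 1.0
    return h1 - 2.0 * h2


inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
xor_outputs = [tiny_network(a, b) for a, b in inputs]
linear_outputs = [tiny_network_without_activation(a, b) for a, b in inputs]

for pair, with_relu, without in zip(inputs, xor_outputs, linear_outputs):
    print(pair, "with ReLU:", with_relu, "| without:", without)
```

In practice the weights are found by an optimization algorithm rather than chosen by hand, and real architectures stack many such layers.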
ML and DL methods are categorized as supervised, unsupervised, or reinforcement learning according to the nature of the input-output relationship, and their applications span classification, regression, and clustering tasks. DL's success depends on the availability of large-scale data, innovative optimization algorithms, and accessible Graphics Processing Units (GPUs). Because these algorithms learn autonomously, in a way that mirrors human learning, DL has become the leading ML method and the driving force behind contemporary AI advances across medical and technological domains.
![](https://360world.tech/wp-content/uploads/2023/12/image-1024x446.png)
In deep learning (DL) applied to medical imaging, there are three primary problem categories: image segmentation, object detection (identifying organs or other anatomical/pathological entities), and image classification (e.g., diagnosis, prognosis, therapy response assessment)[^3^]. The DL algorithms commonly used in medical research fall into the following algorithmic families:
- Convolutional Neural Networks (CNNs): Mainly applied in computer vision and signal processing, CNNs excel at tasks involving fixed spatial relationships, such as imaging data. The architecture comprises phases (layers) that learn hierarchical features. Initial phases extract local features (corners, edges, lines), while subsequent phases extract more global features. Feature propagation involves nonlinearities and regularizations, which enrich the feature representation. Pooling operations reduce feature size, and the resulting features are used for predictions (segmentation, detection, or classification)[^3^][^16^].
- Recurrent Neural Networks (RNNs): Tailored for sequential data such as text, speech, and time series (e.g., clinical or electronic health records). RNNs capture temporal relationships, aiding in predicting disease progression or treatment outcomes[^11^][^17^][^18^]. Long Short-Term Memory (LSTM) models, a subtype of RNNs, address the limitations of standard RNNs by learning long-term dependencies more effectively. LSTMs use a gated memory cell to store information, allowing them to learn complex patterns, which is particularly useful in audio classification[^17^][^19^].
- Generative Adversarial Networks (GANs): A DL model class used for generating new data resembling existing datasets. In healthcare, GANs create synthetic medical images using two CNNs (a generator and a discriminator). The generator produces synthetic images that mimic real ones, while the discriminator tries to identify the artificially generated images. Adversarial training pushes the generator toward producing realistic data. GANs are versatile and are used for tasks such as image enhancement, signal reconstruction, classification, and segmentation[^20^][^21^][^22^].
- Transfer Learning (TL): Rooted in cognitive science, TL reduces annotation needs by transferring knowledge from pretrained models. However, using ImageNet-pretrained models for medical image classification can be inefficient because natural images differ from medical images[^25^][^26^]. Fine-tuning more layers of the CNN may improve accuracy, since the initial layers of ImageNet-pretrained networks may not detect low-level characteristics of medical images efficiently[^25^][^26^].
- Advanced DL Algorithms: The field evolves rapidly, and newer methods such as Capsule Networks, Attention Mechanisms, and Graph Neural Networks (GNNs)[^27^][^28^][^29^][^30^] are used for imaging and non-imaging data analysis. Capsule Networks address shortcomings of CNNs, Attention Mechanisms enhance context understanding, and GNNs show potential for both imaging and non-imaging data[^28^][^34^].
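The CNN pipeline described in the list above (local feature extraction, a nonlinearity, then pooling) can be sketched in plain Python. This is a minimal illustration under stated assumptions: the 5x5 image, the hand-written edge kernel, and the helper names `conv2d_valid`, `relu_map`, and `max_pool` are all hypothetical, and a real CNN would learn many kernels from data using a DL framework.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum products.

    As in most DL frameworks, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu_map(fmap):
    """Apply the ReLU nonlinearity to every entry of a feature map."""
    return [[max(0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling; incomplete trailing windows are dropped."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]


# Toy 5x5 "image": dark left part (0), bright right part (1), one vertical edge.
image = [[0, 0, 1, 1, 1] for _ in range(5)]

# Hand-written kernel that responds to dark-to-bright vertical transitions.
kernel = [[-1, 1],
          [-1, 1]]

features = relu_map(conv2d_valid(image, kernel))  # 4x4 local edge responses
pooled = max_pool(features, size=2)               # 2x2 summary after pooling

print(features)
print(pooled)
```

The feature map responds only in the column where the edge sits, and pooling shrinks it while keeping the strongest response in each window, which is the size reduction the CNN description refers to.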
In imaging, both data-driven and physics-driven systems are essential. Although DL methods are effective, challenges persist, particularly in MRI reconstruction, where data-driven and physics-driven algorithms are employed because acquiring fully sampled datasets is impractical.
The concept of Hybrid Intelligence, combining AI with human intellect, offers a promising collaborative approach. AI processes extensive data rapidly, while human expertise contributes context and intuition, leading to more precise decision-making. Hybrid intelligence systems hold promise for time-consuming tasks in healthcare and neonatology.
In the current landscape, AI in medicine has been in use for over a decade, with advancements in algorithms and hardware technologies contributing to its potential. Challenges remain as healthcare utilization increases, particularly in integrating AI into daily clinical practice.
Clinicians seek enhanced diagnostic tools from AI, anticipating reduced invasive tests and increased diagnostic accuracy. This systematic review aims to explore AI’s potential role in neonatology, envisioning a future where hybrid intelligence optimizes neonatal care. The study outlines AI applications, evaluation metrics, and challenges in neonatology, providing a comprehensive overview. The paper’s objectives encompass thorough explanations of AI models, categorization of neonatology-related applications, and discussions of challenges and future research directions.
To elucidate a comprehensive understanding, the objectives are:
- Provide a detailed explanation of various AI models and thoroughly articulate evaluation metrics, elucidating the key features inherent in these models.
- Categorize AI applications pertinent to neonatology into overarching macro-domains, expounding on their respective sub-domains and highlighting crucial aspects of the applicable AI models.
- Scrutinize the contemporary landscape of studies, especially those in recent years, with a specific focus on the widespread utilization of machine learning (ML) across the entire spectrum of neonatology.
- Furnish an exhaustive and well-structured overview, including a classification of Deep Learning (DL) applications deployed within the realm of neonatology.
- Assess and engage in a discourse concerning prevailing challenges linked to AI implementation in neonatology, along with contemplating future avenues for research. This aims to provide clinicians with a comprehensive perspective on the current scenario.
![](https://360world.tech/wp-content/uploads/2023/12/image-1-1024x465.png)
The review is organized into four parts:
- Elaborating on AI models and evaluation metrics
- Assessing studies applying ML in neonatology
- Assessing studies applying DL in neonatology
- Scrutinizing challenges and mapping future directions
AI is a broad concept that covers the application of computational algorithms capable of categorizing, predicting, or deriving valuable conclusions from vast datasets. Over the past three decades, various algorithms, including Naive Bayes, Genetic Algorithms, Fuzzy Logic, Clustering, Neural Networks (NN), Support Vector Machines (SVM), Decision Trees, and Random Forests (RF), have been used for tasks such as detection, diagnosis, classification, and risk assessment in medicine. Traditional machine learning (ML) methods often incorporate hand-engineered features, which consist of visual descriptions and annotations learned from radiologists and are encoded into algorithms for image classification.
Medical data encompasses diverse unstructured sources such as images, signals, genetic expressions, electronic health records (EHR), and vital signs (refer to Fig. 3). The intricate structures of these data types allow deep learning (DL) frameworks to leverage their heterogeneity, achieving high levels of abstraction in data analysis.
![](https://360world.tech/wp-content/uploads/2023/12/image-2-1024x597.png)
Whereas conventional machine learning (ML) requires manual selection and hand-crafted transformation of information from the input data, deep learning (DL) performs these tasks automatically and more effectively, discovering the relevant components by analyzing a large number of samples. The literature extensively covers the ML approaches that predate DL.
For clinicians, understanding how the recommended ML model enhances patient care is crucial. Given that a single metric cannot encapsulate all desirable attributes, it is customary to describe a model’s performance using various metrics. Unfortunately, end-users often struggle to comprehend these measurements, making it challenging to objectively compare models across different research endeavors. Currently, there is no available method or tool to compare models based on the same performance measures. This section elucidates common ML and DL evaluation metrics to empower neonatologists in adapting them to their research and understanding upcoming articles and research design.
Artificial intelligence (AI) is now applied everywhere from daily life to high-risk medical scenarios. In neonatology, its adoption has been gradual, with numerous studies emerging in the literature. These studies leverage various imaging modalities, electronic health records, and ML algorithms, some of which are still in the early stages of integration into clinical workflows. Although the field lacks dedicated systematic reviews and discussions of future directions, several studies have focused on introducing AI systems to neonatology, with limited success. Recent advancements in DL, however, have shifted research in a more promising direction. Evaluation metrics in these studies commonly include standard measures such as sensitivity (true-positive rate), specificity (true-negative rate), false-positive rate, false-negative rate, receiver operating characteristics (ROC), area under the ROC curve (AUC), and accuracy (Table 1).
Term | Definition |
---|---|
True Positive (TP) | The number of positive samples correctly identified. |
True Negative (TN) | The number of samples accurately identified as negative. |
False Positive (FP) | The number of samples incorrectly identified as positive. |
False Negative (FN) | The number of samples incorrectly identified as negative. |
Accuracy (ACC) | The proportion of correctly identified samples to the total sample count in the assessment dataset, (TP + TN) / (TP + TN + FP + FN). Accuracy lies in the range [0, 1], where 1 means every positive and negative sample was predicted correctly and 0 means none was. |
Recall (REC) | Also known as sensitivity or true positive rate (TPR): the proportion of correctly classified positive samples among all samples that are actually positive, TP / (TP + FN). |
Specificity (SPEC) | The negative-class counterpart of recall (true negative rate): the proportion of correctly classified negative samples among all samples that are actually negative, TN / (TN + FP). |
Precision (PREC) | The ratio of correctly classified samples to all samples assigned to the class; for the positive class, TP / (TP + FP). |
Positive Predictive Value (PPV) | The proportion of correctly classified positive samples among all samples classified as positive, TP / (TP + FP); for the positive class it equals precision. |
Negative Predictive Value (NPV) | The ratio of samples accurately identified as negative to all samples classified as negative, TN / (TN + FN). |
F1 score (F1) | The harmonic mean of precision and recall, which is high only when both are high. |
Cross Validation | A validation technique often employed during the training phase of modeling, in which the data are split into non-overlapping parts that take turns serving as the validation set. |
AUROC (Area under ROC curve – AUC) | The area under the ROC curve, summarizing the trade-off between sensitivity (true-positive rate) and false-positive rate across decision thresholds. Limited to the range [0, 1], where 1 represents perfect separation of positive and negative cases and 0.5 corresponds to chance-level ranking. |
ROC | A curve that plots sensitivity against the false-positive rate (1 − specificity) at varying decision thresholds, illustrating the performance of a particular predictive algorithm. |
Overfitting | Modeling failure in which the model fits the training data too closely and performs poorly on test data. |
Underfitting | Modeling failure in which the model is inadequately trained or too simple and performs poorly on both training and test data. |
Dice Similarity Coefficient | An overlap measure used in image segmentation: twice the overlap between the predicted and reference regions divided by the sum of their sizes. Limited to the range [0, 1], where 1 represents perfect overlap and 0 represents no overlap. |
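To make the definitions in Table 1 concrete, the following sketch derives the ratio-based metrics from raw confusion-matrix counts and adds simple versions of the Dice coefficient and the AUC. The function names, the example counts, the toy masks, and the toy scores are all hypothetical and chosen only for illustration; they do not come from any of the reviewed studies. The code assumes non-zero denominators and, for the AUC, uses the rank interpretation (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one).

```python
def confusion_metrics(tp, tn, fp, fn):
    """Derive the ratio metrics of Table 1 from confusion-matrix counts.

    Assumes every denominator is non-zero (both classes present and predicted).
    """
    sensitivity = tp / (tp + fn)   # recall / true-positive rate
    specificity = tn / (tn + fp)   # true-negative rate
    ppv = tp / (tp + fp)           # precision / positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "accuracy": accuracy,
        "f1": f1,
    }


def dice(pred_mask, true_mask):
    """Dice overlap of two binary masks given as flat lists of 0/1 values."""
    overlap = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    return 2 * overlap / total if total else 1.0  # two empty masks agree fully


def auc(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    wins = sum((p > n) + 0.5 * (p == n) for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical screening result: 50 affected and 50 unaffected infants.
metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")

# Hypothetical 6-pixel segmentation masks and model scores.
print("dice:", round(dice([1, 1, 0, 0, 1, 0], [1, 0, 0, 0, 1, 1]), 3))
print("auc:", round(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]), 3))
```

Note that sensitivity and PPV share a numerator but differ in the denominator: sensitivity divides by all truly positive cases, PPV by all cases the model labeled positive. This is why a model can have high sensitivity and low PPV when the condition is rare.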
Results
This systematic review adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol[^56^]. The search process was concluded on July 11, 2022. Initially, a substantial number of articles (approximately 9000) were identified, and a systematic approach was employed to screen and select pertinent articles based on their alignment with the research focus, study design, and relevance to the topic. After reviewing article abstracts, 987 studies were identified. Ultimately, our search resulted in the inclusion of 106 research articles spanning the years 1996 to 2022 (Fig. 4). The risk of bias was assessed using the QUADAS-2 tool.
![](https://360world.tech/wp-content/uploads/2023/12/image-3-1024x590.png)
![](https://360world.tech/wp-content/uploads/2023/12/image-7-1024x289.png)
Our findings are summarized in two sets of tables. Tables 2–5 outline AI techniques from the pre-deep learning era (“Pre-DL Era”) in neonatal intensive care units, categorized by type of data and application. Tables 6 and 7 cover studies from the DL Era, focusing on applications such as classification (prediction and diagnosis), detection (localization), and segmentation (pixel-level classification in medical images).
Study | Approach | Purpose | Dataset | Type of Data | Performance | Pros(+) | Cons(-) |
---|---|---|---|---|---|---|---|
Hoshino et al., 2017[^194^] | CLAFIC, logistic regression analysis | To determine optimal color parameters predicting biliary atresia (BA) | 50 neonates | Stool images (30 BA and 34 non-BA) | 100% (AUC) | + Effective and convenient modality for early detection of BA, and potentially for other related diseases | |
Dong et al., 2021[^195^] | Level Set algorithm | To evaluate postoperative enteral nutrition of neonatal high intestinal obstruction and analyze clinical treatment effect | 60 neonates | CT images | 84.7% (accuracy) | + Segmentation algorithm can accurately segment the CT image, displaying the disease location and its contour more clearly. | – EHR (not included AI analysis) – Small sample size – Retrospective design |
Ball et al., 2015[^90^] | Random Forest (RF) | To compare whole-brain functional connectivity in preterm newborns with healthy term-born neonates | 105 preterm infants and 26 term controls | Resting state functional MRI and T2-weighted Brain MRI | 80% (accuracy) | + Prospective + Connectivity differences between term and preterm brain | – Not well-established model |
Smyser et al., 2016[^88^] | Support vector machine (SVM)-multivariate pattern analysis (MVPA) | To compare resting state-activity of preterm-born infants to term infants | 50 preterm infants and 50 term-born control infants | Functional MRI data + Clinical variables | 84% (accuracy) | + Prospective GA at birth used as an indicator of the degree of disruption of brain development + Optimal methods for rs-fMRI data acquisition and preprocessing not rigorously defined | – Small sample size |
Zimmer et al., 2017[^93^] | NAF: Neighborhood approximation forest classifier of forests | To reduce the complexity of heterogeneous data population, manifold learning techniques are applied | 111 infants (NC, 70 subjects), affected by IUGR (27 subjects) or VM (14 subjects) | 3 T brain MRI | 80% (accuracy) | + Combining multiple distances related to the condition improves overall characterization and classification of the three clinical groups (Normal, IUGR, Ventriculomegaly) | – Lack of neonatal data due to challenges during acquisition and data accessibility – Small sample size |
Krishnan et al., 2017[^100^] | Unsupervised machine learning: Sparse Reduced Rank Regression (sRRR) | Variability in the Peroxisome Proliferator Activated Receptor (PPAR) pathway related to brain development | 272 infants born at less than 33 wk gestational age (GA) | Diffusion MR Imaging + Diffusion Tractography + Genome-wide Genotyping | 63% (AUC) | + Inhibited brain development controlled by genetic variables, and PPARG signaling plays a previously unknown cerebral function | – Further work required to characterize the exact relationship between PPARG and preterm brain development |
Chiarelli et al., 2019[^91^] | Multivariate statistical analysis | To better understand the effect of prematurity on brain structure and function | 88 newborns | 3 Tesla BOLD and anatomical brain MRI + Few clinical variables | Multivariate analysis using motion information could not significantly infer GA at birth | + Prematurity associated with bidirectional alterations of functional connectivity and regional volume | – Retrospective design – Small sample size |
Song et al., 2017[^94^] | Fuzzy nonlinear support vector machines (SVM) | Neonatal brain tissue segmentation in clinical magnetic resonance (MR) images | 10 term neonates | Brain MRI T1 and T2 weighted | 70%–80% (dice score-gray matter) 65%–80% (dice score-white matter) | + Nonparametric modeling adapts to spatial variability in intensity statistics arising from variations in brain structure and image inhomogeneity + Produces reasonable segmentations even in the absence of atlas prior | – Small sample size |
Taylor et al., 2017[^137^] | Machine Learning | Technology that uses a smartphone application for effectively screening newborns for jaundice | 530 newborns | Paired BiliCam images + Total serum bilirubin (TSB) levels | High-risk zone TSB level was 95% for BiliCam and 92% for TcB (P = 0.30); for identifying newborns with a TSB level of ≥17.0, AUCs were 99% and 95%, respectively (P =0.09). | + Inexpensive technology that uses commodity smartphones for effective jaundice screening + Multicenter data + Prospective design | – Method and algorithm name not explained |
Ataer-Cansizoglu et al., 2015[^134^] | Gaussian Mixture Models | i-ROP: to develop a novel computer-based image analysis system for grading plus disease in ROP | 77 wide-angle retinal images | | 95% (accuracy) | + Arterial and venous tortuosity (combined), and a large circular cropped image provided the highest diagnostic accuracy + Comparable to the performance of individual experts | – Used manually segmented images with a tracing algorithm to avoid possible noise and bias – Low clinical applicability |
Rani et al., 2016[^133^] | Back Propagation Neural Networks | To classify ROP | 64 RGB images of ROP stages taken by RetCam with 120 degrees field of view and size of 640 × 480 pixels | | 90.6% (accuracy) | | – No clinical information – Requires better segmentation – Clinical adaptation |
Karayiannis et al., 2006[^101^] | Artificial Neural Networks (ANN) | To aim at the development of a seizure-detection system | 54 patients + 240 video segments | Each of the training and testing sets contained 120 video segments (40 segments of myoclonic seizures, 40 segments of focal clonic seizures, and 40 segments of random movements) | 96.8% (sensitivity) 97.8% (specificity) | + Video analysis | – Not capable of detecting neonatal seizures with subtle clinical manifestations (subclinical seizures) or neonatal seizures with no clinical manifestations (electrical-only seizures) – No EEG analysis – Small sample size – No additional clinical information |
[^194^]: Hoshino et al., 2017
[^195^]: Dong et al., 2021
[^90^]: Ball et al., 2015
[^88^]: Smyser et al., 2016
[^93^]: Zimmer et al., 2017
[^100^]: Krishnan et al., 2017
[^91^]: Chiarelli et al., 2019
[^94^]: Song et al., 2017
[^137^]: Taylor et al., 2017
[^134^]: Ataer-Cansizoglu et al., 2015
[^133^]: Rani et al., 2016
[^101^]: Karayiannis et al., 2006
Study | Approach | Purpose | Dataset | Type of data | Performance | Pros(+) | Cons(-) |
---|---|---|---|---|---|---|---|
Reed et al., 1996[^135^] | Recognition-based reasoning | Diagnosis of congenital heart defects | 53 patients | Patient history, physical exam, blood tests, cardiac auscultation, X-ray, and EKG data | | + Useful in multiple defects | – Small sample size, Not real AI implementation |
Aucouturier et al., 2011[^148^] | Hidden Markov model architecture (SVM, GMM) | Identify expiratory and inspiratory phases from the audio recording of human baby cries | 14 infants, spanning four vocalization contexts in their first 12 months | Voice record | 86%–95% (accuracy) | + Quantify expiration duration, crying rate, and other time-related characteristics for screening, diagnosis, and research | – More data needed, No clinical explanation, Small sample size, Required preprocessing |
Cano Ortiz et al., 2004[^149^] | Artificial neural networks (ANN) | Detect CNS diseases in infant cry | 35 neonates (19 healthy, 16 sick) | Voice record (187 patterns) | 85% (accuracy) | + Preliminary result | – More data needed for correct classification |
Hsu et al., 2010[^151^] | Support Vector Machine (SVM), Service-Oriented Architecture (SOA) | Diagnose Methylmalonic Acidemia (MMA) | 360 newborn samples | Metabolic substances data collected from tandem mass spectrometry (MS/MS) | 96.8% (accuracy) | + Better sensitivity than classical screening methods | – Small sample size, SVM pilot stage education not integrated |
Baumgartner et al., 2004[^152^] | Logistic regression analysis (LRA), Support vector machines (SVM), Artificial neural networks (ANN), Decision trees (DT), k-nearest neighbor classifier (k-NN) | Focus on phenylketonuria (PKU), medium chain acyl-CoA dehydrogenase deficiency (MCADD) | Bavarian newborn screening program all newborns | Metabolic substances data collected from tandem mass spectrometry (MS/MS) | 99.5% (accuracy) | + ML techniques delivered high predictive power | – Lacking direct interpretation of knowledge representation |
Chen et al., 2013[^153^] | Support vector machine (SVM) | Diagnose phenylketonuria (PKU), hypermethioninemia, and 3-methylcrotonyl-CoA-carboxylase (3-MCC) deficiency | 347,312 infants (220 metabolic disease suspect) | Newborn dried blood samples | 99.9% (accuracy) for each condition | + Reduced false positive cases | – Feature selection strategies did not include total features |
Temko et al., 2011[^105^] | Support Vector Machine (SVM) classifier with leave-one-out (LOO) cross-validation | Measure system performance for neonatal seizure detection using EEG | 17 newborns | 267 hours clinical dataset | 89% (AUC) | + SVM-based system assists clinical staff in interpreting EEG | – No clinical variable, Difficult to obtain large datasets |
Temko et al., 2012[^104^] | SVM | Use recent advances in clinical understanding of seizure burden in neonates with hypoxic ischemic encephalopathy to improve automated detection | 17 HIE patients | 816.7 hours EEG recordings | 96.7% (AUC) | + Improved seizure detection | – Small sample size, No clinical information |
Temko et al., 2013[^115^] | Support Vector Machine (SVM) classifier with leave-one-out (LOO) cross-validation | Validate robustness of Temko 2011[^105^] | Trained in 38 term neonates, Tested in 51 neonates | Trained in 479 hours EEG recording, Tested in 2540 hours | 96.1% (AUC), Correct detection of seizure burden 70% | | – Small sample size, No clinical information |
Stevenson et al., 2013[^116^] | Multiclass linear classifier | Automatically grade one-hour EEG epoch | 54 full term neonates | One-hour-long EEG recordings | 77.8% (accuracy) | + Involvement of clinical expert, Method explained in detail | – Retrospective design |
Ahmed et al., 2016[^114^] | Gaussian mixture model, Universal Background Model (UBM), SVM | Grade hypoxic–ischemic encephalopathy (HIE) severity using EEG | 54 full term neonates | One-hour-long EEG recordings | 87% (accuracy) | + Significant assistance to healthcare professionals | – Retrospective design |
Mathieson et al., 2016[^103^] | Robust Support Vector Machine (SVM) classifier with leave-one-out (LOO) cross-validation[^115^] | Validate Temko 2013[^115^] | 70 babies from 2 centers (35 Seizure, 35 Non-Seizure) | Seizure detection algorithm thresholds clinically acceptable range | Detection rates 52.5%–75% | + Clinical information and Cohen score added | – Retrospective design |
Mathieson et al., 2016[^198^] | Support Vector Machine (SVM) classifier with leave-one-out (LOO) cross-validation[^105^] | Analyze seizure detection algorithm and characterize false negative seizures | 20 babies (10 seizure, 10 non-seizure) (20 of 70 babies)[^103^] | Seizure detections evaluated sensitivity threshold | | + Clinical information and Cohen score added | – Retrospective design |
Yassin et al., 2017[^150^] | Locally linear embedding (LLE) | Explore autoencoders for diagnosing infant asphyxia from infant cry | 600 segmented signals (284 normal cries, 316 asphyxiated cries) | | 100% (accuracy) | + 600 MFCC features distinguish normal and asphyxiated newborns | – No clinical information |
Li et al., 2011[^136^] | Fuzzy backpropagation neural networks | Establish early diagnostic system for hypoxic ischemic encephalopathy (HIE) | 140 cases (90 patients, 50 controls) | Medical records of newborns with HIE | 100% correct recognition rate for training samples, 95% for test samples | + High accuracy in early diagnosis of HIE | – Small sample size |
Zernikow et al., 1998[^84^] | ANN | Detect early occurrence of severe IVH in individual patient | 890 preterm neonates (50% validation, 50% training) | EHR | 93.5% (AUC) | + Observational study | – No image, Skipped variables during ANN training |
Ferreira et al., 2012[^138^] | Decision trees and neural networks | Identify neonatal jaundice | 227 healthy newborns | 70 variables analyzed | 89% (accuracy), 84% (AUC) | + Predicting subsequent hyperbilirubinemia with high accuracy | – Not all factors contributing to hyperbilirubinemia included |
Porcelli et al., 2010[^228^] | Artificial neural network (ANN) | Compare accuracy of birth weight–based weight curves with |
Study | Approach | Purpose | Dataset | Type of Data (Image/Non-Image) | Performance | Pros(+) | Cons(-) |
---|---|---|---|---|---|---|---|
Hauptmann et al., 2019187 | 3D (2D plus time) CNN architecture | Reconstruction of highly accelerated radial real-time data in patients with congenital heart disease | 250 CHD patients. Cardiovascular MRI with cine images | Image | +Potential use of a CNN for reconstructing real-time radial data | – | – |
Lei et al., 2022158 | MobileNet-V2 CNN | Detect PDA with AI | 300 patients 461 echocardiograms | Image | 88% (AUC) | +Diagnosis of PDA with AI | – |
Ornek et al., 2021189 | VGG16 (CNN) | Monitoring neonates’ health status (healthy/unhealthy) by focusing on dedicated regions | 38 neonates 3800 Neonatal thermograms | Image | 95% (accuracy) | +Understanding how VGG16 decides on neonatal thermograms | – |
Ervural et al., 2021190 | Data Augmentation and CNN | Detect health status of neonates | 44 neonates 880 images Neonatal thermograms | Image | 62.2% to 94.5% (accuracy) | +Significant results with data augmentation | -Less clinically applicable -Small dataset |
Ervural et al., 2021191 | Deep Siamese Neural Network (D-SNN) | Prediagnosis to experts in disease detection in neonates | 67 neonates, 1340 images Neonatal thermograms | Image | 99.4% (infection diseases accuracy), 96.4% (oesophageal atresia accuracy), 97.4% (intestinal atresia accuracy), 94.02% (necrotizing enterocolitis accuracy) | +D-SNN is effective in the classification of neonatal diseases with limited data | -Small sample size |
Ceschin et al., 2018188 | 3D CNNs | Automated classification of brain dysmaturation from neonatal MRI in CHD | 90 term-born neonates with congenital heart disease and 40 term-born healthy controls | Image | 98.5% (accuracy) | +3D CNN on a small sample size showing excellent performance using cross-validation +Cerebellar dysplasia in CHD patients | -Small sample size |
Ding et al., 2020169 | HyperDense-Net and LiviaNET | Neonatal brain segmentation | 40 neonates 24 for training 16 for experiment 3T Brain MRI T1 and T2 | Image | 94%, 95%, 92% (Dice Score) 90%, 90%, 88% (Dice Score) | +Both neural networks can segment neonatal brains, achieving previously reported performance | -Small sample size |
Liu et al., 202099 | Graph Convolutional Network (GCN) | Brain age prediction from MRI | 137 preterm 1.5-Tesla MRI Bayley-III Scales of Toddler Development at 3 years | Image | Show the GCN’s superior prediction accuracy compared to state-of-the-art methods | +The first study that uses GCN on brain surface meshes to predict neonatal brain age | -No clinical information |
Hyun et al., 2016155 | NLP and CNN (AlexNet and VGG16) | Classifying and annotating neonatal brain ultrasound scans using NLP and CNN | 2372 de-identified NS reports 11,205 NS head images | Image | 87% (AUC) | +Automated labeling | -No clinical variable |
Kim et al., 2022157 | CNN (VGG16), transfer learning | Assess whether a CNN can be trained via transfer learning to diagnose germinal matrix hemorrhage on head ultrasound | 400 head ultrasounds (200 with GMH, 200 without hemorrhage) | Image | 92% (AUC) | +First study to evaluate GMH with grade and saliency map | -Not confirmed with MRI or labeling by radiologists |
Li et al., 2021159 | ResU-Net | Diffuse white matter abnormality (DWMA) on VPI’s MR images at term-equivalent age | 98 VPI 28 VPI 3 Tesla Brain MRI T1 and T2 weighted | Image | 87.7% (Dice Score), 92.3% (accuracy) | +Developed to segment diffuse white matter abnormality on T2-weighted brain MR images of very preterm infants +3D ResU-Net model achieved better DWMA segmentation performance than multiple peer deep learning models | -Small sample size -Limited clinical information |
Greenbury et al., 2021170 | Agnostic, unsupervised ML Dirichlet Process Gaussian Mixture Model (DPGMM) | Understanding nutritional practice in neonatal intensive care | 45,679 patients over a six-year period in the UK National Neonatal Research Database (NNRD) EHR | Non-Image | Clustering on time analysis on daily nutritional intakes for extremely preterm infants born <32 weeks gestation | +Identifying relationships between nutritional practice and exploring associations with outcomes +Large national multi-center dataset | -Strong likelihood of multiple interactions between nutritional components that could be utilized in records |
Ervural et al., 2021192 | CNN Data augmentation | Detect respiratory abnormalities of neonates using limited thermal images | 34 neonates 680 images 2060 thermal images (11 testing, 23 training) Thermal camera image | Image | 85% (accuracy) | +CNN model and data enhancement methods used to determine respiratory system anomalies in neonates | -Small sample size -No follow-up and no clinical information |
Wang et al., 2018174 | DCNN | Classify and grade retinal hemorrhage automatically | 3770 newborns with retinal hemorrhage and normal controls 48,996 digital fundus images | Image | 97.85% to 99.96% (accuracy), 98.9% to 100% (AUC) | +First study to show that a DCNN can detect and grade neonatal retinal hemorrhage | – |
Brown et al., 2018171 | DCNN | Develop and test an algorithm to diagnose plus disease from retinal photographs | 5511 retinal photographs (trained) and independent set of 100 images Retinal images | Image | 94% (AUC), 98% (AUC) | +Outperforming 6 of 8 ROP experts +Completely automated algorithm detected plus disease in ROP with the same or greater accuracy as human doctors | -Disease detection, monitoring, and prognosis in ROP-prone neonates -No clinical information and no clinical variables |
Machine Learning Utilizations in Neonatal Mortality: A Comprehensive Overview
Neonatal mortality stands as a significant contributor to overall child mortality, representing 47 percent of deaths in children under the age of five, according to the World Health Organization60. The imperative to reduce global infant mortality by 203061 underscores the urgency of addressing this issue.
Machine Learning (ML) has been applied to investigate infant mortality, its determinants, and predictive modeling62,63,64,65,66,67,68. A recent study enrolled 1.26 million infants, predicting mortality as early as 5 minutes and as late as 7 days using an array of models, predominantly neural networks, random forests, and logistic regression (58.3%)67. While several studies reported favorable results, including AUC ranging from 58.3% to 97.0%, challenges such as small sample sizes and lack of dynamic parameter representation hindered broader clinical applicability67. Notably, gestational age, birth weight, and APGAR scores emerged as pivotal variables64,72. Future research recommendations emphasize external validation, calibration, and integration into healthcare practices67.
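Since most of the results above are reported as AUC, a small illustration of what that number measures may help. The sketch below is not taken from any of the cited studies; it computes AUC as the fraction of positive/negative pairs in which the positive case receives the higher score. The function name `auc` and the toy labels and scores are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    Equals the probability that a randomly chosen positive case is
    scored higher than a randomly chosen negative case (ties count 0.5).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = died, 0 = survived; scores are model risk estimates.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.6, 0.7, 0.1, 0.8]
print(f"AUC = {auc(labels, scores):.3f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which is why the lower end of the reported range (58.3%) indicates only a modest improvement over chance.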
Neonatal sepsis, encompassing early and late onset, remains a formidable challenge in neonatal care, prompting ML applications for early detection. Studies predicted early sepsis using heart rate variability and clinical biomarkers with accuracies ranging from 64% to 94%74,75.
Advancements in neonatal healthcare have reduced the incidence of severe prenatal brain injury but underscored the need to predict neurodevelopmental outcomes. ML methods, including brain segmentation, connectivity analysis, and neurocognitive evaluations, have been employed to address this. ML also aids neuromonitoring, for example through automatic seizure detection from EEG and analysis of EEG biosignals in infants with hypoxic-ischemic encephalopathy (HIE)104,105,106,107,108.
ML applications extend to predicting complications in preterm infants, including Patent Ductus Arteriosus (PDA), Bronchopulmonary Dysplasia (BPD), and Retinopathy of Prematurity (ROP). PDA detection from electronic health records (EHR) and auscultation records demonstrated accuracies of 76% and 74%, respectively123,124. ML studies predicted BPD with accuracies up to 86%, and other research aimed to predict complications related to long-term invasive ventilation128,129,130.
ROP, a leading cause of childhood blindness, has seen ML applications in diagnosing and classifying from retinal fundus images, with systems achieving up to 95% accuracy132,133,134.
ML also finds utility in the diagnosis of various neonatal diseases, utilizing EHR and medical records for conditions like congenital heart defects, HIE, IVH, neonatal jaundice, NEC, and predicting rehospitalization135,136,84,85,137,138,139,142,143.
Furthermore, ML has been applied to analyze electronically captured physiologic data for artifact detection, late-onset sepsis prediction, and overall morbidity evaluation144,145,146.
In addressing metabolic disorders of newborns, ML methods, especially Support Vector Machines (SVM), have been employed for conditions like methylmalonic acidemia (MMA), phenylketonuria (PKU), and medium-chain acyl CoA dehydrogenase deficiency (MCADD)151,152,153. Notably, ML contributes to improving the positive predictive value in newborn screening programs for these disorders152.
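To make the positive predictive value (PPV) point concrete, the following sketch shows why PPV is low when a condition is rare, and how a modest gain in specificity raises it. The sensitivity, specificity, and prevalence values are illustrative assumptions, not figures from the cited screening studies, and the function names are hypothetical.

```python
def positive_predictive_value(true_positives, false_positives):
    """PPV = TP / (TP + FP): the share of positive screens that are true cases."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no positive screens")
    return true_positives / flagged


def ppv_from_rates(sensitivity, specificity, prevalence):
    """PPV derived from test characteristics and disease prevalence."""
    tp = sensitivity * prevalence
    fp = (1.0 - specificity) * (1.0 - prevalence)
    return tp / (tp + fp)


# Assumed values for a rare disorder (1 in 10,000 births).
baseline = ppv_from_rates(sensitivity=0.99, specificity=0.999, prevalence=1e-4)
improved = ppv_from_rates(sensitivity=0.99, specificity=0.9999, prevalence=1e-4)
print(f"baseline PPV = {baseline:.3f}, improved PPV = {improved:.3f}")
```

Under these assumed rates, most positive screens are false alarms at baseline, so a classifier that removes false positives without losing true cases has a large effect on PPV.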
In summary, ML applications span a wide spectrum in neonatal care, from predicting mortality and sepsis to assessing neurodevelopmental outcomes and complications in preterm infants, showcasing its diverse and impactful role in improving neonatal healthcare.
Deep Learning Advancements in Neonatology
Deep Learning in clinical image analysis serves three primary purposes: classification, detection, and segmentation. Classification focuses on identifying specific features in an image, detection involves locating multiple features within an image, and segmentation entails dividing an image into multiple parts7,9,154,155,156,157,158,159,160.
AI-Enhanced Neuroradiological Assessment in Neonatology
Neonatal neuroimaging plays a crucial role in identifying early signs of neurodevelopmental abnormalities, enabling timely intervention during a period of heightened neuroplasticity and rapid cognitive and motor development. The application of Deep Learning (DL) methods enhances the diagnostic process, providing earlier insights than traditional clinical signs would indicate.
However, imaging an infant’s brain using Magnetic Resonance Imaging (MRI) poses challenges due to lower tissue contrast, regional heterogeneity, age-related intensity variations, and the impact of partial volume effects. To address these issues, specialized computational neuroanatomy tools tailored for infant-specific MRI data are under development. The typical pipeline for predicting neurodevelopmental disorders from infant structural MRI involves image preprocessing, tissue segmentation, surface reconstruction, and feature extraction, followed by AI model training and prediction.
Segmenting a newborn’s brain is particularly challenging due to decreased signal-to-noise ratio, motion restrictions, and the smaller brain size. Various non-DL-based approaches, including parametric, classification, multi-atlas fusion, and deformable models, have been proposed for newborn brain segmentation. Segmentation accuracy is typically evaluated with the Dice Similarity Coefficient, which measures the overlap between a predicted segmentation and a reference segmentation.
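As a minimal sketch of the Dice Similarity Coefficient mentioned above, the function below computes 2|A∩B| / (|A| + |B|) for two binary masks given as flat sequences. The function name `dice` and the toy masks are hypothetical; real pipelines would operate on 3D image volumes.

```python
def dice(pred, ref):
    """Dice Similarity Coefficient between two binary masks.

    pred, ref: equal-length sequences where a truthy value marks a voxel
    as belonging to the structure of interest.
    """
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    pred_set = {i for i, v in enumerate(pred) if v}
    ref_set = {i for i, v in enumerate(ref) if v}
    if not pred_set and not ref_set:
        return 1.0  # both masks empty: treated here as perfect agreement
    return 2.0 * len(pred_set & ref_set) / (len(pred_set) + len(ref_set))


# Hypothetical 1D toy masks: model prediction vs. expert annotation.
predicted = [0, 1, 1, 1, 0, 0]
reference = [0, 0, 1, 1, 1, 0]
print(f"Dice = {dice(predicted, reference):.3f}")
```

A score of 1 means the masks coincide exactly and 0 means they do not overlap at all, which is how Dice values such as those reported in the tables should be read.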
The future of neonatal brain segmentation research involves developing more sophisticated neural segmentation networks. Despite advancements, the slow progress in the field of artificial intelligence in neonatology is attributed to a lack of open-source algorithms and limited datasets.
Further research should focus on enhancing DL accuracy in diagnosing conditions such as germinal matrix hemorrhage. Comparisons between DL and sonographers in identifying suspicious studies, grading hemorrhages accurately, and improving diagnostic capabilities of head ultrasound in diverse clinical scenarios warrant attention.
The evaluation of prematurity complications using DL in neonatology encompasses various applications, including disease prediction, MR image analysis, combined EHR data analysis, and predicting neurocognitive outcomes and mortality. DL proves effective in detecting conditions like PDA, IVH, BPD, ROP, and retinal hemorrhage. Additionally, DL contributes to treatment planning, NICU discharge, personalized medicine, and follow-up care.
DL’s potential in ROP screening programs is notable, offering cost-effective solutions for detecting severe cases that require therapy. Studies show DL outperforming experts in diagnosing plus disease and quantifying the clinical progression of ROP. DL applications extend to sleep protection in the NICU, real-time evaluation of cardiac MRI for congenital heart disease, classification of brain dysmaturation from neonatal brain MRI, and disease classification from thermal images.
Two notable studies highlight the impact of AI on nutrition practices in the NICU and on the use of wireless sensors. Unbiased, data-driven ML techniques show the potential to change clinical practice and improve monitoring, helping to prevent iatrogenic injuries in neonatal care.
Discussion
The studies in neonatology involving AI were systematically categorized based on three primary criteria:
(i) Whether the studies utilized Machine Learning (ML) or Deep Learning (DL) methods,
(ii) Whether imaging data or non-imaging data were employed in the studies, and
(iii) Whether the primary aim of the study was diagnosis or another predictive task.
In the pre-Deep Learning era, the majority of neonatology studies were conducted using ML methods. Specifically, we identified 12 studies that utilized ML along with imaging data for diagnostic purposes. Furthermore, 33 studies employed non-imaging data for diagnostic applications. The spectrum of imaging data studies covered diverse areas such as biliary atresia (BA) diagnosis based on stool color, postoperative enteral nutrition for neonatal high intestinal obstruction, functional brain connectivity in preterm infants, retinopathy of prematurity (ROP) diagnosis, neonatal seizure detection from video records, and newborn jaundice screening. Non-imaging studies for diagnosis included congenital heart defects diagnosis, baby cry analysis, inborn metabolic disorder diagnosis and screening, hypoxic-ischemic encephalopathy (HIE) grading, EEG analysis, patent ductus arteriosus (PDA) diagnosis, vital sign analysis, artifact detection, extubation and weaning analysis, and bronchopulmonary dysplasia (BPD) diagnosis.
In contrast, studies involving Deep Learning applications were less prevalent than those using Machine Learning. DL studies focused on brain segmentation, intraventricular hemorrhage (IVH) diagnosis, EEG analysis, neurocognitive outcome prediction, and PDA and ROP diagnosis. The volume of DL research is expected to grow, and several articles on the application of AI in neonatology have already appeared. However, many of these lack sufficient detail, making it challenging to evaluate and compare them comprehensively and limiting their utility for clinicians.
Several limitations exist in the application of AI in neonatology, including the absence of prospective design, challenges in clinical integration, small sample sizes, and evaluations limited to single centers. DL has demonstrated potential in extracting information from clinical images, bioscience, and biosignals, as well as integrating unstructured and structured data in Electronic Health Records (EHR). However, key concerns related to DL in medicine include difficulties in clinical integration, the need for expertise in decision mechanisms, lack of data and annotations, insufficient explanations and reasoning capabilities, limited collaboration efforts across institutions, and ethical considerations. These challenges collectively impact the success of DL in the medical domain and are categorized into six components for further examination.
Challenges in Integrating Clinical Practices
Challenges in Integrating AI into Neonatal Healthcare: A Perspective on Clinical Trials
Despite the significant advancements in AI accuracy within the healthcare domain, translating these achievements into practical treatment pathways faces multiple hurdles. One major concern among physicians is the lack of well-established randomized clinical trials, particularly in pediatrics, demonstrating the reliability and enhanced effectiveness of AI systems compared to traditional methods for diagnosing neonatal diseases and recommending suitable therapies. Comprehensive discussions on the pros and cons of such studies are presented in tables and relevant sections. Current research predominantly focuses on imaging-based or signal-based investigations, often centered around a specific variable or disease. Neonatologists and pediatricians express the need for evidence-based algorithms with proven efficacy. Remarkably, there are only six prospective clinical trials in neonatology involving AI. One notable trial, supported by the European Union Cost Program, explores the detection of neonatal seizures using conventional EEG in the NICU197. Another study investigates the physiological effects of music in premature infants208, though it does not employ AI analysis. A recent trial, “Rebooting Infant Pain Assessment: Using Machine Learning to Exponentially Improve Neonatal Intensive Care Unit Practice (BabyAI),” is currently recruiting209. Another ongoing study aims to collect real-time data on pain signals in non-verbal infants using sensor-fusion and machine learning algorithms210. However, no results have been submitted yet. Similarly, the “Prediction of Extubation Readiness in Extreme Preterm Infants by the Automated Analysis of Cardiorespiratory Behavior: APEX study” completed recruitment with 266 infants, but results are pending211. In summary, there is a notable scarcity of prospective multicenter randomized AI studies with published results in the neonatology field.
Addressing this gap requires planning clinically integrated prospective studies that incorporate real-time data collection, considering the rapidly changing clinical circumstances of infants and the inclusion of multimodal data with both imaging and non-imaging components.
Requisite Expertise in Decision-Making Mechanisms
In neonatology, deciding whether to heed a system’s recommendation may require the presentation of corroborating evidence95,96,125,202. Many proposed AI solutions in the medical domain are not meant to supplant the decision-making or expertise of physicians but rather to serve as valuable aids. In the challenging landscape of neonatal survival without sequelae, AI could revolutionize neonatology. The diverse spectrum of neonatal diseases, coupled with clinical presentations that vary with gestational age and postnatal age, complicates accurate diagnosis for neonatologists. AI holds the potential for early disease detection, offering crucial assistance to clinicians for prompt responses and favorable therapeutic outcomes.
Neonatology involves collaborative efforts across multiple disciplines in patient management, presenting an opportunity for AI to achieve unprecedented levels of efficacy. With increased resources and support from physicians, AI could make significant contributions to neonatology. Collaboration extends to various pediatric specialties, such as perinatology, pediatric surgery, radiology, pediatric cardiology, pediatric neurology, pediatric infectious disease, neurosurgery, cardiovascular surgery, and other subspecialties. These multidisciplinary workflows, involving patient follow-up and family engagement, could benefit from AI-based predictive analysis tools to address potential risks and neurological issues. AI-supported monitoring systems, capable of analyzing real-time data from monitors and detecting changes simultaneously, would be valuable not only for routine Neonatal Intensive Care Unit (NICU) care but also for fostering “family-centered care”212,213 initiatives. While neonatologists remain central to decision-making and communication with parents, AI could actively contribute to NICU practices. Hybrid intelligence offers a platform for monitoring both abrupt and subtle clinical changes in infants’ conditions.
The limited understanding of Deep Learning (DL) among many medical professionals poses a challenge in establishing effective communication between data scientists and medical specialists. A considerable number of medical professionals, including pediatricians and neonatologists, lack familiarity with AI and its applications due to limited exposure to the field. However, efforts are underway to bridge this gap, with clinicians taking the lead in AI initiatives through conferences, workshops, courses, and even coding schools214,215,216,217,218.
Looking ahead, neonatal critical conditions are likely to be monitored by human-in-the-loop systems, and AI-empowered risk classification systems may assist clinicians in prioritizing critical care and allocating resources precisely. While AI cannot replace neonatologists, it can serve as a clinical decision support system in the dynamic and urgent environment of the NICU, which calls for prompt responses.
Challenges Stemming from Insufficient Imaging Data, Annotations, and Reproducibility Issues
There is a growing interest in leveraging deep learning methodologies for predicting neurological abnormalities using connectome data; however, their application in preterm populations has been restricted. Similar to most deep learning (DL) applications, these models often necessitate extensive datasets. Yet, obtaining large neuroimaging datasets, especially in pediatric settings, remains challenging and costly. DL’s success depends on high-capacity models trained with numerous well-labeled examples, which poses a significant challenge in the realm of neonatal AI applications.
Accurate labeling demands considerable physician effort and time, exacerbating the existing challenges. Unfortunately, there is a lack of large-scale collaboration between physicians and data scientists that could streamline data gathering, sharing, and labeling processes. Overcoming these challenges holds the promise of utilizing DL in prevention and diagnosis programs, fundamentally transforming clinical practice. Here, we explore the potential of DL in revolutionizing various imaging modalities within neonatology and child health.
The need for a massive volume of data is a significant impediment, especially with the increasing sophistication of DL architectures. However, collecting a substantial amount of clean, verified, and diverse data for various neonatal applications is challenging. Data augmentation techniques and building models with shallow networks are proposed solutions, though they may not be universally applicable. Additionally, issues arise in generalizing models to new data, especially considering variations in MRI contrasts, scanners, and sequences between institutions. Continuous learning strategies are suggested to address this challenge.
Most studies lack open-source algorithms and fail to clarify validation methods, introducing methodological bias. Reproducibility becomes a crucial concern for comparing algorithm success. Furthermore, the lack of explanations and reasoning in widely used DL models poses a risk, especially in high-stakes medical settings. Trustworthiness is imperative for the widespread adoption of AI in neonatology.
Collaboration efforts across multiple institutions face privacy concerns related to cross-site sharing of imaging data. Federated learning has been proposed to address privacy issues, but data heterogeneity may impact model efficacy. Ethical concerns, including informed consent, bias, safety, transparency, patient privacy, and allocation, add complexity to health AI. The implementation of an ethics framework in neonatology AI is yet to be reported.
Despite these challenges, the potential benefits of AI in healthcare are substantial, including increased speed, cost reduction, improved diagnostic accuracy, enhanced efficiency, and increased access to clinical information. AI’s impact on neonatal intensive care units and healthcare, while promising, requires clinicians’ support, emphasizing the need for collaboration between AI researchers and clinicians for successful implementation in neonatal care.
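The federated learning idea mentioned in this section can be illustrated with a minimal sketch of one aggregation round. It assumes the commonly used federated averaging scheme, in which each site trains locally and shares only model parameters, and a central server averages them weighted by local sample count. This is one possible scheme, not necessarily the one used in the cited work; the function name and the two-site numbers are hypothetical.

```python
def federated_average(site_weights, site_sizes):
    """Combine per-site model parameters into one global parameter vector.

    site_weights: list of parameter vectors, one per hospital.
    site_sizes:   number of local training samples at each hospital.
    Only parameters leave each site; the images themselves are never shared.
    """
    if len(site_weights) != len(site_sizes) or not site_weights:
        raise ValueError("need one sample count per site")
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[j] * n for w, n in zip(site_weights, site_sizes)) / total
        for j in range(n_params)
    ]


# Hypothetical example: two NICUs with 100 and 300 local scans.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [100, 300])
print(global_weights)
```

Because larger sites dominate the average, strong differences between sites (for example, different scanners or patient populations) can pull the global model toward one institution's data, which is the heterogeneity concern noted above.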
Approaches Employed
Review of Literature and Search Methodology
We conducted a comprehensive literature search using databases such as PubMed™, IEEEXplore™, Google Scholar™, and ScienceDirect™ to identify publications related to the applications of AI, ML, and DL in neonatology. Our search strategy involved various combinations of keywords, including technical terms (AI, DL, ML, CNN) and clinical terms (infant, neonate, prematurity, preterm infant, hypoxic ischemic encephalopathy, neonatology, intraventricular hemorrhage, infant brain segmentation, NICU mortality, infant morbidity, bronchopulmonary dysplasia, retinopathy of prematurity). The inclusion criteria were publications dated between 1996 and 2022, focusing on AI in neonatology, written in English, published in peer-reviewed journals, and objectively assessing AI applications in neonatology. Exclusions comprised review papers, commentaries, letters to the editor, purely technical studies without clinical context, animal studies, statistical models like linear regression, non-English language studies, dissertation theses, posters, biomarker prediction studies, simulation-based studies, studies involving infants older than 28 days, perinatal death studies, and obstetric care studies. The initial search yielded around 9000 articles, from which 987 relevant research papers were identified through careful abstract examination (Fig. 4). Ultimately, 106 studies meeting our criteria were selected for inclusion in this systematic review (see Supplementary file). The evaluation covered diverse aspects, including sample size, methodology, data types, evaluation metrics, as well as the strengths and limitations of the studies (Tables 2–7).
Availability of Data
Dr. E. Keles and Dr. U. Bagci have complete access to all study data, ensuring data integrity and accuracy. All study materials are accessible upon reasonable request from the corresponding author.