
A team of researchers based in Japan recently published the findings of their research (Miyata et al., 2025). They set out to answer the question of whether humans or artificial intelligence systems perform better at using CTG data to predict which babies will be born with “perinatal asphyxia”. The team had previously developed an artificial intelligence model for CTG interpretation that was trained using both deep learning and machine learning approaches. (No – I don’t know what the difference between the two is, just that there is one!) These approaches go beyond telling the computer to look out for the sort of CTG features that we humans have been trained to look at. The computer is essentially given free rein to find other patterns in the data.
What did the researchers do?
First, they collected CTG data from women who gave birth in Japan between 2013 and 2017, along with clinical information about the outcomes for the baby. The data came from a central fetal monitoring system called Toitu. (I’d love to know whether women consented to the use of their data for research purposes or not.) A 30-minute segment of CTG occurring up to 20 minutes prior to the birth was analysed for the study. The final data set used 489 CTGs from women with one baby, born after 34 weeks of gestation. Of these, 31 women had babies with “perinatal asphyxia” – a relatively small data set. “Perinatal asphyxia” was defined as an Apgar score of less than 6 at 1 minute or 5 minutes of age, or an umbilical artery pH of under 7.1. It’s a slightly odd choice as it isn’t the commonly used definition for perinatal asphyxia. Low Apgar scores and cord blood acidosis are only weakly associated with clinically relevant perinatal outcomes (Dalili et al., 2016; Johnson et al., 2021; Leinonen et al., 2019). I would have liked to see them use a more clinically relevant end point – like encephalopathy; or at least to use the same definition of perinatal asphyxia that others do.
All 489 CTGs were analysed by the artificial intelligence algorithm (using either a deep learning or a machine learning approach) and by a panel of clinicians. This panel consisted of 34 obstetricians and 22 midwives, who were each asked to review 50 CTGs (10 with “asphyxia” and 40 without) and answer one question: would this baby have “perinatal asphyxia”? Neither the clinicians nor the computer system was given additional clinical information – just the CTG.
What did they find?
The machine learning approach had a very similar negative predictive value to the deep learning approach. Negative predictive value is the proportion of babies with a CTG interpreted as normal who really do not have “asphyxia”. When the CTG was interpreted as normal, 93.48% (machine learning) and 93.98% (deep learning) of babies were unaffected. Clinicians performed ever so slightly better here – at 94.61%. While the negative predictive value is high for all, this still means that between 5% and 7% of babies (roughly one in every 15 to 20) whose CTG pattern looked “normal” in the period immediately before birth were born with “asphyxia”.
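For anyone who wants to check the arithmetic behind that “one in so many” figure, here is a minimal Python sketch. It uses only the three negative predictive values quoted above; the function names are my own and purely illustrative.

```python
# Negative predictive values quoted in this post (percent of "normal"
# CTG calls where the baby was in fact unaffected).
npv_percent = {
    "machine learning": 93.48,
    "deep learning": 93.98,
    "clinicians": 94.61,
}


def missed_rate(npv: float) -> float:
    """Percent of babies with a 'normal' CTG call who still had 'asphyxia'."""
    return 100.0 - npv


def one_in_n(percent: float) -> float:
    """Turn a percentage into the N of 'one in N'."""
    return 100.0 / percent


for reader, npv in npv_percent.items():
    missed = missed_rate(npv)
    print(f"{reader}: {missed:.2f}% missed, about 1 in {one_in_n(missed):.0f}")
```

Run as written, this works out to about 1 in 15 for machine learning, 1 in 17 for deep learning, and 1 in 19 for clinicians.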
Accuracy rates were slightly lower – that is the ability of a CTG recording to identify a baby who does have “asphyxia”. Humans correctly identified 87.53% of affected babies, the machine learning algorithm 88.14%, and deep learning 78.12%. For human interpretation, this means that for every 100 CTG patterns that were quite abnormal, 12 of the babies would be born in good condition.
What would happen in real life, however, is that a human would look at the CTG when the computer flagged it as abnormal, adding their own interpretation over the top of the one the computer offered. So the researchers also looked at what happened when human AND deep learning interpretation were combined. The accuracy (the ability to predict who would have “asphyxia”) improved slightly to 92.43% but the negative predictive value (the ability to tell who would not have “asphyxia”) remained about the same at 94.13%.
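Accuracy and negative predictive value are both read off the same two-by-two table of CTG call against outcome. In the strict statistical sense, accuracy is the share of all CTGs called correctly, which is not the same thing as the share of affected babies detected (sensitivity) or the share of abnormal calls that turned out to be right (positive predictive value). The sketch below shows how they can diverge. The counts are invented for illustration, loosely shaped like one reviewer’s set of 50 CTGs (10 affected, 40 unaffected), and are not taken from the study; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # affected baby, CTG called abnormal
    fn: int  # affected baby, CTG called normal
    fp: int  # unaffected baby, CTG called abnormal
    tn: int  # unaffected baby, CTG called normal


def metrics(c: Counts) -> dict[str, float]:
    """Standard summary measures from a two-by-two table."""
    total = c.tp + c.fn + c.fp + c.tn
    return {
        "accuracy": (c.tp + c.tn) / total,        # all correct calls
        "sensitivity": c.tp / (c.tp + c.fn),      # affected babies detected
        "specificity": c.tn / (c.tn + c.fp),      # unaffected babies cleared
        "ppv": c.tp / (c.tp + c.fp),              # abnormal calls that were right
        "npv": c.tn / (c.tn + c.fn),              # normal calls that were right
    }


# Invented counts, NOT study data: 10 affected and 40 unaffected babies.
example = Counts(tp=6, fn=4, fp=2, tn=38)

for name, value in metrics(example).items():
    print(f"{name}: {value:.1%}")
```

With these made-up counts, accuracy comes out at 88% even though only 6 of the 10 affected babies were picked up, which is why it helps to know exactly which measure a paper is reporting.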
So what does this mean?
On the face of it, this research provides reassurance that midwives and obstetricians still do a better job than computers when it comes to CTG interpretation. BUT – there are some important limitations to bear in mind when you think about these findings.
First, they only apply to women with a singleton pregnancy who are more than 34 weeks pregnant, giving birth in a context of care similar to that of Japan. We aren’t given any clinical information about why CTG monitoring was in use, so the findings might relate only to women with particular clinical situations, and not to women with uncomplicated pregnancies. While the authors make no mention of the timing of cord clamping, my assumption is that this was done before the establishment of respiration. So, the findings might not apply when optimal cord clamping is used. And finally, the study is small in size and uses surrogate markers that are themselves not good predictors of poor clinical outcomes.
The biggest concern I have about the design of this study is the exclusion of clinical data from both the algorithm and the decision-making by clinicians. All maternity professionals know (or should know!) the importance of “looking at the big picture” when interpreting the CTG. This study doesn’t help us to know how either computer interpretation or human interpretation would perform in the real-life situation where a wealth of other data are factored into decision-making.
The authors concluded that “these findings suggest that artificial intelligence can assist in reducing human errors and false-positive rates, potentially leading to safer delivery outcomes”. They also said that “the development of CTG monitoring during delivery using automated devices, which are able to ameliorate the shortage of human resources or prevent misinterpretation, is required in both situations to promote neonatal welfare”. I think it is pushing the envelope too far to make either of these claims on the basis of the findings of this study, particularly in light of the absence of clinical information in the decision-making.
The take home message
We need to resist the lure of the argument that good clinicians are hard to come by, expensive, and well – humanly messy and demanding; and therefore the easier solution is to replace them with machines. This research shows that even with no clinical information available, people still marginally outperformed the machine. It is possible that humans would vastly outperform the algorithm in a typical clinical environment where all relevant clinical information was available.
We need midwives and nurses and doctors in all maternity services. Not just for fetal heart rate monitoring but for ALL the other ways they make maternity care better for women (bread and roses too!). We must stand up and continue to fight for governments to provide workplaces that nurture their maternity staff, rather than seek to replace them with machines.

References
Alfirevic, Z., Devane, D., Gyte, G. M. L., & Cuthbert, A. (2017, Feb 03). Continuous cardiotocography (CTG) as a form of electronic fetal monitoring (EFM) for fetal assessment during labour. Cochrane Database of Systematic Reviews, 2(CD006066), 1-137. https://doi.org/10.1002/14651858.CD006066.pub3
Dalili, H., Sheikh, M., Hardani, A. K., Nili, F., Shariat, M., & Nayeri, F. (2016). Comparison of the Combined versus Conventional Apgar Scores in Predicting Adverse Neonatal Outcomes. PLoS ONE, 11(2), e0149464. https://doi.org/10.1371/journal.pone.0149464
Johnson, G. J., Salmanian, B., Denning, S. G., Belfort, M. A., Sundgren, N. C., & Clark, S. L. (2021, Sep 1). Relationship between umbilical cord gas values and neonatal outcomes: Implications for electronic fetal heart rate monitoring. Obstetrics & Gynecology, 138(3), 366-373. https://doi.org/10.1097/AOG.0000000000004515
Leinonen, E., Gissler, M., Haataja, L., Andersson, S., Rahkonen, P., Rahkonen, L., & Metsäranta, M. (2019, Apr 07). Umbilical artery pH and base excess at birth are poor predictors of neurodevelopmental morbidity in early childhood. Acta Paediatrica, 108(10), 1801-1810. https://doi.org/10.1111/apa.14812
Miyata, K., Shibata, C., Fukunishi, H., Hemmi, K., Kinoshita, H., Hirakawa, T., Urushiyama, D., Kurakazu, M., & Yotsumoto, F. (2025). Cardiotocography-Based Experimental Comparison of Artificial Intelligence and Human Judgment in Assessing Fetal Asphyxia During Delivery. Cureus, 17(1), e78282. https://doi.org/10.7759/cureus.78282