Birth Small Talk

Fetal monitoring information you can trust

Would you let ChatGPT interpret your CTG?

What did they do?

The CTGs of sixty women with singleton pregnancies between 37 and 42 weeks of gestation, each of at least 30 minutes’ duration, were selected. (The authors don’t say whether these were made during pregnancy or labour.) All had been interpreted at the time by a clinician using the FIGO guideline. Four different LLMs were assessed: ChatGPT v4, Gemini v2, Bing Copilot, and DeepSeek. Each was provided with identical prompts on what to assess. No clinical information was provided.

Two phases of assessment were then conducted. In the first, thirty of the CTGs were presented to the LLMs. All had been classified as “normal” by expert clinicians. In the second phase, only the LLMs that showed some potential were presented with a further thirty CTGs, this time ones considered suspicious or pathological.

What did they find?

DeepSeek was unable to process the CTG traces at all. Gemini performed particularly poorly, categorising 28 of the 30 normal traces as non-reassuring. ChatGPT was better but still left a lot to be desired, calling 16 of the 30 suspicious. Bing Copilot performed the best, identifying 29 of the 30 as normal.

On the second run, with abnormal CTG traces, Bing Copilot failed miserably, however, identifying 24 as normal and 6 as uninterpretable. ChatGPT was only able to identify 15 as abnormal.
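To put those counts in familiar diagnostic-test terms, here is a minimal sketch of the implied specificity and sensitivity for the two models that completed both phases. It assumes (as the study design implies, though the paper may report things differently) that all phase-one traces were truly normal, all phase-two traces truly abnormal, and that anything not called “normal” counts as an abnormal call:

```python
# Specificity: share of truly normal traces (phase 1, n=30) called normal.
# Sensitivity: share of truly abnormal traces (phase 2, n=30) called abnormal.
def specificity(called_normal: int, n_normal: int = 30) -> float:
    return called_normal / n_normal

def sensitivity(called_abnormal: int, n_abnormal: int = 30) -> float:
    return called_abnormal / n_abnormal

# Counts as reported above:
# ChatGPT: 14/30 normals called normal (16 called suspicious); 15/30 abnormals caught.
# Bing Copilot: 29/30 normals called normal; 0/30 abnormals caught
#               (24 called normal, 6 uninterpretable).
print(f"ChatGPT:      specificity {specificity(14):.2f}, sensitivity {sensitivity(15):.2f}")
print(f"Bing Copilot: specificity {specificity(29):.2f}, sensitivity {sensitivity(0):.2f}")
```

On this reading, no model gets anywhere near a clinically usable trade-off: Bing Copilot’s apparently strong phase-one showing collapses to a sensitivity of zero.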

Sigh…

The authors started this paper with a lie, saying:

By identifying early signs of hypoxia, CTG aims to guide timely interventions that reduce the risk of adverse perinatal outcomes such as hypoxic-ischemic encephalopathy and stillbirth.

They cited the Grivell et al. (2015) Cochrane review on antenatal CTGs – the one that showed a statistically non-significant but clinically worrying increase in perinatal mortality with CTG use in the antenatal period. (Seriously, my fellow peer reviewers – this should not have got through to publication unchallenged!) So we were never off to a great start here…

It is clear that attempting to use artificial intelligence systems designed primarily for language use to interpret fetal heart rate data is a very bad idea. This is what is known in my family as Nan Jan research, after my mum. She would often say, “well I could have told them that, and saved them all that money!”.

The take-away message here is: don’t try this one at your workplace. Use your own knowledge of fetal heart rate interpretation and do NOT ask ChatGPT (or some other LLM).



References

Psilopatis, I., Monod, C., Filippi, V., Tschudin, R., Lapaire, O., Emons, J., Mosimann, B., & Zwimpfer, T. (2025). A comparative evaluation of publicly available large language models in the assessment of CTG traces according to the FIGO criteria. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-025-08145-w

