
The CTG is on and a recording is being generated. Decisions will soon be made and actions taken. In between these two is the act of CTG interpretation. If good decisions are to inform appropriate action, then CTG interpretation needs to be accurate. So, how good are clinicians at CTG interpretation?
Researchers have looked at this question. I’ve posted some of this research before (here). Engelhart and colleagues (2023) have recently published a superb and comprehensive review of all the published research (49 articles in total) about the reliability of CTG interpretation.
Researchers have used two main approaches to examine the accuracy of CTG interpretation. The first, and most commonly used, is called inter-observer (or inter-rater) reliability. To measure this you give a set of CTG traces to person A, and the same set to person B. Once they have interpreted the CTGs, you compare their interpretations to see whether they agree with each other or not. Engelhart and colleagues identified studies in which a total of 577 people assessed 6315 CTG tracings this way.
The other approach is to give a set of traces to person A, wait a while, and then give the same set of traces to person A again. You then compare their first interpretation with their second to see if they are consistent. This is intra-observer (or intra-rater) reliability. Engelhart and colleagues collected studies where at least 123 people (one study didn’t say how many there were) assessed 1170 tracings twice.
What did they find?
The common approach to literature reviewing is to pool the results from all the studies, reanalyse them, and come up with one set of numbers to describe the findings from them all. This is known as meta-analysis. It works best when the studies are done in very similar ways. That wasn’t the case with this body of research, so the authors decided not to do this. Instead, they described the range of findings from across the studies.
The research examined how reliably raters determined the baseline, the variability of the heart rate, whether accelerations or decelerations were present, what type of decelerations were present, and an overall category for the trace (normal, or some variety of not normal, depending on the terminology set out in the guideline being applied in the research). People were better at agreeing on what the baseline was than they were at agreeing on variability, decelerations, or the overall classification. They were also better at agreeing when they considered the trace to be normal than when it was not. The level of agreement (mostly measured using a statistic called the kappa coefficient) ranged from worse than pulling a choice out of a hat to near-perfect agreement.
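For readers who haven’t come across the kappa coefficient before, here is a minimal sketch in Python (using invented labels, not data from the review) of how Cohen’s kappa is calculated for two raters classifying the same set of traces. It compares the agreement the raters actually achieved with the agreement they would be expected to reach by chance alone, so a value of 1 means perfect agreement, 0 means no better than chance, and negative values mean worse than chance. The same calculation applies to intra-rater reliability if the two lists are the same person’s first and second readings.

```python
# Minimal sketch of Cohen's kappa for two raters. The trace labels below
# are invented for illustration only, not data from the Engelhart review.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same set of items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)

    # Observed agreement: proportion of traces both raters labelled the same.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected chance agreement, from each rater's label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_expected = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)

    # 1 = perfect agreement, 0 = chance-level, negative = worse than chance.
    return (p_observed - p_expected) / (1 - p_expected)

# Two hypothetical raters classifying the same ten CTG traces.
rater_a = ["normal", "normal", "suspicious", "pathological", "normal",
           "suspicious", "normal", "normal", "pathological", "suspicious"]
rater_b = ["normal", "suspicious", "suspicious", "normal", "normal",
           "suspicious", "normal", "suspicious", "pathological", "normal"]

print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")  # ~0.34: modest agreement
```

In this made-up example the two raters agree on six of the ten traces, but once chance agreement is accounted for the kappa is only about 0.34, which is why raw percentage agreement can flatter how consistent CTG interpretation really is.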
Intra-rater agreement was better than inter-rater agreement. This makes intuitive sense. I call that pattern the same thing more or less each time I see it, and you call it the same thing more or less each time you see it, but you and I may not agree on what to call that same heart rate pattern. Research has also looked at whether experience or profession had an impact on levels of agreement, and there were no clear answers here. Training sessions did increase the level of agreement in the before-and-after studies that have been done.
The research team also looked for research on the reliability of interpretation of intermittent auscultation, and found none.
What does this all mean?
The authors of this paper were pretty blunt when they summed up their advice on the basis of their findings, saying:
This implies that intrapartum CTG should be used with caution for clinical decision making given its questionable reliability. (p. 13)
I have two main thoughts when it comes to the evidence about the reliability (or otherwise) of CTG traces. The first relates to the concept of “fresh eyes” and the second is about who we hold responsible for the lack of agreement.
Why “fresh eyes” checks don’t make sense
The assumption behind the “fresh eyes” concept is that having another maternity professional interpret the same section of CTG trace as the professional providing direct care is a good thing. It is intended to reduce the possibility that someone will miss signs that suggest the fetus will benefit from prompt birth. There has been no research to establish that this is what happens when you get another person to look at the same trace. I suspect that it doesn’t do this at all.
In research settings, CTG interpretation was likely done during daylight hours, alone, and without interruption (something the authors of this paper acknowledged). In the clinical setting, what is more likely is that two or more people are standing in front of the central fetal monitoring system at 2 am, tossing the conversation back and forth, as one of them is interrupted to chart IV fluids for the woman in room 11 and another by the bed manager wanting to know how many women will need a ward bed on the morning shift. Unlike the research setting, CTG interpretation by more than one person is an inherently social process. And the social aspects that result in a decision about care are yet to be well considered in research.
Given what we know about inter-rater variability and human nature, there is no guarantee that a second or third person will come to the same conclusion about the trace as the first person did. Nor is there any guarantee that the second or third person’s interpretation leads to a more accurate assessment of fetal status. What is likely is that the interpretation of the CTG that is ultimately decided to be the “right” one relates to the relative position of each person in the hospital hierarchy and their ability to argue their corner. As this paper showed, there was no clear evidence that obstetricians or midwives on a higher pay grade were better at CTG interpretation than a newly qualified midwife. Yet deferring to seniority is typically what happens in practice.
Who should be held to account for the lack of agreement?
It is really easy, and common, to take the position that people are responsible for the lack of consistency in CTG interpretation. If they just tried harder, were a bit smarter, then everything would be alright! The evidence that training improves agreement helps drive this argument. What remains unclear is whether higher levels of agreement lead to better outcomes. All the people attending a CTG interpretation training course might have learned the same thing, but if that thing isn’t a valid indicator of fetal status then having high levels of agreement about it isn’t likely to be helpful.
When you consider that 49 articles published across the last 44 years fail to demonstrate consistency in interpretation, or consistency in the research findings about interpretation, it becomes clear that people are not the problem here. Low levels of agreement are a function of the CTG technology itself. We might as well be pulling a handful of runes out of a bag and using these to decide the best approach to labour management. If maternity professionals were honest with one another, and with the women we provide care for, about this, then different choices for fetal heart rate monitoring than the ones we currently see would be likely.
Reference
Engelhart, C. H., Brurberg, K. G., Aanstad, K. J., Pay, A. S. D., Kaasen, A., Blix, E., & Vanbelle, S. (2023). Reliability and agreement in intrapartum fetal heart rate monitoring interpretation: A systematic review. Acta Obstetricia et Gynecologica Scandinavica. Advance online publication. https://doi.org/10.1111/aogs.14591
Categories: CTG, EFM, New research
Tags: decision making, Fresh eyes, guidelines, interobserver variability, interpretation, intraobserver variability