WOMAN-2: What Does the Robot Say?

Following on the success of Toy Story 2, Inside Out 2, and Avatar 2, we have WOMAN-2, yet another trial featuring emergency medicine’s third-favorite medication: tranexamic acid. However, where those sequels succeeded, WOMAN-2 is more like Miss Congeniality 2 – the one we’re not going to talk about again.

But, don’t take it from me – take it from the ChatGPT Agent I created!

The WOMAN-2 trial serves as a benchmark for evaluating interventions targeting postpartum hemorrhage (PPH) in women with anemia, but it underscores the challenges of translating promising pharmacological hypotheses into clinical practice. The study meticulously implemented randomization and blinding protocols, ensuring that biases from enrolment and allocation processes were minimized. Baseline characteristics were well-matched, providing a solid foundation for evaluating the intervention’s effect.

However, the trial’s design reveals critical limitations that may have compromised its ability to detect meaningful clinical benefits. The reliance on clinical judgment rather than standardized measurement tools for diagnosing PPH introduces variability, potentially diluting the observed treatment effect. Furthermore, the timing of tranexamic acid administration—after cord clamping—raises questions about its pharmacological plausibility in rapidly evolving hemorrhagic scenarios. Subgroup analyses suggest the intervention’s effect may differ in high-risk populations, such as those with antepartum hemorrhage, but the evidence is inconclusive.

The null findings for both primary and secondary outcomes should prompt reflection on trial methodology and clinical priorities. The absence of adverse events is reassuring, yet the lack of demonstrable efficacy highlights the complexity of addressing PPH in anemic populations. A focus on earlier interventions and systemic prevention of anemia may hold greater promise. Additionally, future trials should prioritize precision in outcome measurement and incorporate emerging technologies to minimize subjective assessment variability.

While the study’s funding sources were disclosed transparently, the authors’ broader involvement in tranexamic acid research warrants attention to ensure interpretations remain unbiased. Overall, the WOMAN-2 trial reinforces the urgency of optimizing anemia management and refining therapeutic strategies for PPH, underscoring the need for innovation in both clinical practice and research methodologies.

Well, there you have it – unbiased evaluation by the robot, featuring that bland robotic voice common to all its very average, very “correct” output. Interestingly, it can be trained and/or instructed to copy your writing “style”, and the output is grossly similar – but with an added layer of tryhard treacle slathered upon it.

In my brief experimentation with the Agent, it seems clear the feasible augmentation does not include writing – at least, not enjoyable writing. It is, however, superficially very competent at enumerating questions from a template, such as study population, primary outcomes, and specific sources of bias. For example, this agent actually executes the RoB 2 questionnaire on an RCT before using that output as the foundation for its summary paragraphs. Probably good enough to give an “at a glance” summary, but not nearly sufficient to put the research into context.
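For the curious, the underlying recipe is simple enough to sketch. Below is a minimal Python outline of that pipeline, assuming a generic chat-completion helper – `ask_llm` is a hypothetical stand-in, not a real API, and the prompts are illustrative only.

```python
# Minimal sketch of the appraisal pipeline described above: pose each RoB 2
# domain as a structured question against the trial text, then hand the
# domain judgments to a second prompt that drafts the summary paragraphs.

# The five domains of the Cochrane RoB 2 tool for randomized trials.
ROB2_DOMAINS = [
    "Bias arising from the randomization process",
    "Bias due to deviations from intended interventions",
    "Bias due to missing outcome data",
    "Bias in measurement of the outcome",
    "Bias in selection of the reported result",
]

def ask_llm(prompt: str) -> str:
    """Hypothetical chat-completion helper; swap in your provider's client."""
    raise NotImplementedError

def appraise_rct(trial_text: str) -> str:
    # Step 1: a structured judgment per RoB 2 domain.
    judgments = {
        domain: ask_llm(
            f"Assess this RCT for: {domain}. "
            "Answer low risk / some concerns / high risk, with one sentence "
            f"of rationale.\n\n{trial_text}"
        )
        for domain in ROB2_DOMAINS
    }
    # Step 2: the domain judgments become the skeleton of the narrative summary.
    findings = "\n".join(f"- {d}: {j}" for d, j in judgments.items())
    return ask_llm(
        "Write a brief structured appraisal of this trial, grounded in these "
        f"risk-of-bias judgments:\n{findings}"
    )
```

The template-first structure is precisely why the output reads as competent but contextless – the checklist gets answered; the judgment doesn't happen.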

Agent aside, we’re here because WOMAN-2 is the sequel, obviously, to WOMAN – a “positive” trial that was also “negative”. WOMAN was positive for the endpoint of death due to bleeding in post-partum hemorrhage, but negative for the patient-oriented outcome of overall mortality. Here in WOMAN-2, the small effect size previously seen in WOMAN has entirely vanished, leading to further questions. TXA seems most effective in instances in which it is given early – and the subsequent trials “I’M WOMAN” and “WOMAN-3” will address these possibilities. The other possibility is that, as with gastrointestinal bleeding, certain clinical scenarios feature specific fibrinolytic activation pathways where the mild effect of TXA simply can’t move the needle.

So, nothing here changes what most of us do in the modern world – and those with Bayesian priors regarding the efficacy of TXA are likely to keep using it in sub-Saharan Africa. If you are going to keep using TXA routinely, use it early and in the highest-risk populations – as the likelihood of a clinically meaningful benefit will otherwise disappear like a whisper in the wind.

“The effect of tranexamic acid on postpartum bleeding in women with moderate and severe anaemia (WOMAN-2): an international, randomised, double-blind, placebo-controlled trial”

https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(24)01749-5/fulltext

What are Children’s Lives Worth (to Save)?

This article regarding the cost of upgrading emergency departments to be “ready” for sick children has been bouncing around in the background since its publication, with some initial lay press coverage.

The general concept here is obviously laudable, and the culmination of at least a decade of hard work from these authors and the team involved – with the ultimate goal of ensuring each emergency department in the country is capable of caring for critically unwell children. This most recent publication builds upon their prior work to, effectively, estimate the overall cost (~$200M) of improving “pediatric readiness”. Using that total cost, they then translate the figure into humanizing terms: the cost per child it might require in each state, and the number of pediatric lives saved annually.

As can be readily gleaned from this sort of thought experiment, these estimates rely upon a nested set of foundational assumptions, each touched upon in prior work by this group. First, there are surveys of subsets of emergency departments regarding “readiness”, involving questions such as the presence of pediatric-sized airway devices and staff dedicated to the upkeep of various pediatric resources. These data are then combined with salary estimates to produce the institutional costs of readiness. Finally, a separate body of work estimates the odds ratios for poor outcomes at departments whose “readiness” sits in the lowest percentiles, and this is extrapolated to determine the lives saved.

Each of these pieces of work is reasonable in isolation, but stacked together they form a bit of a house of cards – the likelihood of imprecision is magnified as the estimates are combined, as the toy example below illustrates. For instance, how direct is the correlation between equipment-based “readiness” and pediatric survival if the ED in question is a critical access hospital with a low annual census? Is the cost of true clinical readiness just a part-time nursing FTE, or should it realistically include the costs of skill upkeep for nurses and physicians through education or simulation?
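As a concrete illustration of that compounding, here is a toy Monte Carlo in Python. Every input below except the ~$200M total cost is an invented placeholder – the point is only that modest uncertainty at each layer produces a very wide range in the bottom-line cost per life saved.

```python
# Toy illustration of the "house of cards": each layer of the estimate
# carries its own uncertainty, and the combined cost-per-life figure swings
# widely. All inputs except the ~$200M total cost are invented placeholders.
import random

def one_draw() -> float:
    total_cost = random.uniform(175e6, 225e6)       # spread around the ~$200M estimate
    excess_deaths = random.uniform(800, 1600)       # hypothetical deaths tied to low readiness
    mortality_reduction = random.uniform(0.1, 0.4)  # hypothetical effect of full readiness
    lives_saved = excess_deaths * mortality_reduction
    return total_cost / lives_saved                 # cost per life saved, this draw

draws = sorted(one_draw() for _ in range(10_000))
lo, mid, hi = draws[250], draws[5_000], draws[9_750]  # ~2.5th / 50th / 97.5th percentiles
print(f"cost per life saved: ${lo:,.0f} to ${hi:,.0f} (median ${mid:,.0f})")
```

Even with these generous made-up inputs, the interval spans several-fold – which is the trouble with multiplying point estimates drawn from separate studies.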

I suspect, overall, these data understate the costs and overstate the return on investment. That said, this is still critical work, even if only to describe the landscape and take a first stab at the scope of funding required. The best next step would likely be to target the specific profiles of institutions, and the specific types of investment, where spending is apt to have the highest yield – a first move on the journey towards universal readiness.

“State and National Estimates of the Cost of Emergency Department Pediatric Readiness and Lives Saved”
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825748

Get the “Human-Out-Of-The-Way”

It is clear LLMs have an uncanny ability to find associations between salient features, and to subsequently use those associations to generate accurate probabilistic lists of medical diagnoses. On top of that, they can take those diagnoses and use the same probabilistic machinery to mimic the explanations seen in their training sets. A powerful tool – clearly an asset to patient care?

Testing this ability as a clinical augment, these researchers divvied up a collection of mostly internal medicine and emergency medicine attendings and residents. On one side, clinicians were asked to assess a set of clinical vignettes with only their usual reference tools; on the other, clinicians were also allowed access to GPT-4. The assessment asked participants to rank their top three potential diagnoses, justify their thinking, and make a plan for follow-up diagnostic evaluation.

At the end of the day – no advantage to GPT-4. Clinicians using standard reference materials scored a median of 74% per assessment while those with access to GPT-4 scored a median of 76%. Clinicians using the LLM were generally a few seconds faster.

So – perhaps a narrow win for clinical deployment of LLMs? A little faster and no worse? Lessons to be learned regarding how clinicians might integrate their use into care? A clue we need to develop better domain-specific LLM training for best results?

Tucked away in the text, almost as an offhand remark, is the result of testing the LLM alone – it scored 92% without a “human-in-the-loop”. A crushing blow.

Curiously, much of the early discussion regarding mitigating the potential harms of LLMs relies upon “human-in-the-loop” vigilance – yet instances may increasingly be found in which the human involvement is deleterious. Implementation of various clinical augments might therefore require not just comparing “baseline performance” against “human-in-the-loop”, but also adding a further line of evaluation for “human-out-of-the-way”, in which the augmentation tool is relatively unhindered.
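In evaluation-harness terms, this just means registering the unassisted model as a first-class arm. A minimal sketch in Python – the arms, case set, and rubric scorer are placeholders of my own, not anything from the study:

```python
# Three-arm evaluation sketch: "human-out-of-the-way" sits alongside the
# usual baseline and human-in-the-loop arms, rather than being an afterthought.
from statistics import median
from typing import Callable

Arm = Callable[[str], str]  # takes a vignette, returns a diagnostic assessment

def evaluate(arms: dict[str, Arm], vignettes: list[str],
             score: Callable[[str, str], float]) -> dict[str, float]:
    """Median rubric score per arm across all vignettes."""
    return {
        name: median(score(vignette, solve(vignette)) for vignette in vignettes)
        for name, solve in arms.items()
    }

# Hypothetical usage:
# results = evaluate(
#     {
#         "baseline": clinician_with_references,
#         "human-in-the-loop": clinician_with_llm,
#         "human-out-of-the-way": llm_alone,  # the arm usually omitted
#     },
#     vignettes,
#     rubric_score,
# )
```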

“Large Language Model Influence on Diagnostic Reasoning”
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825395