On the Other Site …

I’ve taken the opportunity of a reboot to try to make smaller, more digestible chunks to highlight what I’m reading – and to post more often.

So, check out:

It’s a Substack, but I’m not trying to milk anyone for money – don’t worry about that!

New Year, New Me?

It’s been a long haul for “Emergency Medicine Literature of Note” – and, as AI becomes a greater portion of my professional life, correspondingly, the time shrinks for a long-form Emergency Medicine post.

So, in an experiment, I’ve created – and migrated to a more-modern platform – the “Medicine Minute” Substack.

The hope is, with short format thoughts, there can be more-frequent updates – and the longer “rants” can still have either a life here, or in ACEPNow.

Speaking of which, don’t forget:

Annals of Emergency Medicine Podcast

Annals of Emergency Medicine Journal Club

Modeling the Mottled Child: Evaluating a Pediatric Septic Shock Predictive Modeling Screening Tool: February 2025 Annals of Emergency Medicine Journal Club

Let a Million Monkeys With Typewriters Do Your Quality Measure Reporting: January 2025 Annals of Emergency Medicine Journal Club

Is Broader Better? Piperacillin/Tazobactam, Cefepime, and the Risk of Harm: December 2024 Annals of Emergency Medicine Journal Club

ACEPNow

The AI Will Literally See You Now

This AI study is a fun experiment claiming to replicate the clinical gestalt generated by a physician’s initial synthesis of visual information. The ability to rapidly assess the stability and acuity of a patient is part of every experienced clinician’s refined skills – and used as a pre-test anchor for application of further diagnostic and management reasoning.

So, can AI do the same thing?

Well, “yes” and “of course not”.

In this demonstration project, these authors set up a mobile phone video camera at the foot of patients’ beds in the emergency department. Patients were instructed to perform a series of simple tasks (touch your nose, answer questions, etc.) while being recorded. Then, AI models were trained on images from these videos to predict the likelihood of admission.

The authors performed four comparisons: AI video alone, AI video + triage information (vital signs, chief complaint, age), triage information alone, and the Emergency Severity Index (ESI). In this fun demonstration, all four models were basically terrible at predicting admission (AUROCs ~0.6-0.7). But the models incorporating video held their own, clearly outperforming ESI, and video + triage information was incrementally better than triage information alone.
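For those unfamiliar with the AUROC metric used here, it reduces to the probability a randomly chosen admitted patient receives a higher risk score than a randomly chosen discharged one. A minimal sketch, using invented scores rather than anything from the study:

```python
# Minimal AUROC illustration with invented scores -- not study data.
# AUROC equals the probability a random positive outranks a random negative.

def auroc(labels, scores):
    """Compute AUROC by pairwise comparison (ties count as half a win)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical "admission risk" scores: 1 = admitted, 0 = discharged
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
print(auroc(labels, scores))  # → 0.75
```

An AUROC in the 0.6-0.7 range, as reported, means the model ranks an admitted patient above a discharged one only modestly more often than a coin flip.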

There is very clearly nothing here suggesting this model is remotely clinically useful, or that it somehow parallels the cognitive processes of an experienced clinician. It is solely an academic exercise, though describing it as such ought not minimize the novelty of incorporating image analysis with other clinical information. As has previously been seen with other image analysis, AI models frequently trigger off image features unrelated to the clinical aspects of a case. The k-fold cross-validation used on their limited sample of 723 patients likely overfits their predictive model to their training data, leading to artificial inflation of performance. Then, “admission to hospital”, while operationally interesting, is a poor surrogate for immediate clinical needs and overall acuity. Finally, the authors also note several ethical and privacy challenges around video capture in clinical settings.
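For the code-inclined, the k-fold concern is easy to see in the index bookkeeping: every one of the 723 patients serves in a test fold of a model tuned on the rest of that same small sample. A toy sketch of the splitting, purely illustrative and not the authors’ pipeline:

```python
# Sketch of k-fold cross-validation index splitting (pure Python,
# illustrative only -- not the authors' actual pipeline).

def k_fold_indices(n, k):
    """Yield (train, test) index lists for k roughly equal folds."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, test

# With n = 723 patients and k = 5, every patient appears in exactly one
# test fold -- model selection and evaluation recycle the same limited
# sample, which can flatter performance estimates.
splits = list(k_fold_indices(723, 5))
assert len(splits) == 5
assert sum(len(test) for _, test in splits) == 723
```

With no held-out external cohort, any quirks of this single sample are baked into both training and evaluation.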

Regardless, a clever contribution to the AI clinical prediction literature.

“Hospitalization prediction from the emergency department using computer vision AI with short patient video clips”
https://www.nature.com/articles/s41746-024-01375-3

Getting Triggered By Errors in the Emergency Department

The emergency department is a place of risk and errors. Those who work in the ED are acutely aware of this, and it conjures up tremendous cognitive pressures on staff every shift.

Every ED clinician knows the most benign-appearing triage complaint may obfuscate lurking catastrophe. The vision changes that are actually an acute aortic dissection. A sore shoulder that is necrotizing fasciitis. The list goes on. If some are to be believed, hundreds of thousands are being killed each year by diagnostic errors in the ED. The reality is much lower, but still nontrivial.

But the net effect is that the ED becomes a focus for patient safety research. In modern parlance, “diagnostic errors” become “missed opportunities for diagnosis” (MODs), and well-meaning researchers are devising further methods to shine bright lights upon our inadequacies.

This most recent publication looks at “e-Triggers” – effectively, combinations of patient features and patient outcomes meant to retrospectively identify cohorts in which substantial numbers of patients can be found to have MODs. For example, in this paper, the authors use an “e-Trigger” modelled around posterior circulation stroke – in which the data warehouse is queried for elderly patients presenting with dizziness, with at least two cerebrovascular risk factors, who, after initial discharge from the ED, suffered a stroke within 30 days.
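Conceptually, the e-Trigger is just a boolean filter over the data warehouse, pairing an at-risk presentation with a subsequent bad outcome. A toy sketch – field names and thresholds are invented stand-ins, not the paper’s actual query:

```python
# Toy e-Trigger filter -- field names and thresholds are invented
# stand-ins for the paper's actual data-warehouse query.

def stroke_e_trigger(patient):
    """Flag ED visits matching the posterior-circulation-stroke trigger:
    older patient + dizziness complaint + >= 2 cerebrovascular risk
    factors + discharged from the ED + stroke within 30 days."""
    return (patient["age"] >= 65
            and "dizziness" in patient["chief_complaint"]
            and len(patient["cv_risk_factors"]) >= 2
            and patient["disposition"] == "discharged"
            and patient.get("stroke_within_30d", False))

visit = {
    "age": 72,
    "chief_complaint": "dizziness and nausea",
    "cv_risk_factors": ["hypertension", "atrial fibrillation"],
    "disposition": "discharged",
    "stroke_within_30d": True,
}
print(stroke_e_trigger(visit))  # True for this hypothetical record
```

Note the filter cannot fire until the 30-day outcome is known – which is exactly the patient-centered limitation discussed below.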

When the authors dredged 8M records from the Veterans Affairs system for this, they identified 203 such instances, and manually reviewed 100 of these using a structured framework to characterize any diagnostic error present. For this “stroke” example, 47 of the 100 patients reviewed were identified to have had MODs. Per the review of records, the most common missed opportunities stemmed from inadequate physical examination and insufficient ordering of diagnostic tests. Most of the patients reviewed suffered moderate or severe harms as a result of these MODs.

There is good news and bad news from this “e-Trigger” method shown here. The good news is primarily of interest to patient safety researchers, indicating this is probably a reasonable method to use for enriching populations for review to further describe the types of error occurring in specific clinical scenarios. This could lead to identification of generalizable knowledge gaps, cognitive biases, or system factors. It is also, probably, too unwieldy and labor intensive for routine punitive use targeting individual clinicians.

The bad news is primarily patient-centered. The fundamental nature of the e-Trigger structure requires a pairing of a cohort at risk and a subsequent unfortunate outcome. Thus, the harm has already reached the patient. It seems plausible suitably high-risk cohorts could be determined relatively contemporaneously, but the challenge would be finding a mechanism to detect a MOD with sufficient specificity to be deployable in clinical workflow. However, with the ability to potentially replace some previously human review steps with AI, this idea may be imminently achievable – watch this space!

“Implementation of Electronic Triggers to Identify Diagnostic Errors in Emergency Departments”
https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2827341

WOMAN-2: What Does the Robot Say?

Following on the success of Toy Story 2, Inside Out 2, and Avatar 2, we have WOMAN-2, yet another trial featuring emergency medicine’s third-favorite medication: tranexamic acid. However, where those sequels succeeded, WOMAN-2 is more like Miss Congeniality 2 – the one we’re not going to talk about again.

But, don’t take it from me – take it from the ChatGPT Agent I created!

The WOMAN-2 trial serves as a benchmark for evaluating interventions targeting postpartum hemorrhage (PPH) in women with anemia, but it underscores the challenges of translating promising pharmacological hypotheses into clinical practice. The study meticulously implemented randomization and blinding protocols, ensuring that biases from enrolment and allocation processes were minimized. Baseline characteristics were well-matched, providing a solid foundation for evaluating the intervention’s effect.

However, the trial’s design reveals critical limitations that may have compromised its ability to detect meaningful clinical benefits. The reliance on clinical judgment rather than standardized measurement tools for diagnosing PPH introduces variability, potentially diluting the observed treatment effect. Furthermore, the timing of tranexamic acid administration—after cord clamping—raises questions about its pharmacological plausibility in rapidly evolving hemorrhagic scenarios. Subgroup analyses suggest the intervention’s effect may differ in high-risk populations, such as those with antepartum hemorrhage, but the evidence is inconclusive.

The null findings for both primary and secondary outcomes should prompt reflection on trial methodology and clinical priorities. The absence of adverse events is reassuring, yet the lack of demonstrable efficacy highlights the complexity of addressing PPH in anemic populations. A focus on earlier interventions and systemic prevention of anemia may hold greater promise. Additionally, future trials should prioritize precision in outcome measurement and incorporate emerging technologies to minimize subjective assessment variability.

While the study’s funding sources were disclosed transparently, the authors’ broader involvement in tranexamic acid research warrants attention to ensure interpretations remain unbiased. Overall, the WOMAN-2 trial reinforces the urgency of optimizing anemia management and refining therapeutic strategies for PPH, underscoring the need for innovation in both clinical practice and research methodologies.

Well, there you have it – unbiased evaluation by the robot, featuring that bland robotic voice common to all its very average, very “correct” output. Interestingly, it can be trained and/or instructed to copy your writing “style”, and the output is grossly similar – but with an added layer of tryhard treacle slathered upon it.

In my brief experimentations with the Agent, it seems clear the feasible augmentation does not include writing – at least, enjoyable writing. It is superficially very competent at enumerating questions from a template, however, such as study population, primary outcomes, and specific sources of bias. For example, this agent actually executes the RoB 2 questionnaire on an RCT before using that output as the foundation for its summary paragraphs. Probably good enough to give an “at a glance” summarization, but not nearly sufficient to put the research into context.

Agent aside, we’re here because WOMAN-2 is the sequel, obviously, to WOMAN – a “positive” trial that was also “negative”. WOMAN was positive for the endpoint of post-partum hemorrhage and death due to bleeding, but negative for the patient-oriented outcome of overall mortality. Here in WOMAN-2, the small effect size previously seen in WOMAN has entirely vanished, leading to further questions. TXA seems most effective when given early – and subsequent trials “I’M Woman” and “WOMAN-3” will address these possibilities. The other possibility is that, as with gastrointestinal bleeding, certain clinical scenarios feature specific fibrinolytic activation pathways where the mild effect of TXA simply can’t move the needle.

So, nothing here changes what most of us do in the modern world – and those who have Bayesian ideas regarding the efficacy of TXA are likely going to keep using it in sub-Saharan Africa. If you are going to keep using TXA routinely, use it early and in the highest-risk populations – as the likelihood of a clinically meaningful benefit will otherwise disappear like a whisper in the wind.

“The effect of tranexamic acid on postpartum bleeding in women with moderate and severe anaemia (WOMAN-2): an international, randomised, double-blind, placebo-controlled trial”

https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(24)01749-5/fulltext

Get the “Human-Out-Of-The-Way”

It is clear LLMs have an uncanny ability to find associations between salient features, and to subsequently use those associations to generate accurate probabilistic lists of medical diagnoses. On top of that, they can take those diagnoses and use the same probabilistic functions to mimic the explanations seen in their training sets. A powerful tool – clearly an asset to patient care?

Testing its ability as a clinical augment, these researchers divvied up a collection of mostly internal medicine and emergency medicine attendings and residents. On one side, clinicians were asked to assess a set of clinical vignettes with only their usual reference tools. On the other side, clinicians were also allowed to access GPT-4. The assessment associated with the clinical vignettes asked participants to rank their top three potential diagnoses, to justify their thinking, and to make a plan for follow-up diagnostic evaluation.

At the end of the day – no advantage to GPT-4. Clinicians using standard reference materials scored a median of 74% per assessment while those with access to GPT-4 scored a median of 76%. Clinicians using the LLM were generally a few seconds faster.

So – perhaps a narrow win for clinical deployment of LLMs? A little faster and no worse? Lessons to be learned regarding how clinicians might integrate their use into care? A clue we need to develop better domain-specific LLM training for best results?

Tucked away in the text as almost an offhand remark, the authors also tested the LLM alone – and the LLM scored 92% without a “human-in-the-loop”. A crushing blow.

Curiously, much of the early discussion regarding mitigating the potential harms from LLMs relies upon “human-in-the-loop” vigilance – but instances may increasingly be found in which human involvement is deleterious. Implementation of various clinical augments might require more than comparing “baseline performance” versus “human-in-the-loop”, but also adding a further line of evaluation for “human-out-of-the-way” where the augmentation tool is relatively unhindered.

“Large Language Model Influence on Diagnostic Reasoning”
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825395

Publication Potpourri

Clearing the backlog of mildly interesting articles that will never get a full write-up – here’s a quick hit of the most interesting lucky 13!

“Lactated Ringer vs Normal Saline Solution During Sickle Cell Vaso-Occlusive Episodes”
A “target trial emulation” providing observational evidence supporting the superiority of lactated Ringer’s solution over normal saline for the resuscitation of patients being admitted with sickle cell crises. However, only a small fraction of patients actually received LR, and those who did received smaller amounts overall. Frankly, it’s hard to separate these observations from the general concern that euvolemic SCD patients are simply receiving far too much fluid during admission.
https://pubmed.ncbi.nlm.nih.gov/39250114/

“Empathy and clarity in GPT-4-Generated Emergency Department Discharge Letters”
The cold, unfeeling computer generates better feels than emergency clinicians. Bland treacle, no doubt, but generic bland treacle will beat the terse output of a distracted human any day. The question now is how to combine the two to create a better human, where suitable.
https://www.medrxiv.org/content/10.1101/2024.10.07.24315034v1

“Mitigating Hallucinations in Large Language Models: A Comparative Study of RAG-enhanced vs. Human-Generated Medical Templates”
Study number n-million demonstrating retrieval-augmented generation – that is to say, using a relatively sophisticated series of prompts to an LLM – improves output and reduces “hallucinations”. In this specific instance, the LLM was effectively mimicking the output of BMJ Best Practice templates.
https://www.medrxiv.org/content/10.1101/2024.09.27.24314506v1

“Towards Democratization of Subspeciality Medical Expertise”
This is one of the AIME demonstration projects, a Google DeepMind output in which an additional layer of “conversational” training has been superimposed upon the underlying model to improve responses to consultation-style questions. In this example, the conversational training is tuned to match the response style of genetic cardiology specialists from Stanford University – and the LLM content and style were arguably rated identically to those of the human specialists.
https://arxiv.org/abs/2410.03741

“The Limits of Clinician Vigilance as an AI Safety Bulwark”
A nice little commentary effectively articulating the elephant in the room: humans are terrible at “vigilance”. The present state of AI/LLM deployment in clinical medicine is replete with omissions and inaccuracies, and it’s not reasonable to simply trust clinicians to catch the mistakes. That said, the five suggested strategies to address vigilance seem … precarious.
https://jamanetwork.com/journals/jama/fullarticle/2816582

“Reviewer Experience Detecting and Judging Human Versus Artificial Intelligence Content: The Stroke Journal Essay Contest”
Who wins an essay competition on controversial topics in stroke and neurology – humans or LLMs? And can reviewers accurately guess which essays are human- versus LLM-generated? The answer, rather distressingly, is that reviewers mostly couldn’t distinguish between author types, and LLM composition quality was arguably higher than human.
https://pubmed.ncbi.nlm.nih.gov/39224979/

“Invasive Treatment Strategy for Older Patients with Myocardial Infarction”
The amusingly named “SENIOR-RITA” trial in which elderly patients with NSTEMI were randomized to an invasive strategy versus a medical management strategy. While this may seem odd to those in the U.S., frailty and baseline life-expectancy are typical considerations for acute care in other countries. In these frail elderly, invasive strategies reduced downstream non-fatal MI, but had no effect on cardiovascular death or the overall composite outcome.
https://pubmed.ncbi.nlm.nih.gov/39225274/

“Thrombolysis After Dabigatran Reversal for Acute Ischemic Stroke: A National Registry-Based Study and Meta-Analysis”
Just another bit from our neurology colleagues patting themselves on the back for doing backflips in order to give everyone thrombolysis. Registry data paired with a garbage-in-garbage-out systematic review and meta-analysis just amplifies the biases prevalent in the underlying practice culture.
https://pubmed.ncbi.nlm.nih.gov/39255429/

“Tenecteplase vs Alteplase for Patients With Acute Ischemic Stroke: The ORIGINAL Randomized Clinical Trial”
This trial makes a claim to be ORIGINAL, but at this point – there’s really no general question remaining whether tenecteplase is a valid alternative to alteplase. It is reasonable to test specific doses in various populations with a goal of minimizing harms, of course.
https://pubmed.ncbi.nlm.nih.gov/39264623/

“Technology-Supported Self-Triage Decision Making: A Mixed-Methods Study”
Can laypersons improve their self-triage decision-making with use of technology? This little preprint tests the Ada Health “symptom checker” app against interaction with a ChatGPT LLM and favors use of the symptom checker. Hardly rigorous enough to discard the chatbot as a possible tool, but certainly the LLM needs more prompt engineering and/or domain-specific training than just “out of the box”.
https://www.medrxiv.org/content/10.1101/2024.09.12.24313558v1

“Class I Recalls of Cardiovascular Devices Between 2013 and 2022: A Cross-Sectional Analysis”
A brief report looking at recalled cardiovascular devices is insufficient to make any broad conclusions, but certainly demonstrates the regulatory bar for approval is inadequate. Most did not require pre-market clinical testing, and those that did used surrogate or composite endpoints to support approval.
https://pubmed.ncbi.nlm.nih.gov/39284187/

“Effectiveness of Direct Admission Compared to Admission Through the Emergency Department: A Stepped-Wedge Cluster-Randomized Trial”
A bit of an odd question whether a patient with known requirement for admission needs to stop through the emergency department, or whether the patient can go straight to the ward. Unless a patient requires immediate resuscitation with the resources of an ED, it is very clearly appropriate for a patient to be directly admitted. That said, doing so requires new processes and practices – and this trial demonstrates such processes are feasible and safe (and almost certainly cheaper by avoiding ED billing!).
https://pubmed.ncbi.nlm.nih.gov/39301600/

“Restrictive vs Liberal Transfusion Strategy in Patients With Acute Brain Injury: The TRAIN Randomized Clinical Trial”
The optimal cut-off is not known, but in this trial, the threshold for transfusion was 9 g/dL – and patients randomized to the liberal transfusion strategy did better. An injured brain does not like hypotension and it does not like anemia.
https://pubmed.ncbi.nlm.nih.gov/39382241/

Let ChatGPT Guide Your Hand

This exploration of LLMs in the emergency department is a bit unique in its conceptualization. While most demonstrations of generative AI applied to the ED involve summarization of records, digital scribing, or composing discharge letters, this attempts clinical decision-support. That is to say, rather than attempting to streamline or deburden clinicians from some otherwise time-intensive task, the LLM here taps into its ability to act as a generalized prediction engine – and tries its hand at prescriptive recommendations.

Specifically, the LLM – here GPT-3.5T and GPT-4T – is asked:

  • Should this patient be admitted to the hospital?
  • Does this patient require radiologic investigations?
  • Does this patient require antibiotics?

Considering we’ve seen general LLMs perform admirably on various medical licensing examinations, ought not these tools be able to get the meat off the bone in real life?

Before even considering the results, there are multiple fundamental considerations taking this published exploration into the realm of curiosity rather than insightfulness:

  • This communication was submitted in Oct 2023 – meaning the LLMs used, while modern at the time, are debatably becoming obsolete. Likewise, the prompting methods are a bit simplistic and anachronistic – evidence has shown the advantage of carefully constructed retrieval-augmented prompts.
  • The LLM was fed solely physician clinical notes – specifically the “clinical history”, “examination”, and “assessment/plan”. The LLM was therefore generating responses based on, effectively, an isolated completed medical assessment of a patient. This method excludes other data present in the record (vital signs, laboratory results, etc.), while also relying upon finished human documentation for its “decision-support”.
  • The prompts – “should”/“does” – replicate the intent of the decision-support concept of the exploration, but not the retrospective nature of the content. Effectively, what ought to have been asked of the LLMs – and the clinician reviewers – was “did this patient get admitted to the hospital?” or “did this patient receive antibiotics?” It would be mildly interesting to shift the question away from a somewhat subjective value judgement to a bit of an intent-inference exercise.
  • The clinician reviewers – one resident physician and one attending physician – did not much agree (73-83% agreement) on admission, radiology, and antibiotic determinations. It becomes very difficult to evaluate any sort of predictive or prescriptive intervention when the “gold standard” is so diaphanous. There is truly no accepted “gold standard” for these sorts of questions, as individual clinician admission rates and variations in practice are exceedingly wide. This is evidenced by the general inaccuracy displayed by just these two clinicians, whose own individual accuracy ranged from 74-83%, on top of that poor agreement.

Now, after scratching the tip of the methodology and translation iceberg, the results: unusable.

GPT-4T, as to be expected, outperformed GPT-3.5T. But, regardless of the LLM prompted, there were clear patterns of inadequacy. Each LLM was quite sensitive in its prescription of admission or radiologic evaluation – but at the extreme sacrifice of specificity, with “false positives” nearly equalling the “true positives” in some cases. The reverse was true for antibiotic prescription, with a substantial drop in sensitivity, but improved specificity. For what it’s worth, of course, U.S. emergency departments are generally cesspools of aggressive empiric antibiotic coverage, driven by CMS regulations – so it may in fact be the LLM displaying astute clinical judgement, here. The “sepsis measure fallout gestapo” might disagree, however.
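The sensitivity/specificity trade-off described above is just confusion-matrix arithmetic; a quick sketch with invented counts (not figures from the paper) shows how a model that prescribes admission for nearly everyone looks:

```python
# Confusion-matrix arithmetic behind the trade-off described above.
# Counts are invented for illustration, not taken from the paper.

def sens_spec(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)

# A "recommend admission for nearly everyone" pattern: high sensitivity,
# but false positives approaching the true positives in number.
sens, spec = sens_spec(tp=90, fn=10, tn=120, fp=80)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
# prints "sensitivity=0.90, specificity=0.60"
```

Flip the counts (many false negatives, few false positives) and you get the antibiotic pattern instead: poor sensitivity, good specificity.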

This approach is not entirely hopeless, however. The increasing use of LLM digital scribes is likely to improve the early data available to such predictive or prescriptive models. Other structured clinical data collected by electronic systems may be incorporated. Likewise, there are other clinical notes of potential value, including nursing and triage documentation. I don’t find this to be a dead-end idea at all, but the limitations of this exploration don’t shed much light except to direct future efforts.

“Evaluating the use of large language models to provide clinical recommendations in the Emergency Department”
https://www.nature.com/articles/s41467-024-52415-1

All the Pregnant Men in the EHR

Electronic health records, data warehouses, and data “lakes” are treasured resources in this modern era of model training. Various applications of precision medicine, “digital twins”, and other predictive mimicries depend on having the cleanest, most-accurate data feasible.

One of these data sets is “All of Us”, maintained by the National Institutes of Health. Considering its wide use, the authors ask a very reasonable question: how accurate is the information contained within? Since it is not possible to individually verify the vast scope of clinical information as applied to each person included in the data set, these authors choose what ought to be a fairly reliable surrogate: with what frequency do male and female persons included in the data set have sex-discordant diagnoses?

The authors term their measure the “incongruence rate”, reflecting the frequency of sex-specific diagnoses incongruent with the recorded biological sex. The authors iteratively refined their list and sample set, ultimately settling on 167 sex-specific conditions where there ought to be very little ambiguity – the vast majority related to pregnancy and disorders of female genitalia.

Rather amazingly, their overall finding was an “incongruence rate” of 0.86% – meaning nearly 1 in 100 of these sex-specific diagnoses was found on a person of the incorrect biological sex. For example, out of 4,200 patients coded with a finding of testicular hypofunction, 44 (1.05%) were female. Or, out of 2,101 coded for a finding of prolapse of female genital organs, 21 (1%) were male. The authors also performed further analyses exploring whether cis- or transgender misidentification was affecting these findings, and actually note the incongruence rate rose to 0.96%.
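The arithmetic behind those percentages is straightforward; using just the two examples quoted above:

```python
# Incongruence-rate arithmetic, using the two examples quoted above.

def incongruence_rate(incongruent, total):
    """Percent of a diagnosis's codes appearing on the 'wrong' biological sex."""
    return 100 * incongruent / total

print(round(incongruence_rate(44, 4200), 2))  # testicular hypofunction → 1.05
print(round(incongruence_rate(21, 2101), 2))  # female genital prolapse → 1.0
```

Small per-diagnosis rates, but aggregated across 167 conditions and millions of records, they add up to the overall 0.86% figure.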

Specifics regarding limitations or flaws in this approach aside, the key insight is that of widespread inaccuracies within electronic health data – and systematic approaches to diagnostic incongruence may be useful methods for data cleansing.

“Navigating electronic health record accuracy by examination of sex incongruent conditions”
https://pubmed.ncbi.nlm.nih.gov/39254529/

When EHR Interventions Succeed … and Fail

This is a bit of a fascinating article with a great deal to unpack – and rightly published in a prominent journal.

The brief summary – this is a “pragmatic”, open-label, cluster-randomized trial in which a set of interventions designed to increase guideline-concordant care were rolled out via electronic health record tools. These interventions were further supported by “facilitators”, persons assigned to each practice in the intervention cohort to support uptake of the EHR tools. In this specific study, the underlying disease state was the triad of chronic kidney disease, hypertension, and type 2 diabetes. Each of these disease states has well-defined pathways for “optimal” therapy and escalation.

The most notable feature of this trial is the simple, negative topline result – rollout of this intervention had no reliably measurable effect on patient-oriented outcomes relating to disease progression or acute clinical deterioration. Delving below the surface provides a number of insights worthy of comment:

  • The authors could have easily made this a positive trial by having the primary outcome as change in guideline-concordant care, as many other trials have done. This is a lovely example of how surrogates for patient-oriented outcomes must always be critically appraised for the strength of their association.
  • The entire concept of this trial is likely passively traumatizing to many clinicians – being bludgeoned by electronic health record reminders and administrative nannying to increase compliance with some sort of “quality” standard. Despite all these investments, alerts, and nagging – patients did no better. As above, since many of these trials simply measure changes in behavior as their endpoints, it likely leaves many clinicians feeling sour seeing results like these where patients are no better off.
  • The care “bundle” and its lack of effect size is notable, although it ought to be noted the patient-oriented outcomes here for these chronic, life-long diseases are quite short-term. The external validity of findings demonstrated in clinical trials frequently falls short when generalized to the “real world”. The scope of the investment here and its lack of patient-oriented improvement is a reminder of the challenges in medicine regarding evidence of sufficient strength to reliably inform practice.

Not an Emergency Medicine article, per se, but certainly describes the sorts of pressures on clinical practice pervasive across specialties.

“Pragmatic Trial of Hospitalization Rate in Chronic Kidney Disease”
https://www.nejm.org/doi/full/10.1056/NEJMoa2311708