Variation Exists! Outcomes Exist!

This little article has made the rounds, circulated primarily by those critiquing its many flaws. However, the underlying themes can still be valid, even when an article has limitations.

This is a “there is variation in emergency physician admitting practices” article. Literally every practicing physician working in a hospital environment knows there is a broad spectrum of skill, approach to acute illness, and level of risk-tolerance. These attributes manifest in different ways and, in emergency physicians, one manifestation is the differing likelihood that two clinicians would admit the same patient to the hospital.

In this respect, this descriptive study is basically fine. Over time, with a few exceptions, clinicians all see broadly similar distributions of patients. Thus, it is very reasonable for this study to estimate a 90th percentile admission rate for “chest pain” of around 56%, and a 10th percentile admission rate of around 32%. The underlying principle has face validity, even if the precise numbers do not.
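
For the arithmetically inclined, the descriptive exercise amounts to little more than the following – a minimal sketch in Python, assuming a hypothetical encounter-level table with physician_id and admitted columns, not the authors’ actual VA data or code:

    import pandas as pd

    # Hypothetical encounter-level data: one row per "chest pain" visit.
    encounters = pd.DataFrame({
        "physician_id": ["A", "A", "A", "B", "B", "C", "C", "C", "C"],
        "admitted":     [1, 0, 1, 0, 0, 1, 1, 0, 1],
    })

    # Per-physician admission rate for the complaint of interest.
    rates = encounters.groupby("physician_id")["admitted"].mean()

    # The spread of practice: 10th and 90th percentile admission rates.
    print(rates.quantile([0.10, 0.90]))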

The second part of the analysis involves the downstream outcomes after these patients are seen and/or admitted following their emergency department visit. The first point involves whether the subsequent inpatient stay was less than 24 hours, and the second point involves downstream short- and long-term mortality. The authors also tried to evaluate the frequency and outcomes of laboratory and radiology tests ordered by emergency physicians.

Without getting too granular into the data presented, the gross pattern is that clinicians with higher admission rates were also associated with higher likelihoods of <24 hour inpatient stays. This association was most prominent, unsurprisingly, in the cohort of patients with “chest pain”. Patterns were slightly less prominent, but still present, between higher rates of radiology and laboratory testing and subsequent admission.

The kicker from this study, and the mildly controversial portion, is where these authors tie this all back to the mortality data: no association between admission rate and mortality. The general implication vilifies those clinicians with higher rates of admission, as their behavior generates only short (read: unnecessary) admissions of no value (no mortality difference).

Everything here is almost assuredly imprecise and unable to be generalized outside the VA system involved. There are going to be issues with confounding, mis-coded data, and variation across sites. That said, the underlying principle here is probably true – some clinicians over-test, over-consult, and over-admit to no patient-oriented benefit.

However, what is to be done? Changing clinician behavior is fraught, and it is unclear whether reduced admission rates from the highest-admitting cohort would safely target only those whose admissions were unnecessary. Worse still, attempting to change behaviors in the U.S. involves more than patient-level considerations, but issues of health system and tort culture. The best path forward probably has little to do with specifically targeting individual clinicians, or even broad complaints like “chest pain”, but identifying the specific uncertainties upon which decisions are made. Then, evidence or tools may be generated to address the specific clinical questions giving rise to the variation.

“Variation in Emergency Department Physician Admitting Practices and Subsequent Mortality”
https://jamanetwork.com/journals/jamainternalmedicine/article-abstract/2828189

Mobile Stroke Unit Propaganda Writ Large

This is yet another one of those “Get With The Guidelines” stroke analyses, a retrospective dredge with massive imbalances between groups – followed by statistical adjustments capable of turning out whichever result suits an author list with a full, dense printed page of pharma and stroke technology conflicts of interest.

In that respect, the study is unremarkable. Patients with potential stroke who were transported by Mobile Stroke Units were more likely to be functionally independent at baseline and more likely to be transported to a comprehensive stroke center. Thus, patients transported by Mobile Stroke Unit were more likely to be ambulatory and functionally independent at hospital discharge. Everything between the intake and the output is just diversion.

Where it becomes further disagreeable is the accompanying editorial, written by two individuals who run Mobile Stroke Unit programs, arguing federal reimbursement ought to cover their pet projects. After a brief brush with the limitations of these data, they assert:

“it convincingly demonstrates through a large, representative, multicenter study that in real-world clinical practice, MSUs are associated with improved short-term patient outcomes”
… quite the over-glamourization of a secondary analysis of quality improvement registry data.

“the magnitude of benefit conferred by MSUs is comparable to that of other widely accepted acute stroke interventions, such as IVT in a 3-hour to 4.5-hour window and specialized stroke units”
… after multiple statistical adjustments of a grossly imbalanced cohort.

“this study demonstrates that MSUs not only benefit patients with AIS eligible for IVT, but also patients with AIS who are ineligible for IVT and patients with other forms of stroke”
… so, even if the MSU – whose mission in life is to provide tip-of-the-spear IVT – doesn’t provide acute treatment, it still confers benefit due to its soothing glow?

“This may be explained by faster imaging and blood pressure control in patients with intracerebral hemorrhage.”
… admission blood pressure for patients with SAH in this cohort was identical between MSU and EMS.

“this study rebuts concerns that by reaching and treating patients with suspected stroke earlier in their clinical course, MSUs could lead to unnecessary IVT treatments and higher rates of hemorrhagic complications. In fact, this study demonstrated the opposite: MSU care was associated with lower rates of stroke mimics”
… yes, as is the typical approach to coding these data, early administration of IVT virtually dictates a patient be coded as a stroke. Once a patient has received IVT, only strong evidence to the contrary permits consideration of alternative causes of transient neurologic dysfunction – a happy accident also precluding any sICH occurring in “stroke mimics”, because there are none. To wit: only 24 of 4,218 (0.56%) of all MSU responses were “stroke mimics”, whereas 2,114 of 104,466 (2.0%) of all EMS responses were. When all you have is a hammer, everything you see looks like a stroke.
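
A quick back-of-the-envelope check of those proportions, using only the figures quoted above (a sketch, not a re-analysis of the source data):

    # Stroke "mimic" counts as quoted above.
    msu_mimics, msu_total = 24, 4_218
    ems_mimics, ems_total = 2_114, 104_466

    msu_rate = msu_mimics / msu_total
    ems_rate = ems_mimics / ems_total

    print(f"MSU mimic rate: {msu_rate:.2%}")   # roughly 0.6%
    print(f"EMS mimic rate: {ems_rate:.2%}")   # roughly 2%
    print(f"Mimics coded ~{ems_rate / msu_rate:.1f}x more often after EMS transport")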

“Furthermore, for the broader population presenting with suspected stroke regardless of final diagnosis, the data suggest the potential for a lower risk of death.”
… again, this is magical thinking. As above, observing benefits outside the scope of an MSU’s capabilities ought to prompt reconsideration of the statistical adjustments, rather than plaudits.

These data are simply unsuited to support this sort of unabashed enthusiasm for MSUs. Rather than bolstering the authors’ argument for funding and reimbursement structures for these tools, the editorial’s biases shine through and diminish it. Regrettably, as per usual, guidelines and policy will be made by those sponsored to make the most persuasive contortion of the data, rather than the most accurate.

“Mobile Stroke Unit Management in Patients With Acute Ischemic Stroke Eligible for Intravenous Thrombolysis”
https://jamanetwork.com/journals/jamaneurology/fullarticle/2824954

“Mobile Stroke Units—Time for Legislation and Remuneration”
https://jamanetwork.com/journals/jamaneurology/fullarticle/2824955

The AI Will Literally See You Now

This AI study is a fun experiment claiming to replicate the clinical gestalt generated by a physician’s initial synthesis of visual information. The ability to rapidly assess the stability and acuity of a patient is part of every experienced clinician’s refined skill set – and is used as a pre-test anchor for further diagnostic and management reasoning.

So, can AI do the same thing?

Well, “yes” and “of course not”.

In this demonstration project, these authors set up a mobile phone video camera at the foot of patients’ beds in the emergency department. Patients were instructed to perform a series of simple tasks (touch your nose, answer questions, etc.) while being recorded. Then, AI models were trained off images from these videos to predict the likelihood of admission.

The authors performed four comparisons: AI video alone, AI video + triage information (vital signs, chief complaint, age), triage information alone, and the Emergency Severity Index (ESI). In this fun demonstration, all four models were basically terrible at predicting admission (AUROCs ~0.6-0.7). But the models incorporating video held their own, clearly outperforming ESI, and video + triage information was incrementally better than triage information alone.

There is very clearly nothing here suggesting this model is remotely clinically useful, or that it somehow parallels the cognitive processes of an experienced clinician. It is solely an academic exercise, though describing it as such ought not minimize the novelty of incorporating image analysis with other clinical information. As has been previously seen with other image analysis, AI models frequently trigger off image features unrelated to the clinical aspects of a case. The k-fold cross-validation used on their limited sample of 723 patients likely overfits the predictive model to the training data, artificially inflating performance. Then, “admission to hospital”, while operationally interesting, is a poor surrogate for immediate clinical needs and overall acuity. Finally, the authors also note several ethical and privacy challenges around video capture in clinical settings.
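
For context, the sort of evaluation described – k-fold cross-validated AUROC on a few hundred patients – looks roughly like the sketch below, with synthetic data and an ordinary classifier standing in for their video-based model (assumptions, not their pipeline). The discipline is keeping every fitted step inside the folds; small samples and any leakage outside the folds are what inflate apparent performance:

    # Synthetic stand-in for k-fold cross-validated AUROC on a 723-"patient"
    # sample; features, labels, and classifier are invented for illustration.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(723, 20))        # e.g. extracted video + triage features
    y = rng.integers(0, 2, size=723)      # admitted yes/no

    # Keep every fitted step (scaling, model) inside the folds; tuning or
    # feature selection performed outside the folds is a classic source of
    # optimistic AUROC estimates on samples this small.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    print(f"AUROC: {aucs.mean():.2f} +/- {aucs.std():.2f}")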

Regardless, a clever contribution to the AI clinical prediction literature.

“Hospitalization prediction from the emergency department using computer vision AI with short patient video clips”
https://www.nature.com/articles/s41746-024-01375-3

Getting Triggered By Errors in the Emergency Department

The emergency department is a place of risk and errors. Those who work in the ED are acutely aware of this, and it exerts tremendous cognitive pressure on staff every shift.

Every ED clinician knows the most benign-appearing triage complaint may obfuscate lurking catastrophe. The vision changes that are actually an acute aortic dissection. A sore shoulder that is necrotizing fasciitis. The list goes on. If some are to be believed, hundreds of thousands are being killed each year by diagnostic errors in the ED. The reality is much lower, but still nontrivial.

But the net effect is that the ED becomes a focus for patient safety research. In modern parlance, “diagnostic errors” become “missed opportunities for diagnosis” (MODs), and well-meaning researchers are devising further methods to shine bright lights upon our inadequacies.

This most recent publication looks at “e-Triggers” – effectively, combinations of patient features and patient outcomes meant to retrospectively identify cohorts in which substantial numbers of patients can be found to have MODs. For example, in this paper, the authors use an “e-Trigger” modelled around posterior circulation stroke – in which the data warehouse is queried for elderly patients presenting with dizziness, with at least two cerebrovascular risk factors, who, after initial discharge from the ED, suffered a stroke within 30 days.
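
To make the mechanics concrete, the “e-Trigger” is essentially a cohort query of this shape – a minimal pandas sketch with invented column names and thresholds, not the authors’ actual VA data warehouse logic:

    import pandas as pd

    # Hypothetical ED visit-level table; column names are invented for illustration.
    visits = pd.DataFrame({
        "visit_id":             [1, 2, 3],
        "age":                  [72, 45, 81],
        "chief_complaint":      ["dizziness", "dizziness", "chest pain"],
        "cv_risk_factor_count": [3, 1, 2],
        "disposition":          ["discharged", "discharged", "admitted"],
        "stroke_within_30d":    [True, False, False],
    })

    # The e-Trigger: elderly + dizziness + >=2 cerebrovascular risk factors,
    # discharged from the ED, then coded for stroke within 30 days.
    trigger_hits = visits[
        (visits["age"] >= 65)
        & visits["chief_complaint"].str.contains("dizz", case=False)
        & (visits["cv_risk_factor_count"] >= 2)
        & (visits["disposition"] == "discharged")
        & visits["stroke_within_30d"]
    ]
    print(trigger_hits)  # flagged visits then go on to structured manual review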

When the authors dredged 8M records from the Veterans Affairs system for this, they identified 203 such instances, and manually reviewed 100 of these using a structured framework to characterize any diagnostic error present. For this “stroke” example, 47 of the 100 patients reviewed were identified to have had MODs. Per the review of records, the most common missed opportunities stemmed from inadequate physical examination and insufficient ordering of diagnostic tests. Most of the patients reviewed suffered moderate or severe harm as a result of these MODs.

There is good news and bad news from this “e-Trigger” method shown here. The good news is primarily of interest to patient safety researchers, indicating this is probably a reasonable method to use for enriching populations for review to further describe the types of error occurring in specific clinical scenarios. This could lead to identification of generalizable knowledge gaps, cognitive biases, or system factors. It is also, probably, too unwieldy and labor intensive for routine punitive use targeting individual clinicians.

The bad news is primarily patient-centered. The fundamental nature of the e-Trigger structure requires a pairing of a cohort at risk and a subsequent unfortunate outcome. Thus, the harm has already reached the patient. It seems plausible suitably high-risk cohorts could be determined relatively contemporaneously, but the challenge would be finding a mechanism to detect a MOD with sufficient specificity to be deployable in clinical workflow. However, with the ability to potentially replace some previously human review steps with AI, this idea may be imminently achievable – watch this space!

“Implementation of Electronic Triggers to Identify Diagnostic Errors in Emergency Departments”
https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2827341

WOMAN-2: What Does the Robot Say?

Following on the success of Toy Story 2, Inside Out 2, and Avatar 2, we have WOMAN-2, yet another trial featuring emergency medicine’s third-favorite medication: tranexamic acid. However, where those sequels succeeded, WOMAN-2 is more like Miss Congeniality 2 – the one we’re not going to talk about again.

But, don’t take it from me – take it from the ChatGPT Agent I created!

The WOMAN-2 trial serves as a benchmark for evaluating interventions targeting postpartum hemorrhage (PPH) in women with anemia, but it underscores the challenges of translating promising pharmacological hypotheses into clinical practice. The study meticulously implemented randomization and blinding protocols, ensuring that biases from enrolment and allocation processes were minimized. Baseline characteristics were well-matched, providing a solid foundation for evaluating the intervention’s effect.

However, the trial’s design reveals critical limitations that may have compromised its ability to detect meaningful clinical benefits. The reliance on clinical judgment rather than standardized measurement tools for diagnosing PPH introduces variability, potentially diluting the observed treatment effect. Furthermore, the timing of tranexamic acid administration—after cord clamping—raises questions about its pharmacological plausibility in rapidly evolving hemorrhagic scenarios. Subgroup analyses suggest the intervention’s effect may differ in high-risk populations, such as those with antepartum hemorrhage, but the evidence is inconclusive.

The null findings for both primary and secondary outcomes should prompt reflection on trial methodology and clinical priorities. The absence of adverse events is reassuring, yet the lack of demonstrable efficacy highlights the complexity of addressing PPH in anemic populations. A focus on earlier interventions and systemic prevention of anemia may hold greater promise. Additionally, future trials should prioritize precision in outcome measurement and incorporate emerging technologies to minimize subjective assessment variability.

While the study’s funding sources were disclosed transparently, the authors’ broader involvement in tranexamic acid research warrants attention to ensure interpretations remain unbiased. Overall, the WOMAN-2 trial reinforces the urgency of optimizing anemia management and refining therapeutic strategies for PPH, underscoring the need for innovation in both clinical practice and research methodologies.

Well, there you have it – unbiased evaluation by the robot, featuring that bland robotic voice common to all its very average, very “correct” output. Interestingly, it can be trained and/or instructed to copy your writing “style”, and the output is grossly similar – but with an added layer of tryhard treacle slathered upon it.

In my brief experimentations with the Agent, it seems clear the feasible augmentation does not include writing – at least, enjoyable writing. It is superficially very competent at enumerating questions from a template, however, such as study population, primary outcomes, and specific sources of bias. For example, this agent actually executes the RoB 2 questionnaire on an RCT before using that output as the foundation for its summary paragraphs. Probably good enough to give an “at a glance” summarization, but not nearly sufficient to put the research into context.
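
For reference, the template-enumeration behaviour amounts to stepping through a structure of this sort – the usual framing questions plus the standard RoB 2 domains. This is an illustrative sketch, not the agent’s actual instructions:

    # Sketch of a structured appraisal template an agent might walk through
    # before summarizing; the wording is illustrative, not the actual prompts.
    appraisal_template = {
        "framing": [
            "What is the study population and setting?",
            "What are the primary and key secondary outcomes?",
            "What is the intervention and comparator?",
        ],
        "rob2_domains": [
            "Bias arising from the randomization process",
            "Bias due to deviations from intended interventions",
            "Bias due to missing outcome data",
            "Bias in measurement of the outcome",
            "Bias in selection of the reported result",
        ],
    }

    for section, questions in appraisal_template.items():
        print(section)
        for q in questions:
            print(" -", q)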

Agent aside, we’re here because WOMAN-2 is the sequel, obviously, to WOMAN – a “positive” trial that was also “negative”. WOMAN was positive for the endpoint of death due to bleeding, but negative for the patient-oriented outcome of overall mortality. Here in WOMAN-2, the small effect size previously seen in WOMAN has entirely vanished, leading to further questions. Where TXA seems most effective is when it is given early – and subsequent trials “I’M WOMAN” and “WOMAN-3” will address these possibilities. The other possibility is that, as with gastrointestinal bleeding, certain clinical scenarios feature specific fibrinolytic activation pathways where the mild effect of TXA simply can’t move the needle.

So, nothing here changes what most of us do in the modern world – and those who have Bayesian ideas regarding the efficacy of TXA are likely going to keep using it in sub-Saharan Africa. If you are going to keep using TXA routinely, use it early and in the highest-risk populations – as the likelihood of a clinically meaningful benefit will otherwise disappear like a whisper in the wind.

“The effect of tranexamic acid on postpartum bleeding in women with moderate and severe anaemia (WOMAN-2): an international, randomised, double-blind, placebo-controlled trial”

https://www.thelancet.com/journals/lancet/article/PIIS0140-6736(24)01749-5/fulltext

What are Children’s Lives Worth (to Save)?

This article regarding the cost of upgrading emergency departments to be “ready” for sick children has been bouncing around in the background since its publication, with some initial lay press coverage.

The general concept here is obviously laudable and the culmination of at least a decade of hard work from these authors and the team involved – with the ultimate goal of ensuring each emergency department in the country is capable of caring for critically unwell children. This most recent publication builds upon their prior work to, effectively, estimate the overall cost (~$200M) of improving “pediatric readiness”. Using that total cost, they then translate it into humanizing terms: the cost per child in different states, and the number of pediatric lives saved annually.

As can be readily gleaned from this sort of thought experiment, these estimates rely upon a nested set of foundational assumptions, all of which are touched upon by prior work from this group. There are surveys of subsets of emergency departments regarding “readiness”, involving questions such as the presence of pediatric-sized airway devices and staff dedicated to the upkeep of various pediatric resources. These data and salary estimates are then used to come up with the institutional costs of readiness. Another set of work looks at the odds ratios for poor outcomes at departments whose “readiness” is in the lowest percentiles, and this work is extrapolated to determine the lives saved.

Each of these pieces of work is reasonable in isolation, but stacked together they form a bit of a house of cards, with the likelihood of imprecision magnified as the estimates are combined. For example, how direct is the correlation between “readiness” based on certain equipment and pediatric survival, if the ED in question is a critical access hospital with a low annual census? Is the cost of true clinical readiness just a part-time nursing FTE, or should it realistically include the costs of skill upkeep for nurses and physicians through education or simulation?

I suspect, overall, these data understate the costs and overstate the return on investment. That said, this is still critical work even just to describe the landscape and take a stab at the scope of funding required. Likely, the best next step would be to target specific profiles of institutions, and specific types of investment, where such investment is likely to have the highest yield – as a first step on the journey towards universal readiness.

“State and National Estimates of the Cost of Emergency Department Pediatric Readiness and Lives Saved”
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825748

Get the “Human-Out-Of-The-Way”

It is clear LLMs have an uncanny ability to find associations between salient features, and to subsequently use those associations to generate accurate probabilistic lists of medical diagnoses. On top of that, they can take those diagnoses and use the same probabilistic functions to mimic the explanations seen in their training sets. A powerful tool – clearly an asset to patient care?

Testing its ability as a clinical augment, these researchers divvied up a collection of mostly internal medicine and emergency medicine attendings and residents. On one side, clinicians were asked to assess a set of clinical vignettes with only their usual reference tools. On the other side, clinicians were also allowed to access GPT-4. The assessment associated with the clinical vignettes asked participants to rank their top three potential diagnoses, to justify their thinking, and to make a plan for follow-up diagnostic evaluation.

At the end of the day – no advantage to GPT-4. Clinicians using standard reference materials scored a median of 74% per assessment while those with access to GPT-4 scored a median of 76%. Clinicians using the LLM were generally a few seconds faster.

So – perhaps a narrow win for clinical deployment of LLMs? A little faster and no worse? Lessons to be learned regarding how clinicians might integrate their use into care? A clue we need to develop better domain-specific LLM training for best results?

Tucked away in the text as almost an offhand remark, the authors also tested the LLM alone – and the LLM scored 92% without a “human-in-the-loop”. A crushing blow.

Curiously, much of the early discussion regarding mitigating the potential harms from LLMs relies upon “human-in-the-loop” vigilance – but instances may increasingly be found in which human involvement is deleterious. Implementation of various clinical augments might require not only comparing “baseline performance” versus “human-in-the-loop”, but also adding a further line of evaluation for “human-out-of-the-way”, where the augmentation tool is relatively unhindered.

“Large Language Model Influence on Diagnostic Reasoning”
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825395

The Hand that Feeds the Hand that Feeds the Hand that ….

Pharmaceutical development is all about the blockbuster drug. Many of our brightest minds are research scientists and bioinformaticians working to translate in vitro discoveries into improvements in the lives of humankind.

Many of our brightest minds are also working to ensure, even if their drug candidates are – say – a little flawed, there’s a clinical trial design, trial implementation, and post-approval marketing plan to maximize return to shareholders.

A critical portion of this chain of survival is publication – the higher the impact of the journal, the better. One of the gatekeepers to publication remains peer review, a key step in ensuring the integrity, transparency, and reproducibility of science. Naturally, this step would be independent and untainted by bias, a vigilant final guardian protecting the public.

It would … wouldn’t it?

This brief report from JAMA finds it would be absurd even to imagine such a fanciful state of affairs. Evaluating lists of peer reviewers from The BMJ, JAMA, The Lancet, and The New England Journal of Medicine from 2022, the authors analyzed 1,962 U.S.-based physician reviewers. The Open Payments database indicated 58.9% of these reviewers had received payments from industry, totalling USD$1.06 billion between 2020 and 2022. The vast majority – $1.01 billion – were payments supporting research activities, with a mere $60M going towards such things as consulting fees, speaker fees, and honoraria. Male reviewers and those in medical and surgical specialties, rather than primary care or hospital-based specialties, were the dominant recipients of said payments.

While these data are not meant to illuminate some sort of dark money ecosystem, it is clear the “peers” doing the reviews are playing at the same game. There is going to be an obvious bias towards allowing publication of content and spin consistent with the output the reviewers themselves would anticipate using in their own work. A receptive audience, if you would.

Just another happy reminder how so much of our medical practice is swept along in a current powered by many moneyed forces at work.

“Payments by Drug and Medical Device Manufacturers to US Peer Reviewers of Major Medical Journals”
https://jamanetwork.com/journals/jama/article-abstract/2824834

Publication Potpourri

Clearing the backlog of mildly interesting articles that will never get a full write-up – here’s a quick hit of the most interesting lucky 13!

“Lactated Ringer vs Normal Saline Solution During Sickle Cell Vaso-Occlusive Episodes”
A “target trial emulation” providing observational evidence supporting the superiority of lactated Ringer’s solution over normal saline for the resuscitation of patients admitted with sickle cell crises. However, only a small fraction of patients actually received LR, and those who did received smaller amounts overall. Frankly, it’s hard to separate these observations from the general concern that euvolemic SCD patients are simply receiving far too much fluid during admission.
https://pubmed.ncbi.nlm.nih.gov/39250114/

“Empathy and clarity in GPT-4-Generated Emergency Department Discharge Letters”
The cold, unfeeling computer generates better feels than emergency clinicians. Bland treacle, no doubt, but generic bland treacle will beat the terse output of a distracted human any day. The question now is how to combine the two to create a better human, where suitable.
https://www.medrxiv.org/content/10.1101/2024.10.07.24315034v1

“Mitigating Hallucinations in Large Language Models: A Comparative Study of RAG-enhanced vs. Human-Generated Medical Templates”
Study number n-million demonstrating retrieval-augmented generation – that is to say, grounding a relatively sophisticated series of prompts to an LLM with retrieved reference material – improves output and reduces “hallucinations”. In this specific instance, the LLM was effectively mimicking the output of BMJ Best Practice templates.
https://www.medrxiv.org/content/10.1101/2024.09.27.24314506v1

“Towards Democratization of Subspeciality Medical Expertise”
This is one of the AMIE demonstration projects, a Google DeepMind output in which an additional layer of “conversational” training has been superimposed upon the underlying model to improve responses to consultation-style questions. In this example, the conversational training is tuned to match the response style of genetic cardiology specialists from Stanford University – and the LLM content and style were arguably rated identically to the human specialists.
https://arxiv.org/abs/2410.03741

“The Limits of Clinician Vigilance as an AI Safety Bulwark”
A nice little commentary effectively articulating the elephant in the room: humans are terrible at “vigilance”. The present state of AI/LLM deployment in clinical medicine is replete with omissions and inaccuracies, and it’s not reasonable to simply trust clinicians to catch the mistakes. That said, the five suggested strategies to address vigilance seem … precarious.
https://jamanetwork.com/journals/jama/fullarticle/2816582

“Reviewer Experience Detecting and Judging Human Versus Artificial Intelligence Content: The Stroke Journal Essay Contest”
Who wins an essay competition on controversial topics in stroke and neurology – humans or LLMs? And can reviewers accurately guess which essays are human- versus LLM-generated? The answer, rather distressingly, is that reviewers mostly couldn’t distinguish between author types, and LLM composition quality was arguably higher than human.
https://pubmed.ncbi.nlm.nih.gov/39224979/

“Invasive Treatment Strategy for Older Patients with Myocardial Infarction”
The amusingly named “SENIOR-RITA” trial, in which elderly patients with NSTEMI were randomized to an invasive strategy versus a medical management strategy. While this may seem odd to those in the U.S., frailty and baseline life-expectancy are typical considerations for acute care in other countries. In these frail elderly patients, the invasive strategy reduced downstream non-fatal MI, but had no effect on cardiovascular death or the overall composite outcome.
https://pubmed.ncbi.nlm.nih.gov/39225274/

“Thrombolysis After Dabigatran Reversal for Acute Ischemic Stroke: A National Registry-Based Study and Meta-Analysis”
Just another bit from our neurology colleagues patting themselves on the back for doing backflips in order to give everyone thrombolysis. Registry data paired with a garbage-in-garbage-out systematic review and meta-analysis just amplifies the biases prevalent in the underlying practice culture.
https://pubmed.ncbi.nlm.nih.gov/39255429/

“Tenecteplase vs Alteplase for Patients With Acute Ischemic Stroke: The ORIGINAL Randomized Clinical Trial”
This trial makes a claim to be ORIGINAL, but at this point – there’s really no general question remaining whether tenecteplase is a valid alternative to alteplase. It is reasonable to test specific doses in various populations with a goal of minimizing harms, of course.
https://pubmed.ncbi.nlm.nih.gov/39264623/

“Technology-Supported Self-Triage Decision Making: A Mixed-Methods Study”
Can laypersons improve their self-triage decision-making with use of technology? This little preprint tests the Ada Health “symptom checker” app against interaction with a ChatGPT LLM, and favors use of the symptom checker. Hardly rigorous enough to discard the chatbot as a possible tool, but certainly the LLM needs more prompt engineering and/or domain-specific training than just “out of the box”.
https://www.medrxiv.org/content/10.1101/2024.09.12.24313558v1

“Class I Recalls of Cardiovascular Devices Between 2013 and 2022: A Cross-Sectional Analysis”
A brief report looking at recalled cardiovascular devices is insufficient to make any broad conclusions, but certainly demonstrates the regulatory bar for approval is inadequate. Most did not require pre-market clinical testing, and those that did used surrogate or composite endpoints to support approval.
https://pubmed.ncbi.nlm.nih.gov/39284187/

“Effectiveness of Direct Admission Compared to Admission Through the Emergency Department: A Stepped-Wedge Cluster-Randomized Trial”
A bit of an odd question: whether a patient with a known requirement for admission needs to stop through the emergency department, or whether the patient can go straight to the ward. Unless a patient requires immediate resuscitation with the resources of an ED, it is very clearly appropriate for a patient to be directly admitted. That said, doing so requires new processes and practices – and this trial demonstrates such processes are feasible and safe (and, by avoiding ED billing, almost certainly cheaper!)
https://pubmed.ncbi.nlm.nih.gov/39301600/

“Restrictive vs Liberal Transfusion Strategy in Patients With Acute Brain Injury: The TRAIN Randomized Clinical Trial”
The optimal cut-off is not known, but in this trial, the threshold for transfusion was 9 g/dL – and patients randomized to the liberal transfusion strategy did better. An injured brain does not like hypotension and it does not like anemia.
https://pubmed.ncbi.nlm.nih.gov/39382241/

Let ChatGPT Guide Your Hand

This exploration of LLMs in the emergency department is a bit unique in its conceptualization. While most demonstrations of generative AI applied to the ED involve summarization of records, digital scribing, or composing discharge letters, this attempts clinical decision-support. That is to say, rather than attempting to streamline or deburden clinicians from some otherwise time-intensive task, the LLM here taps into its ability to act as a generalized prediction engine – and tries its hand at prescriptive recommendations.

Specifically, the LLM – here GPT-3.5T and GPT-4T – is asked (a rough prompt sketch follows the list):

  • Should this patient be admitted to the hospital?
  • Does this patient require radiologic investigations?
  • Does this patient require antibiotics?
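
As a sketch of how such prompting might be wired up against a chat-completion API, for the sake of concreteness – the model name, system message, and note excerpt below are placeholders, not the authors’ actual prompts or code:

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is available in the environment

    # Placeholder excerpt standing in for the "clinical history", "examination",
    # and "assessment/plan" sections fed to the model.
    note_excerpt = "..."

    questions = [
        "Should this patient be admitted to the hospital?",
        "Does this patient require radiologic investigations?",
        "Does this patient require antibiotics?",
    ]

    for question in questions:
        response = client.chat.completions.create(
            model="gpt-4-turbo",  # placeholder; the study used GPT-3.5T and GPT-4T
            messages=[
                {"role": "system",
                 "content": "You are assisting with emergency department "
                            "disposition decisions. Answer yes or no, with a "
                            "brief justification."},
                {"role": "user", "content": f"{note_excerpt}\n\n{question}"},
            ],
        )
        print(question, "->", response.choices[0].message.content)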

Considering we’ve seen general LLMs perform admirably on various medical licensing examinations, ought not these tools be able to get the meat off the bone in real life?

Before even considering the results, there are multiple fundamental considerations taking this published exploration into the realm of curiosity rather than insightfulness:

  • This communication was submitted in Oct 2023 – meaning the LLMs used, while modern at the time, are debatably becoming obsolete. Likewise, the prompting methods are a bit simplistic and anachronistic – evidence has shown advantages from carefully constructed retrieval-augmented prompting.
  • The LLM was fed solely physician clinical notes – specifically the “clinical history”, “examination”, and “assessment/plan”. The LLM was therefore generating responses based on, effectively, an isolated completed medical assessment of a patient. This method excludes other data present in the record (vital signs, laboratory results, etc.), while also relying upon finished human documentation for its “decision-support”.
  • The prompts – “should”/”does” – replicate the intent of the decision-support concept of the exploration, but not the retrospective nature of the content. Effectively, what ought to have been asked of the LLMs – and the clinician reviewers – was “did this patient get admitted to the hospital?” or “did this patient receive antibiotics?” It would be mildly interesting to shift the question away from a somewhat subjective value judgement to a bit of an intent inference exercise.
  • The clinician reviewers – one resident physician and one attending physician – did not much agree (73-83% agreement) on admission, radiology, and antibiotic determinations. It becomes very difficult to evaluate any sort of predictive or prescriptive intervention when the “gold standard” is so diaphanous. There is truly no accepted “gold standard” for these sorts of questions, as individual clinician admission rates and variations in practice are exceedingly wide. This is evidenced by the general inaccuracy displayed by just these two clinicians, whose own individual accuracy ranged from 74-83%, on top of that poor agreement.
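
To make the “diaphanous gold standard” point concrete, a small sketch contrasting raw percent agreement with chance-corrected agreement, using invented adjudications rather than the study’s data:

    from sklearn.metrics import cohen_kappa_score

    # Hypothetical admit/discharge judgements from two reviewers on ten charts.
    resident  = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
    attending = [1, 0, 0, 1, 0, 1, 1, 1, 1, 1]

    raw_agreement = sum(a == b for a, b in zip(resident, attending)) / len(resident)
    kappa = cohen_kappa_score(resident, attending)

    # Raw agreement flatters the comparison when one label dominates;
    # chance-corrected kappa is usually considerably less impressive.
    print(f"raw agreement: {raw_agreement:.0%}, Cohen's kappa: {kappa:.2f}")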

Now, after scratching the tip of the methodology and translation iceberg, the results: unusable.

GPT-4T, as expected, outperformed GPT-3.5T. But, regardless of the LLM prompted, there were clear patterns of inadequacy. Each LLM was quite sensitive in its prescription of admission or radiologic evaluation – but at an extreme sacrifice of specificity, with “false positives” nearly equalling “true positives” in some cases. The reverse was true for antibiotic prescription, with a substantial drop in sensitivity but improved specificity. For what it’s worth, of course, U.S. emergency departments are generally cesspools of aggressive empiric antibiotic coverage, driven by CMS regulations – so it may in fact be the LLM displaying astute clinical judgement here. The “sepsis measure fallout gestapo” might disagree, however.

I can envision this approach is not entirely hopeless. The increasing use of LLM digital scribes is likely to improve the early data available to such predictive or prescriptive models. Other structured clinical data collected by electronic systems may be incorporated. Likewise, there are other clinical notes of potential value, including nursing and triage documentation. This is hardly a dead-end idea, but the limitations of this exploration don’t shed much light except to direct future efforts.

“Evaluating the use of large language models to provide clinical recommendations in the Emergency Department”
https://www.nature.com/articles/s41467-024-52415-1