Get the “Human-Out-Of-The-Way”

It is clear LLMs have an uncanny ability to find associations between salient features, and to subsequently use those associations to generate accurate probabilistic lists of medical diagnoses. On top of that, they can take those diagnoses and use the same probabilistic functions to mimic the explanations seen in their training sets. A powerful tool – clearly an asset to patient care?

To test the LLM’s ability as a clinical augment, these researchers divvied up a collection of mostly internal medicine and emergency medicine attendings and residents. On one side, clinicians were asked to assess a set of clinical vignettes with only their usual reference tools. On the other side, clinicians were also allowed access to GPT-4. The assessment associated with the clinical vignettes asked participants to rank their top three potential diagnoses, to justify their thinking, and to make a plan for follow-up diagnostic evaluation.

At the end of the day – no advantage to GPT-4. Clinicians using standard reference materials scored a median of 74% per assessment while those with access to GPT-4 scored a median of 76%. Clinicians using the LLM were generally a few seconds faster.

So – perhaps a narrow win for clinical deployment of LLMs? A little faster and no worse? Lessons to be learned regarding how clinicians might integrate their use into care? A clue we need to develop better domain-specific LLM training for best results?

Tucked away in the text as almost an offhand remark, the authors also tested the LLM alone – and the LLM scored 92% without a “human-in-the-loop”. A crushing blow.

Curiously, much of the early discussion regarding mitigating the potential harms from LLMs relies upon “human-in-the-loop” vigilance – but instances may increasingly be found in which human involvement is deleterious. Implementation of various clinical augments might require not only comparing “baseline performance” versus “human-in-the-loop”, but also adding a further line of evaluation for “human-out-of-the-way”, in which the augmentation tool is left relatively unhindered.

“Large Language Model Influence on Diagnostic Reasoning”
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825395

Publication Potpourri

Clearing the backlog of mildly interesting articles that will never get a full write-up – here’s a quick hit of the most interesting lucky 13!

“Lactated Ringer vs Normal Saline Solution During Sickle Cell Vaso-Occlusive Episodes”
A “target trial emulation” providing observational evidence supporting the superiority of lactated Ringer’s solution over normal saline for the resuscitation of patients being admitted with sickle cell crises. However, only a small fraction of patients actually received LR, and those who did received smaller volumes overall. Frankly, it’s hard to separate these observations from the general concern that euvolemic SCD patients are simply receiving far too much fluid during admission.
https://pubmed.ncbi.nlm.nih.gov/39250114/

“Empathy and clarity in GPT-4-Generated Emergency Department Discharge Letters”
The cold, unfeeling computer generates better feels than emergency clinicians. Bland treacle, no doubt, but generic bland treacle will beat the terse output of a distracted human any day. The question now is how to combine the two to create a better human, where suitable.
https://www.medrxiv.org/content/10.1101/2024.10.07.24315034v1

“Mitigating Hallucinations in Large Language Models: A Comparative Study of RAG-enhanced vs. Human-Generated Medical Templates”
Study number n-million demonstrating retrieval-augmented generation – that is to say, retrieving relevant reference material and folding it into the prompt given to the LLM – improves output and reduces “hallucinations”. In this specific instance, the LLM was effectively mimicking the output of BMJ Best Practice templates.
https://www.medrxiv.org/content/10.1101/2024.09.27.24314506v1

“Towards Democratization of Subspeciality Medical Expertise”
This is one of the AMIE demonstration projects, a Google DeepMind output in which an additional layer of “conversational” training has been superimposed upon the underlying model to improve responses to consultation-style questions. In this example, the conversational training is tuned to match the response style of genetic cardiology specialists from Stanford University – and the LLM’s content and style were arguably rated identically to the human specialists.
https://arxiv.org/abs/2410.03741

“The Limits of Clinician Vigilance as an AI Safety Bulwark”
A nice little commentary effectively articulating the elephant in the room: humans are terrible at “vigilance”. The present state of AI/LLM deployment in clinical medicine is replete with omissions and inaccuracies, and it’s not reasonable to simply trust clinicians to catch the mistakes. That said, the five suggested strategies to address vigilance seem … precarious.
https://jamanetwork.com/journals/jama/fullarticle/2816582

“Reviewer Experience Detecting and Judging Human Versus Artificial Intelligence Content: The Stroke Journal Essay Contest”
Who wins an essay competition on controversial topics in stroke and neurology – humans or LLMs? And can reviewers accurately guess which essays are human- versus LLM-generated? The answer, rather distressingly, is that reviewers mostly couldn’t distinguish between author types, and LLM composition quality was arguably higher than human.
https://pubmed.ncbi.nlm.nih.gov/39224979/

“Invasive Treatment Strategy for Older Patients with Myocardial Infarction”
The amusingly named “SENIOR-RITA” trial, in which elderly patients with NSTEMI were randomized to an invasive strategy versus a medical management strategy. While this may seem odd to those in the U.S., frailty and baseline life-expectancy are typical considerations for acute care in other countries. In these frail elderly patients, the invasive strategy reduced downstream non-fatal MI, but had no effect on cardiovascular death or the overall composite outcome.
https://pubmed.ncbi.nlm.nih.gov/39225274/

“Thrombolysis After Dabigatran Reversal for Acute Ischemic Stroke: A National Registry-Based Study and Meta-Analysis”
Just another bit from our neurology colleagues patting themselves on the back for doing backflips in order to give everyone thrombolysis. Registry data paired with a garbage-in-garbage-out systematic review and meta-analysis just amplifies the biases prevalent in the underlying practice culture.
https://pubmed.ncbi.nlm.nih.gov/39255429/

“Tenecteplase vs Alteplase for Patients With Acute Ischemic Stroke: The ORIGINAL Randomized Clinical Trial”
This trial makes a claim to be ORIGINAL, but at this point there’s really no general question remaining as to whether tenecteplase is a valid alternative to alteplase. It is reasonable to test specific doses in various populations with a goal of minimizing harms, of course.
https://pubmed.ncbi.nlm.nih.gov/39264623/

“Technology-Supported Self-Triage Decision Making: A Mixed-Methods Study”
Can laypersons improve their self-triage decision-making with the use of technology? This little preprint tests the Ada Health “symptom checker” app against interaction with a ChatGPT LLM and favors use of the symptom checker. Hardly rigorous enough to discard the chatbot as a possible tool, but certainly the LLM needs more prompt engineering and/or domain-specific training than just “out of the box”.
https://www.medrxiv.org/content/10.1101/2024.09.12.24313558v1

“Class I Recalls of Cardiovascular Devices Between 2013 and 2022: A Cross-Sectional Analysis”
A brief report looking at recalled cardiovascular devices is insufficient to support any broad conclusions, but it certainly demonstrates the regulatory bar for approval is inadequate. Most of the recalled devices did not require pre-market clinical testing, and those that did relied upon surrogate or composite endpoints to support approval.
https://pubmed.ncbi.nlm.nih.gov/39284187/

“Effectiveness of Direct Admission Compared to Admission Through the Emergency Department: A Stepped-Wedge Cluster-Randomized Trial”
It is a bit of an odd question whether a patient with a known requirement for admission needs to stop through the emergency department, or whether the patient can go straight to the ward. Unless a patient requires immediate resuscitation with the resources of an ED, it is very clearly appropriate for that patient to be directly admitted. That said, doing so requires new processes and practices – and this trial demonstrates such processes are feasible and safe (and almost certainly cheaper, by avoiding ED billing!).
https://pubmed.ncbi.nlm.nih.gov/39301600/

“Restrictive vs Liberal Transfusion Strategy in Patients With Acute Brain Injury: The TRAIN Randomized Clinical Trial”
The optimal cut-off is not known, but in this trial, the threshold for transfusion was 9 g/dL – and patients randomized to the liberal transfusion strategy did better. An injured brain does not like hypotension, and it does not like anemia.
https://pubmed.ncbi.nlm.nih.gov/39382241/

Let ChatGPT Guide Your Hand

This exploration of LLMs in the emergency department is somewhat novel in its conceptualization. While most demonstrations of generative AI applied to the ED involve summarization of records, digital scribing, or composing discharge letters, this one attempts clinical decision-support. That is to say, rather than attempting to streamline or unburden clinicians from some otherwise time-intensive task, the LLM here taps into its ability to act as a generalized prediction engine – and tries its hand at prescriptive recommendations.

Specifically, the LLM – here GPT-3.5T and GPT-4T – is asked:

  • Should this patient be admitted to the hospital?
  • Does this patient require radiologic investigations?
  • Does this patient require antibiotics?

Considering we’ve seen general LLMs perform admirably on various medical licensing examinations, ought not these tools be able to get the meat off the bone in real life?
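For concreteness, here is a minimal sketch of how one of these questions might be posed programmatically – assuming the current OpenAI chat-completions client and a hypothetical, de-identified note string; the authors’ actual prompting pipeline is not reproduced here:

```python
# Minimal sketch (not the authors' pipeline): posing one of the three
# questions to a chat model, given an isolated physician note.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

note = """Clinical history: ...
Examination: ...
Assessment/plan: ..."""  # hypothetical, de-identified note text

question = (
    "Should this patient be admitted to the hospital? "
    "Answer yes or no, then justify briefly."
)

response = client.chat.completions.create(
    model="gpt-4-turbo",  # stand-in for the GPT-4T model referenced in the study
    messages=[
        {"role": "system", "content": "You are a clinical decision-support assistant."},
        {"role": "user", "content": f"{note}\n\n{question}"},
    ],
    temperature=0,  # minimize sampling variability for evaluation
)
print(response.choices[0].message.content)
```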

Before even considering the results, there are multiple fundamental considerations relegating this published exploration to the realm of curiosity rather than insight:

  • This communication was submitted in October 2023 – meaning the LLMs used, while modern at the time, are debatably becoming obsolete. Likewise, the prompting methods are a bit simplistic and anachronistic – evidence has shown advantage to carefully constructed retrieval-augmented prompting.
  • The LLM was fed solely physician clinical notes – specifically the “clinical history”, “examination”, and “assessment/plan”. The LLM was therefore generating responses based on, effectively, an isolated completed medical assessment of a patient. This method excludes other data present in the record (vital signs, laboratory results, etc.), while also relying upon finished human documentation for its “decision-support”.
  • The prompts – “should”/“does” – replicate the intent of the decision-support concept of the exploration, but not the retrospective nature of the content. Effectively, what ought to have been asked of the LLMs – and the clinician reviewers – was “did this patient get admitted to the hospital?” or “did this patient receive antibiotics?” It would be mildly interesting to shift the question away from a somewhat subjective value judgement to a bit of an intent-inference exercise.
  • The clinician reviewers – one resident physician and one attending physician – did not much agree (73-83% agreement) on admission, radiology, and antibiotic determinations; a chance-corrected agreement sketch follows this list. It becomes very difficult to evaluate any sort of predictive or prescriptive intervention when the “gold standard” is so diaphanous. There is truly no accepted “gold standard” for these sorts of questions, as individual clinician admission rates and variations in practice are exceedingly wide. This is evidenced by the general inaccuracy displayed by just these two clinicians, whose own individual accuracy ranged from 74-83%, on top of that poor agreement.
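To illustrate why 73-83% raw agreement makes for a shaky gold standard, here is a small worked example of chance-corrected agreement (Cohen’s kappa) using entirely hypothetical counts – not the study’s data:

```python
# Hypothetical 2x2 agreement table for the admission question
# (illustrative numbers only, not the study's data).
a, b, c, d = 55, 12, 10, 23   # a: both say admit; d: both say discharge
n = a + b + c + d

p_observed = (a + d) / n      # raw agreement = 0.78
p_rater1_admit = (a + b) / n
p_rater2_admit = (a + c) / n
p_expected = (p_rater1_admit * p_rater2_admit
              + (1 - p_rater1_admit) * (1 - p_rater2_admit))

kappa = (p_observed - p_expected) / (1 - p_expected)
print(f"raw agreement = {p_observed:.2f}, kappa = {kappa:.2f}")
# 78% raw agreement here corresponds to kappa of roughly 0.5 - only "moderate" agreement.
```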

Now, after scratching the tip of the methodology and translation iceberg, the results: unusable.

GPT-4T, as to be expected, outperformed GPT-3.5T. But, regardless of the LLM prompted, there were clear patterns of inadequacy. Each LLM was quite sensitive in its prescription of admission or radiologic evaluation – but at the extreme sacrifice of specificity, with “false positives” nearly equalling the “true positives” in some cases. The reverse was true for antibiotic prescription, with a substantial drop in sensitivity, but improved specificity. For what it’s worth, of course, U.S. emergency departments are generally cesspools of aggressive empiric antibiotic coverage, driven by CMS regulations – so it may in fact be the LLM displaying astute clinical judgement here. The “sepsis measure fallout gestapo” might disagree, however.
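As a back-of-the-envelope illustration of what “false positives nearly equalling true positives” does to usefulness – hypothetical counts, not the published figures:

```python
# Illustrative confusion matrix for the admission question
# (hypothetical counts, not the published figures).
tp, fp, fn, tn = 80, 70, 5, 45

sensitivity = tp / (tp + fn)   # ~0.94 - nearly everyone needing admission is flagged
specificity = tn / (tn + fp)   # ~0.39 - most patients not needing admission are flagged too
ppv = tp / (tp + fp)           # ~0.53 - an "admit" recommendation is barely better than a coin flip

print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, PPV={ppv:.2f}")
```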

I can envision this approach not being entirely hopeless. The increasing use of LLM digital scribes is likely to improve the early data available to such predictive or prescriptive models. Other structured clinical data collected by electronic systems may be incorporated. Likewise, there are other clinical notes of potential value, including nursing and triage documentation. I hardly find this to be a dead-end idea at all, but the limitations of this exploration don’t shed much light except to direct future efforts.

“Evaluating the use of large language models to provide clinical recommendations in the Emergency Department”
https://www.nature.com/articles/s41467-024-52415-1

All the Pregnant Men in the EHR

Electronic health records, data warehouses, and data “lakes” are treasured resources in this modern era of model training. Various applications of precision medicine, “digital twins”, and other predictive mimicries depend on having the cleanest, most-accurate data feasible.

One of these data sets is “All of Us”, maintained by the National Institutes of Health. Considering its wide use, the authors ask a very reasonable question: how accurate is the information contained within? Considering it is not possible to individually verify the vast scope of clinical information as applied to each person included in the data set, these authors choose what ought to be a fairly reliable surrogate: with what frequency do male and female persons included in the data set have sex-discordant diagnoses?

The authors term their measure the “incongruence rate”, reflecting the frequency of sex-specific diagnoses incongruent with the recorded biological sex. The authors iteratively refined their list and sample set, ultimately settling on 167 sex-specific conditions where there ought to be very little ambiguity – mostly those related to pregnancy and disorders of female genitalia.

Rather amazingly, their overall finding was an “incongruence rate” of 0.86% – meaning nearly 1 in 100 of these sex-specific diagnoses were found on a person of the incorrect biological sex. For example, out of 4,200 patients coded with a finding of testicular hypofunction, 44 (1.05%) were female. Or, out of 2,101 coded for a finding of prolapse of female genital organs, 21 (1%) were male. The authors also performed further analyses exploring whether cis- or trans-gender misidentification was affecting these findings, and actually note the incongruence rate rose to 0.96%.
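The arithmetic of the measure itself is simple; a minimal sketch of how such an incongruence rate could be computed from a diagnosis extract – the column names are illustrative, not the actual All of Us schema:

```python
import pandas as pd

# Hypothetical extract: one row per (person, sex-specific condition) pair.
# Column names are illustrative, not the actual All of Us schema.
df = pd.DataFrame({
    "recorded_sex": ["female", "male", "male", "female", "male"],
    "condition":    ["testicular hypofunction", "testicular hypofunction",
                     "prolapse of female genital organs",
                     "prolapse of female genital organs",
                     "testicular hypofunction"],
    "expected_sex": ["male", "male", "female", "female", "male"],
})

incongruent = df["recorded_sex"] != df["expected_sex"]
print(f"incongruence rate = {incongruent.mean():.2%}")
# The paper's example works the same way: 44 of 4,200 testicular
# hypofunction records coded as female is roughly 1.05%.
```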

Specifics regarding limitations or flaws in this approach aside, the key insight is that of widespread inaccuracies within electronic health data – and systematic approaches to diagnostic incongruence may be useful methods for data cleansing.

“Navigating electronic health record accuracy by examination of sex incongruent conditions”
https://pubmed.ncbi.nlm.nih.gov/39254529/

When EHR Interventions Succeed … and Fail

This is a bit of a fascinating article with a great deal to unpack – and rightly published in a prominent journal.

The brief summary – this is a “pragmatic”, open-label, cluster-randomized trial in which a set of interventions designed to increase guideline-concordant care were rolled out via electronic health record tools. These interventions were further supported by “facilitators”, persons assigned to each practice in the intervention cohort to support uptake of the EHR tools. In this specific study, the underlying disease state was the triad of chronic kidney disease, hypertension, and type II diabetes. Each of these disease states has well-defined pathways for “optimal” therapy and escalation.

The most notable feature of this trial is the simple, negative topline result – rollout of this intervention had no reliably measurable effect on patient-oriented outcomes relating to disease progression or acute clinical deterioration. Delving below the surface provides a number of insights worthy of comment:

  • The authors could have easily made this a positive trial by having the primary outcome as change in guideline-concordant care, as many other trials have done. This is a lovely example of how surrogates for patient-oriented outcomes must always be critically appraised for the strength of their association.
  • The entire concept of this trial is likely passively traumatizing to many clinicians – being bludgeoned by electronic health record reminders and administrative nannying to increase compliance with some sort of “quality” standard. Despite all these investments, alerts, and nagging – patients did no better. As above, since many of these trials simply measure changes in behavior as their endpoints, it likely leaves many clinicians feeling sour seeing results like these where patients are no better off.
  • The care “bundle” and its lack of effect size is notable, although it ought to be noted the patient-oriented outcomes here for these chronic, life-long diseases are quite short-term. The external validity of findings demonstrated in clinical trials frequently falls short when generalized to the “real world”. The scope of the investment here and its lack of patient-oriented improvement is a reminder of the challenges in medicine regarding evidence of sufficient strength to reliably inform practice.

Not an Emergency Medicine article, per se, but certainly describes the sorts of pressures on clinical practice pervasive across specialties.

“Pragmatic Trial of Hospitalization Rate in Chronic Kidney Disease”
https://www.nejm.org/doi/full/10.1056/NEJMoa2311708

Quick Hit: Elders Risk Assessment

A few words regarding an article highlighted in one of my daily e-mails – a report regarding the Elders Risk Assessment tool (ERA) from the Mayo Clinic.

The key to the highlight is the assertion that this score can be easily calculated and presented in context to clinicians during primary care visits, allowing patients with higher scores to be easily identified for preventive interventions. With an AUC of 0.84, the authors are rather chuffed about the overall performance. In fact, they close their discussion with this rosy outlook:

The adoption of a proactive approach in primary care, along with the implementation of a predictive clinical score, could play a pivotal role in preventing critical illnesses, benefiting patients and optimizing healthcare resource allocation.

Completely missing from their limitations is the fact that prognostic scores are not prescriptive. The ERA is based on age, recent hospitalizations, and chronic illness. The extent to which the management of any of these issues can be addressed “proactively” in the current primary care environment, and demonstrate a positive impact on patient-oriented outcomes, remains to be shown.

To claim a scoring system is going to better the world, it is necessary to compare decisions made with formal prompting by the score to decisions made without – several steps removed from performing a retrospective evaluation to generate an AUC. It ought also to be appreciated that some decisions based on high ERA scores will increase resource utilization without a corresponding beneficial effect on health, while lower scores may likewise inappropriately bias clinical judgement.
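For context, the reported AUC is the product of exactly this kind of retrospective exercise. A minimal sketch – using made-up scores and outcomes, not the ERA validation data – of how such a figure is generated, and why it says nothing about the effect of acting on the score:

```python
# Made-up scores and outcomes, not the ERA validation data: the AUC below
# measures only how well the score ranks patients who later became
# critically ill above those who did not - it says nothing about whether
# acting on the score changes anyone's outcome.
from sklearn.metrics import roc_auc_score

era_scores       = [1, 3, 8, 7, 9, 12, 2, 6, 5, 10]  # hypothetical ERA values
critical_illness = [0, 0, 0, 1, 1, 1, 0, 0, 1, 1]    # hypothetical outcomes

print(f"AUC = {roc_auc_score(critical_illness, era_scores):.2f}")  # illustrative only
```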

This article has only passing applicability to emergency medicine, but the same issues regarding the disutility of “prognosis” apply widely.

“Individualized prediction of critical illness in older adults: Validation of an elders risk assessment model”
https://agsjournals.onlinelibrary.wiley.com/doi/abs/10.1111/jgs.18861

Everyone’s Got ChatGPT Fever!

And, most importantly, if you put the symptoms related to your fever into ChatGPT, it will generate a reasonable differential diagnosis.

“So?”

This brief report in Annals describes a retrospective experiment in which 30 written case summaries lifted from the electronic documentation system were fed to either clinician teams or ChatGPT. The clinician teams (either an internal medicine or emergency medicine resident, plus a supervising specialist) and ChatGPT were asked to generate a “top 5” of differential diagnoses, and then settle upon one “most likely” diagnosis. Each case was tested both solely on the recorded narrative, as well as with laboratory results added.

The long and short of this brief report is that the lists of diagnoses generated contained the correct final diagnosis with similar frequency – about 80-90% of the time. The correct leading diagnosis was chosen from these lists about 60% of the time by each. Overlap between clinicians and ChatGPT in their lists of diagnoses was, likewise, about 50-60%.

The common reaction: wow! ChatGPT is every bit as good as a team of clinicians. We ought to use ChatGPT to fill in gaps where clinician resources are scarce, or to generally augment clinicians contemporaneously.

This may indeed be a valid reaction, and, looking at the healthcare funding environment, it is clear billions of dollars are being thrown at the optimistic interpretation of these types of studies. However, what is lacking from these studies is any sort of comparison. Prior to ChatGPT, clinicians did not operate in an information-resource vacuum, as they are frequently made to do in these contrived situations. When faced with clinical ambiguity, clinicians (and patients) have used general search engines, in addition to medical knowledge-specific resources (e.g., UpToDate), as augments. These ChatGPT studies, much like many decision-support studies, are generally quite light on testing clinical utility and implementation in real-world contexts.

Medical applications of large language models are certainly interesting, but it is always valuable to remember LLMs are not “intelligent” – they are simply pattern-matching and generation tools. They may, or may not, provide reliable improvement over current information search strategies available to clinicians.

“ChatGPT and Generating a Differential Diagnosis Early in an Emergency Department Presentation”

When ChatGPT Writes a Research Paper

It is safe to say the honeymoon phase of large language models has started to fade a bit. Yes, they can absolutely pass a medical licensing examination when given carefully constructed prompts. The focus now turns to practical applications – like, in this example, using ChatGPT to write an entire scientific paper for you!

There is no reason to go through the details of the paper, the content, the findings, or any aspect of fruit and vegetable consumption. It is linked only to prove that it exists, and that it was written in its entirety by an LLM. To create the article, the authors used prompts containing the actual data set, prompts for an introduction, summary tables, and a discussion – impressively, as part of an automated prompting engine written by the authors, not just a laborious manual process. The initial output was not, as you might expect, entirely appropriate, requiring substantial re-prompting and revision – but, in the end, as you may see, the final product is basically indistinguishable from an undergraduate- or graduate-student-level paper.

There were, of course, hallucinations, banal unfounded declarations, and the expected fabricated references. But, considering that a year or two ago no one would have suggested an LLM could write any semblance of a robust research paper, this is still fairly amazing. Considering this sort of writing is close to peak intellectual accomplishment, it’s fair to say similar automated techniques may replace a great deal of lesser content generation.

“The Impact of Fruit and Vegetable Consumption and Physical Activity on Diabetes Risk among Adults”
https://www.nature.com/articles/d41586-023-02218-z

Sepsis Alerts Save Lives!

Not doctors, of course – the alerts.

This is one of those “we had to do it, so we studied it” sorts of evaluations because, as most of us have experienced, the decision to implement the sepsis alerts is not always driven by pent-up clinician demand.

The authors describe this as a sort of “natural experiment”, in which a phased or stepped roll-out allows for some presumption of control for the unmeasured cultural and process confounders limiting pre-/post- studies. In this case, the decision was made to implement the “St John Sepsis Algorithm” developed by Cerner. This algorithm is composed of two alerts – one somewhat SIRS- or inflammation-based for “suspicion of sepsis”, and one incorporating organ dysfunction for “suspicion of severe sepsis”. The “phased” part of the roll-out involved turning on the alerts first in the acute inpatient wards, then the Emergency Department, and then the specialty wards. Prior to being activated, however, the alert algorithm ran “silently” to create the comparison group of those for whom an alert would have been triggered.
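The Cerner algorithm itself is proprietary, but the general shape of a two-tier rule like this can be sketched with ordinary SIRS criteria plus a crude organ-dysfunction check – a stand-in illustration, not the actual St John Sepsis Algorithm:

```python
# Stand-in illustration of a two-tier sepsis alert, NOT the proprietary
# St John Sepsis Algorithm: tier 1 is an inflammation/SIRS-style screen,
# tier 2 adds a crude organ-dysfunction check.
def sirs_count(temp_c, hr, rr, wbc_k):
    return sum([
        temp_c > 38.0 or temp_c < 36.0,
        hr > 90,
        rr > 20,
        wbc_k > 12.0 or wbc_k < 4.0,
    ])

def sepsis_alert(temp_c, hr, rr, wbc_k, sbp, lactate):
    suspicion_of_sepsis = sirs_count(temp_c, hr, rr, wbc_k) >= 2
    organ_dysfunction = sbp < 90 or lactate > 2.0
    suspicion_of_severe_sepsis = suspicion_of_sepsis and organ_dysfunction
    return suspicion_of_sepsis, suspicion_of_severe_sepsis

# Example: a febrile, tachycardic patient with normal perfusion fires tier 1 only.
print(sepsis_alert(temp_c=38.6, hr=112, rr=22, wbc_k=13.5, sbp=118, lactate=1.4))
```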

The short summary:

  • In their inpatient wards, mortality among patients meeting alert criteria decreased from 6.4% to 5.1%.
  • In their Emergency Departments, admitted patients meeting alert criteria were less likely to have a ≥7-day inpatient length-of-stay.
  • In their Emergency Departments, antibiotic administration to patients meeting alert criteria within 1 hour of the alert firing increased from 36.9% to 44.7%.

There are major problems here, of course, both intrinsic to their study design and otherwise. While it is a “multisite” study, there are only two hospitals involved. The “phased” implementation was not the typical different-hospitals-at-different-times design, but rather phased within each hospital. They report inpatient mortality changes without actually reporting any changes in clinician behavior between the pre- and post- phases, i.e., what did clinicians actually do in response to the alerts? Then, they look at timely antibiotic administration, but they do not look at overall antibiotic volume or the various unintended consequences potentially associated with this alert. Did admission rates increase? Did the percentage of discharged patients receiving intravenous antibiotics increase? Did Clostridium difficile infection rates increase?

Absent the funding and infrastructure to better prospectively study these sorts of interventions, these “natural experiments” can be useful evidence. However, these authors do not seem to have taken an expansive enough view of their data with which to fully support an unquestioned conclusion of benefit to the alert intervention.

“Evaluating a digital sepsis alert in a London multisite hospital network: a natural experiment using electronic health record data”

https://academic.oup.com/jamia/advance-article/doi/10.1093/jamia/ocz186/5607431

Saving Lives with Lifesaving Devices

Automated external defibrillators are quite useful in many cases of out-of-hospital cardiac arrest – specifically, the so-called “shockable rhythms” in which defibrillation is indicated. Ventricular fibrillation, if treated with only bystander conventional cardiopulmonary resuscitation, has dismal survival. However, when no AED is available, the issue is moot.

This is an interesting simulation exercise looking to place AEDs such that availability might be improved in cases of cardiac arrest. These authors pulled every AED location in Denmark, along with the locations of all OHCAs between 2007 and 2016. Then, they used all OHCAs from 1994 until 2007 as their “training set” to derive optimal AED placements with which to simulate. Optimal AED placements were dichotomized into “intervention #1” and “intervention #2” based on whether the building location provided business-hours access or 24/7 access.
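Whatever the authors’ exact optimization, the core task resembles a maximum-coverage problem: choose a limited number of AED sites so that as many historical arrests as possible fall within a walkable radius. A greedy sketch, with entirely made-up coordinates rather than the Danish registry data:

```python
# Greedy maximum-coverage sketch for AED placement (made-up coordinates,
# not the Danish registry data): repeatedly choose the candidate site
# covering the most not-yet-covered historical arrests within `radius_m`.
import math

def covers(site, arrest, radius_m=100.0):
    # crude planar distance; a real analysis would use geodesic distances
    return math.dist(site, arrest) <= radius_m

def greedy_aed_placement(candidate_sites, arrests, n_aeds):
    placed, uncovered = [], set(range(len(arrests)))
    for _ in range(n_aeds):
        best_site, best_covered = None, set()
        for site in candidate_sites:
            covered = {i for i in uncovered if covers(site, arrests[i])}
            if len(covered) > len(best_covered):
                best_site, best_covered = site, covered
        if best_site is None:
            break  # no remaining site covers any uncovered arrest
        placed.append(best_site)
        uncovered -= best_covered
    return placed, 1 - len(uncovered) / len(arrests)

arrests = [(0, 0), (50, 20), (300, 300), (320, 280), (900, 900)]
sites = [(10, 10), (310, 290), (600, 600)]
placement, coverage = greedy_aed_placement(sites, arrests, n_aeds=2)
print(placement, f"coverage = {coverage:.0%}")  # two sites cover 4 of 5 arrests
```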

In the “real world” of 2007 to 2016, AED coverage of OHCA was 22.0%, leading to 14.6% bystander defibrillation. Based on their simulations and optimization, these authors propose potential 33.4% and 43.1% coverage, depending on business hours, leading to increases in bystander defibrillation of 22.5% and 26.9%. This improved coverage and bystander defibrillation would give an absolute increase in survival, based on the observed rate, of 3.4% and 4.1% over the study period.

This is obviously a simulation, meaning all these projected numbers are fictitious and subject to the imprecision of the inputs, along with extrapolated outcomes. However, the underlying principle of trying to intelligently match AED access to OHCA volume is certainly reasonable. It is hard to argue against distributing a limited resource in some data-driven fashion.

“In Silico Trial of Optimized Versus Actual Public Defibrillator Locations”
https://www.sciencedirect.com/science/article/pii/S0735109719361649