Publication Potpourri

Clearing the backlog of mildly interesting articles that will never get a full write-up – here’s a quick hit on a lucky 13 of the most interesting!

“Lactated Ringer vs Normal Saline Solution During Sickle Cell Vaso-Occlusive Episodes”
A “target trial emulation” providing observational evidence supporting the superiority of lactated Ringer’s solution over normal saline for the resuscitation of patients admitted with sickle cell crises. However, only a small fraction of patients actually received LR, and those who did received smaller volumes overall. Frankly, it’s hard to separate these observations from the general concern that euvolemic SCD patients are simply receiving far too much fluid during admission.
https://pubmed.ncbi.nlm.nih.gov/39250114/

“Empathy and clarity in GPT-4-Generated Emergency Department Discharge Letters”
The cold, unfeeling computer generates better feels than emergency clinicians. Bland treacle, no doubt, but generic bland treacle will beat the terse output of a distracted human any day. The question now is how to combine the two to create a better human, where suitable.
https://www.medrxiv.org/content/10.1101/2024.10.07.24315034v1

“Mitigating Hallucinations in Large Language Models: A Comparative Study of RAG-enhanced vs. Human-Generated Medical Templates”
Study number n-million demonstrating retrieval-augmented generation – that is to say, retrieving relevant reference material and folding it into the prompt presented to an LLM – improves output and reduces “hallucinations”. In this specific instance, the LLM was effectively mimicking the output of BMJ Best Practice templates.
https://www.medrxiv.org/content/10.1101/2024.09.27.24314506v1
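
For those unfamiliar with the pattern, a minimal sketch of the RAG loop – retrieve, then prompt – is below. The embed and complete functions are hypothetical stand-ins for whatever embedding model and LLM client a real deployment would use; nothing here reflects the paper’s actual implementation.

# Minimal retrieval-augmented generation (RAG) loop.
# `embed` and `complete` are hypothetical placeholders for a real
# embedding model and LLM client.
from typing import Callable, List
import numpy as np

def retrieve(query: str, corpus: List[str],
             embed: Callable[[str], np.ndarray], k: int = 3) -> List[str]:
    """Return the k corpus passages most similar to the query."""
    q = embed(query)
    scores = [float(q @ (p := embed(passage)) /
                    (np.linalg.norm(q) * np.linalg.norm(p)))
              for passage in corpus]
    top = np.argsort(scores)[::-1][:k]
    return [corpus[i] for i in top]

def rag_answer(question: str, corpus: List[str], embed, complete) -> str:
    """Ground the LLM's answer in retrieved reference text."""
    context = "\n---\n".join(retrieve(question, corpus, embed))
    prompt = ("Answer using ONLY the reference material below. "
              "If the references do not contain the answer, say so.\n\n"
              f"References:\n{context}\n\nQuestion: {question}")
    return complete(prompt)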

“Towards Democratization of Subspeciality Medical Expertise”
This is one of the AIME demonstration projects, a Google DeepMind output in which an additional layer of “conversational” training has been superimposed upon the underlying model to improve responses to consultation-style questions. In this example, the conversational training is tuned to match the response style of genetic cardiology specialists from Stanford University – and the LLM’s content and style were arguably rated identically to those of the human specialists.
https://arxiv.org/abs/2410.03741

“The Limits of Clinician Vigilance as an AI Safety Bulwark”
A nice little commentary effectively articulating the elephant in the room: humans are terrible at “vigilance”. The present state of AI/LLM deployment in clinical medicine is replete with omissions and inaccuracies, and it’s not reasonable to simply trust clinicians to catch the mistakes. That said, the five suggested strategies to address vigilance seem … precarious.
https://jamanetwork.com/journals/jama/fullarticle/2816582

“Reviewer Experience Detecting and Judging Human Versus Artificial Intelligence Content: The Stroke Journal Essay Contest”
Who wins an essay competition on controversial topics in stroke and neurology – humans or LLMs? And can reviewers accurately guess which essays are human- versus LLM-generated? The answer, rather distressingly, is that reviewers mostly couldn’t distinguish between author types, and LLM composition quality was arguably higher than human.
https://pubmed.ncbi.nlm.nih.gov/39224979/

“Invasive Treatment Strategy for Older Patients with Myocardial Infarction”
The amusingly named “SENIOR-RITA” trial, in which elderly patients with NSTEMI were randomized to an invasive strategy versus a medical management strategy. While this may seem odd to those in the U.S., frailty and baseline life expectancy are typical considerations for acute care in other countries. In these frail elderly, the invasive strategy reduced downstream non-fatal MI, but had no effect on cardiovascular death or the overall composite outcome.
https://pubmed.ncbi.nlm.nih.gov/39225274/

“Thrombolysis After Dabigatran Reversal for Acute Ischemic Stroke: A National Registry-Based Study and Meta-Analysis”
Just another bit from our neurology colleagues patting themselves on the back for doing backflips in order to give everyone thrombolysis. Registry data paired with a garbage-in-garbage-out systematic review and meta-analysis just amplifies the biases prevalent in the underlying practice culture.
https://pubmed.ncbi.nlm.nih.gov/39255429/

“Tenecteplase vs Alteplase for Patients With Acute Ischemic Stroke: The ORIGINAL Randomized Clinical Trial”
This trial makes a claim to be ORIGINAL but, at this point, there’s really no general question remaining as to whether tenecteplase is a valid alternative to alteplase. It is reasonable to test specific doses in various populations with a goal of minimizing harms, of course.
https://pubmed.ncbi.nlm.nih.gov/39264623/

“Technology-Supported Self-Triage Decision Making: A Mixed-Methods Study”
Can laypersons improve their self-triage decision-making with the use of technology? This little preprint tests the Ada Health “symptom checker” app against interaction with a ChatGPT LLM, and favors use of the symptom checker. Hardly rigorous enough to discard the chatbot as a possible tool, but certainly the LLM needs more prompt engineering and/or domain-specific training than simply being used “out of the box”.
https://www.medrxiv.org/content/10.1101/2024.09.12.24313558v1

“Class I Recalls of Cardiovascular Devices Between 2013 and 2022: A Cross-Sectional Analysis”
A brief report looking at recalled cardiovascular devices is insufficient to support any broad conclusions, but it certainly demonstrates the regulatory bar for approval is inadequate: most devices did not require pre-market clinical testing, and those that did relied upon surrogate or composite endpoints to support approval.
https://pubmed.ncbi.nlm.nih.gov/39284187/

“Effectiveness of Direct Admission Compared to Admission Through the Emergency Department: A Stepped-Wedge Cluster-Randomized Trial”
It is a bit of an odd question whether a patient with a known requirement for admission needs to stop in the emergency department, or whether the patient can go straight to the ward. Unless a patient requires immediate resuscitation with the resources of an ED, it is very clearly appropriate for the patient to be directly admitted. That said, doing so requires new processes and practices – and this trial demonstrates such processes are feasible and safe (and almost certainly cheaper, avoiding ED billing!).
https://pubmed.ncbi.nlm.nih.gov/39301600/

“Restrictive vs Liberal Transfusion Strategy in Patients With Acute Brain Injury: The TRAIN Randomized Clinical Trial”
The optimal cut-off is not known, but in this trial the threshold for transfusion in the liberal strategy was 9 g/dL – and patients randomized to the liberal transfusion strategy did better. An injured brain does not like hypotension, and it does not like anemia.
https://pubmed.ncbi.nlm.nih.gov/39382241/

All the Pregnant Men in the EHR

Electronic health records, data warehouses, and data “lakes” are treasured resources in this modern era of model training. Various applications of precision medicine, “digital twins”, and other predictive mimicries depend on having the cleanest, most-accurate data feasible.

One of these data sets is “All of Us”, maintained by the National Institutes of Health. Considering its wide use, the authors ask a very reasonable question: how accurate is the information contained within? Considering it is not possible to individually verify the vast scope of clinical information as applied to each person included in the data set, these authors chose what ought to be a fairly reliable surrogate: with what frequency do male and female persons included in the data set have sex-discordant diagnoses?

The authors term their measure the “incongruence rate”, reflecting the frequency of sex-specific diagnoses incongruent with the recorded biological sex. The authors iteratively refined their list and sample set, ultimately settling on 167 sex-specific conditions where there ought to be very little ambiguity – the vast majority related to pregnancy and disorders of female genitalia.

Rather amazingly, their overall finding was an “incongruence rate” of 0.86% – meaning nearly 1 in 100 of these sex-specific diagnoses was found on a person of the incorrect biological sex. For example, of 4,200 patients coded with a finding of testicular hypofunction, 44 (1.05%) were female. Or, of 2,101 coded for a finding of prolapse of female genital organs, 21 (1%) were male. The authors also performed further analyses exploring whether cis- or transgender misidentification was affecting these findings, and actually note the incongruence rate rose to 0.96%.
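
The arithmetic is simple enough to sketch in a few lines of pandas, using the paper’s two worked examples above; the column names are hypothetical, not the study’s actual schema.

# Toy computation of the "incongruence rate" described above, using the
# paper's two worked examples; column names are hypothetical.
import pandas as pd

rows = pd.DataFrame({
    "condition": ["testicular hypofunction", "prolapse of female genital organs"],
    "expected_sex": ["male", "female"],
    "n_patients": [4200, 2101],
    "n_incongruent": [44, 21],   # recorded sex contradicts the diagnosis
})

rows["incongruence_rate"] = rows["n_incongruent"] / rows["n_patients"]
overall = rows["n_incongruent"].sum() / rows["n_patients"].sum()
print(rows[["condition", "incongruence_rate"]])
print(f"pooled rate across these two examples: {overall:.2%}")  # ~1.03%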

Specifics regarding limitations or flaws in this approach aside, the key insight is that of widespread inaccuracies within electronic health data – and systematic approaches to diagnostic incongruence may be useful methods for data cleansing.

“Navigating electronic health record accuracy by examination of sex incongruent conditions”
https://pubmed.ncbi.nlm.nih.gov/39254529/

Quick Hit: Elders Risk Assessment

A few words regarding an article highlighted in one of my daily e-mails – a report regarding the Elders Risk Assessment tool (ERA) from the Mayo Clinic.

The key to the highlight is the assertion that this score can be easily calculated and presented in-context to clinicians during primary care visits, allowing patients with higher scores to be easily identified for preventive interventions. With an AUC of 0.84, the authors are rather chuffed about the overall performance. In fact, they close their discussion with this rosy outlook:

The adoption of a proactive approach in primary care, along with the implementation of a predictive clinical score, could play a pivotal role in preventing critical illnesses, benefiting patients and optimizing healthcare resource allocation.

Completely missing from their limitations is the recognition that prognostic scores are not prescriptive. The ERA is based on age, recent hospitalizations, and chronic illness. The extent to which the management of any of these issues can be addressed “proactively” in the current primary care environment, and demonstrate a positive impact on patient-oriented outcomes, remains to be demonstrated.

To claim a scoring system is going to better the world, it is necessary to compare decisions made with formal prompting by the score to decisions made without – several steps removed from performing a retrospective evaluation to generate an AUC. It ought also to be appreciated that some decisions based on high ERA scores will increase resource utilization without a corresponding beneficial effect on health, while lower scores may likewise inappropriately bias clinical judgement.
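
For illustration, here is a minimal sketch of the retrospective exercise that yields an AUC – on synthetic data, with an invented score distribution – underlining that the metric captures discrimination only, and is silent on whether acting on the score helps anyone.

# Sketch of the retrospective exercise that produces an AUC:
# discrimination between outcomes, not the effect of acting on the score.
# All data here are synthetic, purely for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 5000
era_score = rng.normal(5, 3, n).clip(0, 20)          # hypothetical ERA-like score
p_critical = 1 / (1 + np.exp(-(era_score - 9) / 2))  # risk rises with score
critical_illness = rng.random(n) < p_critical

print(f"AUC: {roc_auc_score(critical_illness, era_score):.2f}")
# A high AUC says nothing about whether any intervention triggered by a
# high score changes these outcomes - that requires comparing decisions
# made with vs. without the score.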

This article has only passing applicability to emergency medicine, but the same issues regarding the disutility of “prognosis” apply widely.

“Individualized prediction of critical illness in older adults: Validation of an elders risk assessment model”
https://agsjournals.onlinelibrary.wiley.com/doi/abs/10.1111/jgs.18861

The United Colors of Sepsis

Here it is: sepsis writ Big Data.

And, considering it’s Big Data, it’s also a big publication: a 15-page primary publication, plus 90+ pages of online supplement – dense with figures, raw data, and methods both routine and novel for the evaluation of large data sets.

At the minimum, to put a general handle on it, this work primarily demonstrates the heterogeneity of sepsis. As any clinician knows, “sepsis” – with its ever-morphing definition – ranges widely from those generally well in the Emergency Department to those critically ill in the Intensive Care Unit. In an academic sense, this means the patients enrolled and evaluated in various trials for the treatment of sepsis may be quite different from one another, and results seen in one trial or setting may generalize poorly to another. This has obvious implications when trying to determine a general set of care guidelines from these disparate bits of data, and results in further issues down the road when said guidelines become enshrined in quality measures.

Overall, these authors ultimately define four phenotypes of sepsis, helpfully assigned descriptive labels using letters of the Greek alphabet. These four phenotypes are derived from retrospective administrative data, then validated on additional retrospective administrative data, and finally on the raw data from several prominent clinical trials in sepsis, including ACCESS, PROWESS, and ProCESS. The four phenotypes were derived by clustering and refinement, and are described by the authors as, effectively: a mild type with low mortality; a cohort of those with chronic illness; a cohort with systemic inflammation and pulmonary disease; and a final cohort with liver dysfunction, shock, and high mortality.

We are quite far, however, from needing to apply these phenotypes in a clinical fashion. Any classification model is highly dependent upon its inputs, and in this study the inputs are the sorts of routine clinical data available from the electronic health record: vital signs, demographics, and basic labs. Missing data were common – lactate levels, for example, were not obtained in 80% of patients in their model. These inputs then dictate how many different clusters you obtain, how the relative accuracy of classification diminishes with greater numbers of clusters, as well as whether the model begins to overfit the derivation data set.
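
A stripped-down sketch of this style of phenotype derivation – impute, standardize, cluster, and compare cluster quality across candidate values of k – follows. The variable names are illustrative, and the authors’ actual consensus k-means pipeline was considerably more elaborate.

# Stripped-down phenotype-derivation pipeline: impute missing routine
# labs/vitals, standardize, cluster, and inspect how cluster quality
# changes with k. Variable names are illustrative only.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def derive_phenotypes(ehr: pd.DataFrame, k_range=range(2, 7)):
    """Cluster encounters on routine EHR variables; return labels and a
    quality score per candidate k (assumes well over 2,000 encounters)."""
    X = SimpleImputer(strategy="median").fit_transform(ehr)  # e.g. the 80%-missing lactate
    X = StandardScaler().fit_transform(X)
    results = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        quality = silhouette_score(X, labels, sample_size=2000, random_state=0)
        results[k] = (labels, quality)
    return results

# ehr = pd.DataFrame with columns like heart_rate, sbp, temperature,
#       creatinine, bilirubin, lactate, age ... (hypothetical)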

Then, there is a bit of fuzziness to any application, in the sense these data represent different types of patients with sepsis as much as they represent different types of sepsis. Consider the varying etiologies of sepsis, including influenza pneumonia, streptococcal toxic shock, or gram-negative bacteremia. These different etiologies would obviously result in different host responses depending on individual patient features. The phenotypes derived here effectively mash up the causative agent with the underlying host, muddying clinical application.

If clinical utility is limited, then what might be the best use of this work? Well, this goes back to the idea above regarding translating work from clinical trials to different settings. A community Emergency Department might primarily see alpha-sepsis, a community ICU might see a lot of beta-sepsis, while an academic ICU might see predominantly delta-sepsis. These are important concepts to consider – and potentially subgroup analyses to perform – when evaluating the outcomes of clinical trials. These authors run several simulations of clinical trials while varying the composition of sepsis phenotypes, and note potentially important effects on primary outcomes. Pathways of care or resuscitation protocols could potentially be more readily compared between trial populations if these phenotypes were calculated.
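
A toy Monte Carlo makes the trial-composition point concrete. The phenotype-specific mortality rates and treatment effects below are invented purely for illustration, not taken from the paper:

# Toy simulation of the point above: the same therapy "works" or "fails"
# in a trial depending on the phenotype mix enrolled. All rates and
# effects are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
base_mortality = {"alpha": 0.05, "beta": 0.13, "gamma": 0.24, "delta": 0.40}
treatment_effect = {"alpha": 0.00, "beta": 0.00, "gamma": -0.04, "delta": -0.10}  # negative = benefit

def simulate_trial(mix: dict, n_per_arm: int = 2000) -> float:
    """Return the observed absolute risk difference (treated - control)."""
    phenos = rng.choice(list(mix), size=n_per_arm, p=list(mix.values()))
    control = np.array([rng.random() < base_mortality[p] for p in phenos])
    treated = np.array([rng.random() < base_mortality[p] + treatment_effect[p]
                        for p in phenos])
    return treated.mean() - control.mean()

ed_mix  = {"alpha": 0.70, "beta": 0.20, "gamma": 0.08, "delta": 0.02}  # community ED
icu_mix = {"alpha": 0.10, "beta": 0.25, "gamma": 0.35, "delta": 0.30}  # academic ICU
print(f"ED-heavy trial risk difference:  {simulate_trial(ed_mix):+.3f}")
print(f"ICU-heavy trial risk difference: {simulate_trial(icu_mix):+.3f}")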

This is a challenging work to process – but an important first step in better recognizing the heterogeneity in potential benefits and harms resulting from various interventions. The accompanying editorial does an excellent job of describing the methods, outcomes, and utility, as well.

“Derivation, Validation, and Potential Treatment Implications of Novel Clinical Phenotypes for Sepsis”
https://jamanetwork.com/journals/jama/fullarticle/2733996

“New Phenotypes for Sepsis”
https://jamanetwork.com/journals/jama/fullarticle/2733994

OK, Google: Discharge My Patient

Within my electronic health record, I have standardized discharge instructions in many languages. Some of these – such as Spanish – I can read or edit with some fluency, while with others – such as Vietnamese – I have no facility whatsoever. These function adequately as general reading material regarding any specific diagnosis made in the Emergency Department.

However, frequently, additional free text clarification is necessary regarding a treatment plan – whether it be time until suture removal, specifics about follow-up, or clarifications relevant to an individual patient. This level of language art is beyond my capacity in Spanish, let alone any sort of logographic or morphographic writing.

These authors performed a simple study in which they processed 100 free-text Emergency Department discharge instructions through the Google Translate blender to produce Spanish- and Chinese-language editions. The accuracy of the Spanish translation was 92%, as measured by the proportion of sentences preserving meaning and readability. Chinese fared less well, at 81%. Finally, the authors assessed the errors for clinical relevance and potential harm – and found 2% of Spanish instructions and 8% of Chinese instructions met their criteria.

Of course, there are a couple of potential strategies to mitigate these issues – including back-translating the text from the foreign language into English, as these authors did as part of their methods, or spending time verbally confirming the clarity of the written instructions with the patient. Instructions can also be improved prior to translation by avoiding abbreviations and utilizing simple sentence structures.
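
A sketch of such a back-translation check might look like the following, with translate() as a placeholder for whatever machine-translation service is wired in, and a deliberately crude similarity threshold:

# Back-translation sanity check for discharge instructions.
# `translate(text, source, target)` is a hypothetical placeholder for
# any machine-translation API; the similarity measure is deliberately crude.
from difflib import SequenceMatcher

def translate(text: str, source: str, target: str) -> str:
    raise NotImplementedError("wire up your MT service of choice here")

def backtranslation_check(english: str, target_lang: str,
                          threshold: float = 0.7) -> bool:
    """Flag instructions whose round trip drifts too far from the original."""
    forward = translate(english, "en", target_lang)
    round_trip = translate(forward, target_lang, "en")
    similarity = SequenceMatcher(None, english.lower(), round_trip.lower()).ratio()
    return similarity >= threshold  # below threshold: have a human review it

# e.g. backtranslation_check("Remove the sutures in 7 days.", "es")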

Imperfect as they may be, using a translation tool is still likely better than giving no written instruction at all.

“Assessing the Use of Google Translate for Spanish and Chinese Translations of Emergency Department Discharge Instructions”
https://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2725080

Don’t Rely on the EHR to Think For You

“The Wells and revised Geneva scores can be approximated with high accuracy through the automated extraction of structured EHR data elements in patients who underwent CTPA in the emergency department.”

Can it be done? Can the computer automatically discern your intent and extract pulmonary embolism risk-stratification from the structured data? And, with “high accuracy” as these authors tout in their conclusion?

IFF “high accuracy” means ~90%. That means one out of every ten patients in their sample was misclassified as low- or high-risk for PE. This is clinically useless.

The Wells classification, of course, depends heavily upon the 3 points assigned for “PE is most likely diagnosis” – so these authors simply assigned those 3 points for every case. This sort of works in a population selected explicitly because they underwent CTPA in the ED, but it is obviously a foundationally broken kludge. The revised Geneva score does not have a “gestalt” element, but there are still subjective examination features that may not make it into structured data – and, obviously, it performed just as well (which is to say, just as poorly) as the Wells tool.
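
To make the kludge explicit, here is roughly what such an automated Wells calculation reduces to – the standard criteria scored from structured fields (field names hypothetical), with the gestalt item hard-coded positive for every single patient:

# The structural problem, made explicit: a Wells score computed from
# structured EHR fields, with the 3-point "PE most likely" gestalt item
# hard-coded positive for everyone. Field names are hypothetical.

def wells_from_ehr(enc: dict) -> float:
    score = 3.0  # "PE is most likely diagnosis" - assumed positive for ALL cases
    score += 3.0 if enc.get("dvt_signs") else 0              # clinical signs of DVT
    score += 1.5 if enc.get("heart_rate", 0) > 100 else 0    # tachycardia
    score += 1.5 if enc.get("recent_immobilization_or_surgery") else 0
    score += 1.5 if enc.get("prior_pe_or_dvt") else 0
    score += 1.0 if enc.get("hemoptysis") else 0
    score += 1.0 if enc.get("active_malignancy") else 0
    return score  # > 4 is conventionally "PE likely"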

To put it mildly, these authors are overselling their work a little bit. The electronic health record will always depend on the data entered – and any tool sets itself up for failure if it depends on specific elements entered by the clinician contemporaneously during the evaluation. Tools such as these have promise – but perhaps not this specific application.

“Automated Pulmonary Embolism Risk Classification and Guideline Adherence for Computed Tomography Pulmonary Angiography Ordering”
https://onlinelibrary.wiley.com/doi/abs/10.1111/acem.13442

It’s Sepsis-Harassment!

The computer knows all in modern medicine. The electronic health record is the new Big Brother, all-seeing, never un-seeing. And it sees “sepsis” – a lot.

This is a report on the downstream effects of an electronic sepsis alert system at an academic medical center. Their sepsis alert system was based loosely on the systemic inflammatory response syndrome (SIRS) criteria for the initial warning to nursing staff, followed by additional alerts triggered by hypotension or elevated lactate. These alerts prompted use of sepsis order sets or triggering of internal “sepsis alert” protocols. The outcomes of interest in their analysis were length-of-stay and in-hospital mortality.
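
As a rough illustration – the study’s actual rule was only “loosely” SIRS-based, and these field names are hypothetical – a first-stage alert of this style amounts to little more than:

# Sketch of a SIRS-based first-stage alert using the textbook criteria
# (band-count criterion omitted). Field names are hypothetical.

def sirs_criteria_met(v: dict) -> int:
    """Count of SIRS criteria satisfied by the current vitals/labs."""
    n = 0
    if v.get("temp_c", 37) > 38 or v.get("temp_c", 37) < 36: n += 1
    if v.get("heart_rate", 0) > 90: n += 1
    if v.get("resp_rate", 0) > 20 or v.get("paco2_mmhg", 40) < 32: n += 1
    wbc = v.get("wbc_k_per_ul")
    if wbc is not None and (wbc > 12 or wbc < 4): n += 1
    return n

def fire_sepsis_alert(v: dict) -> bool:
    # First-stage alert at >= 2 criteria; escalation on hypotension or
    # elevated lactate would follow, as in the study's tiered design.
    return sirs_criteria_met(v) >= 2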

At first glance, the alert appears to be a success – length of stay dropped from 10.1 days to 8.6 days, and in-hospital mortality from 8.5% to 7.0%. It would have been quite simple to stop there and trumpet these results as favoring the alerts, but the additional analyses performed by these authors demonstrate otherwise. Both length-of-stay and mortality were already trending downward regardless of the intervention, and in the adjusted analyses none of the improvements could be conclusively tied to the sepsis alerts – with some of the apparent benefit relating to diagnoses of less-severe cases of sepsis probably prompted by the alert itself.

What is not debatable, however, is the burden on clinicians and staff. During their ~2.5 year study period, the sepsis alerts were triggered 97,216 times – 14,207 of which occurred in the 2,144 patients subsequently receiving a final diagnosis of sepsis. The SIRS-based alerts comprised most (83,385) of these alerts, but captured only 73% of those with an ultimate diagnosis of sepsis while carrying only a 13% true-positive rate. The authors’ conclusion gets it right:

Our results suggest that more sophisticated approaches to early identification of sepsis patients are needed to consistently improve patient outcomes.

“Impact of an emergency department electronic sepsis surveillance system on patient mortality and length of stay”
https://academic.oup.com/jamia/article-abstract/doi/10.1093/jamia/ocx072/4096536/Impact-of-an-emergency-department-electronic

No Change in Ordering Despite Cost Information

Everyone hates the nanny state. When the electronic health record alerts and interrupts clinicians incessantly with decision-“support”, it results in all manner of deleterious unintended consequences. Passive, contextual decision-support has the advantage of avoiding this intrusiveness – but is it effective?

It probably depends on the application, but in this trial, it was not. This is the PRICE (Pragmatic Randomized Introduction of Cost data through the Electronic health record) trial, in which 75 inpatient laboratory tests were randomized to display of usual ordering, or ordering with contextual Medicare cost information. The hope and study hypothesis was that the availability of this cost information would exert a cultural pressure of sorts on clinicians to order fewer tests, particularly those with the highest costs.

Across three Philadelphia-area hospitals comprising 142,921 hospital admissions in a two-year study period, there were no meaningful differences in lab tests ordered per patient-day between the intervention and control groups. Looking at various subgroups of patients, it is also unlikely there were particularly advantageous effects in any specific population.

Interestingly, one piece of feedback the authors report is that residents attributed most of their routine lab test ordering to admission order sets. “Routine” daily labs are set in motion at the time of admission, not as part of a daily assessment of need, and are thus a natural impediment to reducing low-value testing. However, the authors also note – and this is probably most accurate – that because the cost information was displayed ubiquitously, physicians likely became numb to the intervention. It is reasonable to expect substantially more selective cost displays could have focused effects on an area of particularly high cost or low value.

“Effect of a Price Transparency Intervention in the Electronic Health Record on Clinician Ordering of Inpatient Laboratory Tests”
http://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2619519

Oh, The Things We Can Predict!

Philip K. Dick presented us with a short story about the “precogs”, three mutants who foresaw all crime before it could occur. “The Minority Report” was written in 1956 – and now, 60 years later, we do indeed have all manner of digital tools to predict outcomes. However, I doubt Steven Spielberg will be adapting a predictive model for hospitalization for cinema.

This is a rather simple article looking at a single-center experience using multivariate logistic regression to predict hospitalization. It differs somewhat from the existing art in that it uses data available at 10, 60, and 120 minutes after arrival to the Emergency Department as the basis for its “progressive” modeling.
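
The “progressive” idea is easy to sketch: one logistic model per time slice, each restricted to the features available by that point. The feature names and data layout below are hypothetical, not the study’s actual variables:

# One admission-prediction model per time horizon, each restricted to
# features plausibly available by that minute. Feature names hypothetical;
# all features assumed numeric (categoricals would need encoding).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

FEATURES_BY_MINUTE = {
    10:  ["age", "triage_acuity", "arrival_mode"],
    60:  ["age", "triage_acuity", "arrival_mode", "n_lab_orders", "n_imaging_orders"],
    120: ["age", "triage_acuity", "arrival_mode", "n_lab_orders", "n_imaging_orders",
          "n_consult_orders", "bed_requested"],
}

def fit_progressive_models(visits: pd.DataFrame):
    """Train and score one admission-prediction model per time horizon."""
    models = {}
    for minute, cols in FEATURES_BY_MINUTE.items():
        X_tr, X_te, y_tr, y_te = train_test_split(
            visits[cols], visits["admitted"], test_size=0.25, random_state=0)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
        models[minute] = (model, auc)
    return models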

Based on 58,179 visits ending in discharge and 22,683 resulting in hospitalization, the specificity of their prediction method was 90%, with a sensitivity of 96%, for an AUC of 0.97. Their work exceeds prior studies mostly on account of improved specificity, compared with AUCs generally between 0.85 and 0.89 for a sample of other predictive models.

Of course, their model is of zero value to other institutions, as it overfits not only this subset of data but also the specific practice patterns of the physicians in their hospital. Their results could also conceivably be improved, as they do not actually take into account any test results – only the presence of the order for such. That said, I think it is reasonable to expect similar performance from temporal models for predicting admission incorporating these earliest orders and entries in the electronic health record.

For hospitals interested in improving patient flow and anticipating disposition, there may be efficiencies to be developed from this sort of informatics solution.

“Progressive prediction of hospitalisation in the emergency department: uncovering hidden patterns to improve patient flow”
http://emj.bmj.com/content/early/2017/02/10/emermed-2014-203819

Excitement and Ennui in the ED

It goes without saying some patient encounters are more energizing and rewarding than others.  As a corollary, some chief complaints similarly suck the joy out of the shift even before beginning the patient encounter.

This entertaining study simply looks for time differentials in physician self-assignment on the electronic trackboard between presenting chief complaints. The general gist is that time-to-assignment serves as a surrogate for some composite of prioritization and/or desirability.

These authors looked at 30,382 presentations unrelated to trauma activations, and there were clear winners and losers. The figure of the ten shortest and ten longest complaints is a fairly concise summary of findings:

[Figure: door-to-evaluation times for the ten shortest and ten longest chief complaints]

Despite consistently longer self-assignment times for certain complaints, the absolute difference in minutes is still quite small.  Furthermore, there are always issues with relying on these time stamps, particularly for higher-acuity patients; the priority of “being at the patient’s bedside” always trumps such housekeeping measures.  I highly doubt ankle sprains and finger injuries are truly seen more quickly than overdoses and stroke symptoms.

Vaginal bleeding, on the other hand … is deservedly pulling up the rear.

“Cherry Picking Patients: Examining the Interval Between Patient Rooming and Resident Self-assignment”
http://www.ncbi.nlm.nih.gov/pubmed/26874338