The Hand that Feeds the Hand that Feeds the Hand that ….

Pharmaceutical development is all about the blockbuster drug. Many of our brightest minds are research scientists and bioinformaticians working to translate in vitro discoveries into improvements in the lives of humankind.

Many of our brightest minds are also working to ensure that, even if their drug candidates are – say – a little flawed, there’s a clinical trial design, trial implementation, and post-approval marketing plan to maximize return to shareholders.

A critical link in this chain of survival is publication – the higher the impact of the journal, the better. One of the gatekeepers to publication remains peer review, a critical step in ensuring the integrity, transparency, and reproducibility of science. Naturally, this step would be independent and untainted by bias, a vigilant final guardian protecting the public.

It would … wouldn’t it?

This brief report from JAMA finds it would be absurd even to imagine such a fanciful state of affairs. Evaluating lists of peer reviewers from The BMJ, JAMA, The Lancet, and The New England Journal of Medicine from 2022, the authors analyzed 1,962 U.S.-based physician reviewers. Of these, the Open Payments database indicated 58.9% had received payments from industry, totaling $1.06 billion between 2020 and 2022. The vast majority – $1.01 billion – comprised payments supporting research activities, with a mere $60 million going towards such things as consulting fees, speaker fees, honoraria, etc. Male reviewers and those in medical and surgical specialties, rather than primary care or hospital-based specialties, were the dominant recipients of said payments.

While these data are not meant to illuminate some sort of dark money ecosystem, it is clear the “peers” doing the reviews are playing at the same game. There is going to be an obvious bias towards allowing publication of content and spin consistent with the output the reviewers themselves would anticipate using in their own work. A receptive audience, if you will.

Just another happy reminder of how so much of our medical practice is swept along in a current powered by the many moneyed forces at work.

“Payments by Drug and Medical Device Manufacturers to US Peer Reviewers of Major Medical Journals”
https://jamanetwork.com/journals/jama/article-abstract/2824834

Publication Potpourri

Clearing the backlog of mildly interesting articles that will never get a full write-up – here’s a quick hit on a lucky 13 of the most interesting!

“Lactated Ringer vs Normal Saline Solution During Sickle Cell Vaso-Occlusive Episodes”
A “target trial emulation” providing observational evidence supporting the superiority of lactated Ringer’s solution over normal saline for the resuscitation of patients being admitted with sickle cell crises. However, only a small fraction of patients actually received LR, and those who did received smaller amounts overall. Frankly, it’s hard to separate these observations from the general concern that euvolemic SCD patients are simply receiving far too much fluid during admission.
https://pubmed.ncbi.nlm.nih.gov/39250114/

“Empathy and clarity in GPT-4-Generated Emergency Department Discharge Letters”
The cold, unfeeling computer generates better feels than emergency clinicians. Bland treacle, no doubt, but generic bland treacle will beat the terse output of a distracted human any day. The question now is how to combine the two to create a better human, where suitable.
https://www.medrxiv.org/content/10.1101/2024.10.07.24315034v1

“Mitigating Hallucinations in Large Language Models: A Comparative Study of RAG-enhanced vs. Human-Generated Medical Templates”
Study number n-million demonstrating retrieval-augmented generation – that is to say, retrieving relevant reference material and supplying it to the LLM as part of the prompt – improves output and reduces “hallucinations”. In this specific instance, the LLM was effectively mimicking the output of BMJ Best Practice templates.
https://www.medrxiv.org/content/10.1101/2024.09.27.24314506v1
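
For anyone who hasn’t seen the mechanics spelled out, the core of RAG is simply: retrieve relevant reference text and stuff it into the prompt so the model answers from supplied material rather than its parametric memory. A minimal sketch follows, assuming a toy in-memory corpus and keyword-overlap retrieval standing in for a real vector store and LLM call – illustrative only, not the authors’ pipeline:

    # Minimal RAG sketch: retrieve reference text relevant to the question and
    # prepend it to the prompt. The corpus and retrieval here are toy stand-ins.
    REFERENCE_SNIPPETS = [
        "Community-acquired pneumonia: assess CURB-65; consider chest radiograph.",
        "Pulmonary embolism: apply the Wells score before D-dimer or CT angiography.",
        "Cellulitis: mark the borders; empiric cover for streptococci and MSSA.",
    ]

    def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
        # Rank snippets by crude keyword overlap (a stand-in for embedding search).
        terms = set(query.lower().split())
        ranked = sorted(corpus, key=lambda s: len(terms & set(s.lower().split())), reverse=True)
        return ranked[:k]

    def build_prompt(question: str) -> str:
        # Ground the answer in retrieved text; instruct the model not to go beyond it.
        context = "\n".join(retrieve(question, REFERENCE_SNIPPETS))
        return (
            "Answer using ONLY the reference material below; say 'not found' otherwise.\n\n"
            f"Reference material:\n{context}\n\nQuestion: {question}"
        )

    print(build_prompt("Suspected pulmonary embolism - which investigation first?"))
    # The assembled prompt is then sent to whichever LLM is in use.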

“Towards Democratization of Subspeciality Medical Expertise”
This is one of the AMIE demonstration projects, a Google DeepMind output in which an additional layer of “conversational” training has been superimposed upon the underlying model to improve responses to consultation-style questions. In this example, the conversational training is tuned to match the response style of genetic cardiology specialists from Stanford University – and the LLM content and style were arguably rated identically to those of the human specialists.
https://arxiv.org/abs/2410.03741

“The Limits of Clinician Vigilance as an AI Safety Bulwark”
A nice little commentary effectively articulating the elephant in the room: humans are terrible at “vigilance”. The present state of AI/LLM deployment in clinical medicine is replete with omissions and inaccuracies, and it’s not reasonable to simply trust clinicians to catch the mistakes. That said, the five suggested strategies to address vigilance seem … precarious.
https://jamanetwork.com/journals/jama/fullarticle/2816582

“Reviewer Experience Detecting and Judging Human Versus Artificial Intelligence Content: The Stroke Journal Essay Contest”
Who wins an essay competition on controversial topics in stroke and neurology – humans or LLMs? And can reviewers accurately guess which essays are human- versus LLM-generated? The answer, rather distressingly, is that reviewers mostly couldn’t distinguish between author types, and LLM composition quality was arguably higher than human.
https://pubmed.ncbi.nlm.nih.gov/39224979/

“Invasive Treatment Strategy for Older Patients with Myocardial Infarction”
The amusingly named “SENIOR-RITA” trial, in which elderly patients with NSTEMI were randomized to an invasive strategy versus a medical management strategy. While this may seem odd to those in the U.S., frailty and baseline life expectancy are typical considerations for acute care in other countries. In these frail elderly patients, invasive strategies reduced downstream non-fatal MI, but had no effect on cardiovascular death or the overall composite outcome.
https://pubmed.ncbi.nlm.nih.gov/39225274/

“Thrombolysis After Dabigatran Reversal for Acute Ischemic Stroke: A National Registry-Based Study and Meta-Analysis”
Just another bit from our neurology colleagues patting themselves on the back for doing backflips in order to give everyone thrombolysis. Registry data paired with a garbage-in-garbage-out systematic review and meta-analysis just amplifies the biases prevalent in the underlying practice culture.
https://pubmed.ncbi.nlm.nih.gov/39255429/

“Tenecteplase vs Alteplase for Patients With Acute Ischemic Stroke: The ORIGINAL Randomized Clinical Trial”
This trial makes a claim to be ORIGINAL, but at this point there’s really no general question remaining as to whether tenecteplase is a valid alternative to alteplase. It is reasonable to test specific doses in various populations with a goal of minimizing harms, of course.
https://pubmed.ncbi.nlm.nih.gov/39264623/

“Technology-Supported Self-Triage Decision Making: A Mixed-Methods Study”
Can laypersons improve their self-triage decision-making with the use of technology? This little preprint tests the Ada Health “symptom checker” app against interaction with a ChatGPT LLM and favors use of the symptom checker. Hardly rigorous enough to discard the chatbot as a possible tool, but certainly the LLM needs more prompt engineering and/or domain-specific training than just “out of the box”.
https://www.medrxiv.org/content/10.1101/2024.09.12.24313558v1

“Class I Recalls of Cardiovascular Devices Between 2013 and 2022: A Cross-Sectional Analysis”
A brief report looking at recalled cardiovascular devices is insufficient to support any broad conclusions, but it certainly demonstrates the regulatory bar for approval is inadequate. Most recalled devices did not require pre-market clinical testing, and those that did used surrogate or composite endpoints to support approval.
https://pubmed.ncbi.nlm.nih.gov/39284187/

“Effectiveness of Direct Admission Compared to Admission Through the Emergency Department: A Stepped-Wedge Cluster-Randomized Trial”
A bit of an odd question: whether a patient with a known requirement for admission needs to stop through the emergency department, or whether the patient can go straight to the ward. Unless a patient requires immediate resuscitation with the resources of an ED, it is very clearly appropriate for a patient to be directly admitted. That said, doing so requires new processes and practices – and this trial demonstrates such processes are feasible and safe (and almost certainly cheaper, avoiding ED billing!).
https://pubmed.ncbi.nlm.nih.gov/39301600/

“Restrictive vs Liberal Transfusion Strategy in Patients With Acute Brain Injury: The TRAIN Randomized Clinical Trial”
The optimal cut-off is not known, but in this trial the threshold for transfusion was 9 g/dL – and patients randomized to the liberal transfusion strategy did better. An injured brain does not like hypotension, and it does not like anemia.
https://pubmed.ncbi.nlm.nih.gov/39382241/

Let ChatGPT Guide Your Hand

This exploration of LLMs in the emergency department is a bit unusual in its conceptualization. While most demonstrations of generative AI applied to the ED involve summarization of records, digital scribing, or composing discharge letters, this one attempts clinical decision-support. That is to say, rather than attempting to streamline or deburden clinicians from some otherwise time-intensive task, the LLM here taps into its ability to act as a generalized prediction engine – and tries its hand at prescriptive recommendations.

Specifically, the LLM – here GPT-3.5T and GPT-4T – is asked:

  • Should this patient be admitted to the hospital?
  • Does this patient require radiologic investigations?
  • Does this patient require antibiotics?
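
As an aside for orientation, here’s a minimal sketch of how such a prompt might be assembled from the note sections and questions described here – the section variables, system instruction, model name, and OpenAI client usage are illustrative assumptions, not the authors’ published pipeline:

    # Sketch of a decision-support prompt built from the physician note sections;
    # the system instruction, model name, and client usage are assumptions for
    # illustration, not the study's actual implementation.
    from openai import OpenAI

    QUESTIONS = [
        "Should this patient be admitted to the hospital?",
        "Does this patient require radiologic investigations?",
        "Does this patient require antibiotics?",
    ]

    def build_messages(clinical_history: str, examination: str,
                       assessment_plan: str, question: str) -> list[dict]:
        # Package the three note sections plus one yes/no question for a chat model.
        note = (
            f"Clinical history:\n{clinical_history}\n\n"
            f"Examination:\n{examination}\n\n"
            f"Assessment/plan:\n{assessment_plan}"
        )
        return [
            {"role": "system",
             "content": "You are an emergency department decision-support assistant. "
                        "Answer 'yes' or 'no' with a one-sentence rationale."},
            {"role": "user", "content": f"{note}\n\nQuestion: {question}"},
        ]

    def ask(model: str, messages: list[dict]) -> str:
        # Requires an OPENAI_API_KEY in the environment; any chat-completion
        # endpoint could be substituted.
        client = OpenAI()
        response = client.chat.completions.create(model=model, messages=messages, temperature=0)
        return response.choices[0].message.content

The retrospective rephrasing discussed below (“did this patient get admitted?”) would only require swapping the question strings.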

Considering we’ve seen general LLMs perform admirably on various medical licensing examinations, ought not these tools be able to get the meat off the bone in real life?

Before even considering the results, there are multiple fundamental considerations relegating this published exploration to the realm of curiosity rather than insight:

  • This communication was submitted in Oct 2023 – meaning the LLMs used, while modern at the time, are debatably becoming obsolete. Likewise, the prompting methods are a bit simplistic and anachronistic – evidence has shown the advantage of carefully constructed retrieval-augmented instructions.
  • The LLM was fed solely physician clinical notes – specifically the “clinical history”, “examination”, and “assessment/plan”. The LLM was therefore generating responses based on, effectively, an isolated completed medical assessment of a patient. This method excludes other data present in the record (vital signs, laboratory results, etc.), while also relying upon finished human documentation for its “decision-support”.
  • The prompts – “should”/”does” – replicate the intent of the decision-support concept of the exploration, but not the retrospective nature of the content. Effectively, what ought to have been asked of the LLMs – and the clinician reviewers – was “did this patient get admitted to the hospital?” or “did this patient receive antibiotics?” It would be mildly interesting to shift the question away from a somewhat subjective value judgement to a bit of an intent inference exercise.
  • The clinician reviewers – one resident physician and one attending physician – did not much agree (73-83% agreement) on admission, radiology, and antibiotic determinations. It becomes very difficult to evaluate any sort of predictive or prescriptive intervention when the “gold standard” is so diaphanous. There is truly no accepted “gold standard” for these sorts of questions, as individual clinician admission rates and variations in practice are exceedingly wide. This is evidenced by the general inaccuracy displayed by just these two clinicians, whose own individual accuracy ranged from 74-83%, on top of that poor agreement.

Now, after scratching the tip of the methodology and translation iceberg, the results: unusable.

GPT-4T, as to be expected, outperformed GPT-3.5T. But, regardless of the LLM prompted, there were clear patterns of inadequacy. Each LLM was quite sensitive in its prescription of admission or radiologic evaluation – but at the extreme sacrifice of specificity, with “false positives” nearly equaling the “true positives” in some cases. The reverse was true for antibiotic prescription, with a substantial drop in sensitivity but improved specificity. For what it’s worth, of course, U.S. emergency departments are generally cesspools of aggressive empiric antibiotic coverage, driven by CMS regulations – so it may in fact be the LLM displaying astute clinical judgement, here. The “sepsis measure fallout gestapo” might disagree, however.
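
To make that trade-off concrete, here is a toy calculation with invented confusion-matrix counts – emphatically not this study’s numbers – showing how an “admit nearly everyone” recommendation keeps sensitivity high while specificity and positive predictive value collapse:

    # Toy sensitivity/specificity calculation with invented counts (NOT the
    # study's data), illustrating the over-triggering pattern described above.
    def sensitivity(tp: int, fn: int) -> float:
        return tp / (tp + fn)

    def specificity(tn: int, fp: int) -> float:
        return tn / (tn + fp)

    def ppv(tp: int, fp: int) -> float:
        return tp / (tp + fp)

    # Hypothetical "should this patient be admitted?" predictions.
    tp, fp, fn, tn = 90, 80, 10, 20
    print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # 0.90 - catches nearly all true admissions
    print(f"specificity = {specificity(tn, fp):.2f}")  # 0.20 - flags most non-admissions too
    print(f"PPV         = {ppv(tp, fp):.2f}")          # 0.53 - barely better than a coin flip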

I can envision a future in which this approach is not entirely hopeless. The increasing use of LLM digital scribes is likely to improve the early data available to such predictive or prescriptive models. Other structured clinical data collected by electronic systems may be incorporated. Likewise, there are other clinical notes of potential value available, including nursing and triage documentation. I don’t find this to be a dead-end idea, at all, but the limitations of this exploration don’t shed much light except to direct future efforts.

“Evaluating the use of large language models to provide clinical recommendations in the Emergency Department”
https://www.nature.com/articles/s41467-024-52415-1