Get the “Human-Out-Of-The-Way”

It is clear LLMs have an uncanny ability to find associations between salient features, and to subsequently use those associations to generate accurate probabilistic lists of medical diagnoses. On top of that, they can take those diagnoses and use the same probabilistic functions to mimic the explanations seen in their training sets. A powerful tool – clearly an asset to patient care?

To test the LLM's ability as a clinical augment, these researchers divvied up a collection of mostly internal medicine and emergency medicine attendings and residents. On one side, clinicians were asked to assess a set of clinical vignettes with only their usual reference tools. On the other, clinicians were additionally given access to GPT-4. The assessment accompanying the vignettes asked participants to rank their top three potential diagnoses, to justify their reasoning, and to propose a plan for further diagnostic evaluation.

At the end of the day – no advantage to GPT-4. Clinicians using standard reference materials scored a median of 74% per vignette, while those with access to GPT-4 scored a median of 76%. Clinicians using the LLM were also slightly faster per case.

So – perhaps a narrow win for clinical deployment of LLMs? A little faster and no worse? Lessons to be learned regarding how clinicians might best integrate these tools into care? A clue we need better domain-specific LLM training for best results?

Tucked away in the text, almost as an offhand remark, the authors also report testing the LLM alone – and the LLM scored a median of 92% without a “human-in-the-loop”. A crushing blow.

Curiously, much of the early discussion regarding mitigating the potential harms from LLMs relies upon “human-in-the-loop” vigilance – yet instances may increasingly be found in which human involvement is deleterious. Implementation of various clinical augments might therefore require more than comparing “baseline performance” against “human-in-the-loop” performance; a further line of evaluation may be needed for “human-out-of-the-way”, in which the augmentation tool operates relatively unhindered.

“Large Language Model Influence on Diagnostic Reasoning”
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825395