What are Children’s Lives Worth (to Save)?

This article regarding the cost of upgrading emergency departments to be “ready” for sick children has been bouncing around in the background since its publication, with some initial lay press coverage.

The general concept here is obviously laudable and the culmination of at least a decade of hard work from these authors and the team involved – with the ultimate goal of ensuring each emergency department in the country is capable of caring for critically unwell children. This most recent publication builds upon their prior work to, effectively, estimate the overall cost (~$200M) of improving “pediatric readiness” nationwide. They then translate that total into humanizing terms: the cost per child it might require in each state, and the number of pediatric lives saved annually.

As can be readily gleaned from this sort of thought experiment, these estimates rely upon a nested set of foundational assumptions, all of which are touched upon in this group’s prior work. There are surveys of subsets of emergency departments regarding “readiness”, with questions covering items such as the presence of pediatric-sized airway devices and staff dedicated to the upkeep of various elements of pediatric support. These survey data, combined with salary estimates, generate the institutional costs of readiness. A separate line of work examines the increased odds of poor outcomes at departments whose “readiness” falls in the lowest percentiles, and that work is extrapolated to estimate the lives saved.

Each of these pieces of work is, in isolation, reasonable, but stacked together they form a bit of a house of cards: the likelihood of imprecision is magnified as the estimates are combined. For example, how direct is the correlation between “readiness”, as measured by the presence of certain equipment, and pediatric survival if the ED in question is a critical access hospital with a low annual census? Is the cost of true clinical readiness just a fraction of a nursing FTE, or should it realistically include the costs of maintaining the skills of nurses and physicians through education or simulation?
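To make that compounding concrete, here is a minimal Monte Carlo sketch in Python. Every distribution and constant in it is an illustrative placeholder I have invented for the example – none of these figures come from the paper – but it shows how uncertainty in each link of the chain (readiness cost per ED, the readiness-to-mortality odds ratio, baseline deaths) widens the spread of the final “cost per life saved” estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # number of simulated scenarios

# Illustrative placeholder distributions -- NOT values from the paper.
cost_per_ed = rng.normal(50_000, 15_000, n)    # annual readiness cost per ED ($)
n_eds = 4_000                                  # hypothetical number of EDs targeted
odds_ratio = rng.normal(0.65, 0.10, n)         # mortality OR, high vs. low readiness
baseline_deaths = rng.normal(1_500, 300, n)    # hypothetical annual pediatric deaths

total_cost = cost_per_ed * n_eds
lives_saved = baseline_deaths * (1 - odds_ratio)  # crude risk-difference approximation
cost_per_life = total_cost / lives_saved

for label, x in [("total cost ($M)", total_cost / 1e6),
                 ("lives saved", lives_saved),
                 ("cost per life saved ($M)", cost_per_life / 1e6)]:
    lo, mid, hi = np.percentile(x, [2.5, 50, 97.5])
    print(f"{label}: median {mid:,.1f} (95% interval {lo:,.1f} to {hi:,.1f})")
```

The specific numbers are beside the point; what the simulation illustrates is that the interval around the final ratio is wider than the interval around any single input, which is precisely the sense in which imprecision is magnified when estimates are nested.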

I suspect, overall, these data understate the costs and overstate the return on investment. That said, this is still critical work even just to describe the landscape and take a stab at the scope of funding required. Likely, the best next step would be to target specific profiles of institutions, and specific types of investment, where such investment is likely to have the highest yield – as a first step on the journey towards universal readiness.

“State and National Estimates of the Cost of Emergency Department Pediatric Readiness and Lives Saved”
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825748

Get the “Human-Out-Of-The-Way”

It is clear LLMs have an uncanny ability to find associations between salient features, and to subsequently use those associations to generate accurate probabilistic lists of medical diagnoses. On top of that, they can take those diagnoses and use the same probabilistic functions to mimic the explanations seen in their training sets. A powerful tool – clearly an asset to patient care?

To test the LLM’s ability as a clinical augment, these researchers divvied up a collection of mostly internal medicine and emergency medicine attendings and residents. On one side, clinicians were asked to assess a set of clinical vignettes with only their usual reference tools. On the other side, clinicians were also allowed access to GPT-4. The assessment associated with each vignette asked participants to rank their top three potential diagnoses, to justify their thinking, and to plan the follow-up diagnostic evaluation.

At the end of the day – no advantage to GPT-4. Clinicians using standard reference materials scored a median of 74% per assessment while those with access to GPT-4 scored a median of 76%. Clinicians using the LLM were generally a few seconds faster.

So – perhaps a narrow win for clinical deployment of LLMs? A little faster and no worse? Lessons to be learned regarding how clinicians might integrate their use into care? A clue we need to develop better domain-specific LLM training for best results?

Tucked away in the text, almost as an offhand remark, the authors also report testing the LLM alone – and it scored 92% without a “human-in-the-loop”. A crushing blow.

Curiously, much of the early discussion regarding mitigating the potential harms of LLMs relies upon “human-in-the-loop” vigilance – but instances may increasingly be found in which human involvement is deleterious. Evaluating a clinical augment for implementation might therefore require more than comparing “baseline performance” against “human-in-the-loop”; it may also warrant a further line of evaluation, “human-out-of-the-way”, in which the augmentation tool operates relatively unhindered.
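As a rough illustration of what that extra arm could look like in an evaluation harness, the sketch below tallies the same vignette-level score across three conditions: clinician alone, clinician plus LLM, and LLM alone. The callables `clinician_answer`, `llm_answer`, and `score_case` are hypothetical stand-ins of my own, not anything from the study – the only point is that the “human-out-of-the-way” arm is scored with exactly the same rubric as the other two.

```python
from statistics import median

# Hypothetical stand-ins: each answer function returns a structured response for a
# vignette, and score_case() applies the same grading rubric to every arm.
def evaluate(vignettes, clinician_answer, llm_answer, score_case):
    arms = {
        "baseline (clinician alone)": lambda v: clinician_answer(v, llm_help=False),
        "human-in-the-loop (clinician + LLM)": lambda v: clinician_answer(v, llm_help=True),
        "human-out-of-the-way (LLM alone)": lambda v: llm_answer(v),
    }
    results = {}
    for name, answer in arms.items():
        scores = [score_case(v, answer(v)) for v in vignettes]
        results[name] = median(scores)
    return results
```

Structured this way, the LLM-alone result the authors mention almost in passing becomes a first-class outcome of the comparison rather than a footnote.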

“Large Language Model Influence on Diagnostic Reasoning”
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825395