What are Children’s Lives Worth (to Save)?

This article regarding the cost of upgrading emergency departments to be “ready” for sick children has been bouncing around in the background since its publication, with some initial lay press coverage.

The general concept here is obviously laudable and the culmination of at least a decade of hard work from these authors and the team involved – with the ultimate goal of ensuring each emergency department in the country is capable of caring for critically unwell children. The gist of this most recent publication builds upon their prior work to, effectively, estimate the overall cost (~$200M) of improving “pediatric readiness”. Using that total cost, they then translate it into humanizing terms: the cost per child in different states, and the number of pediatric lives saved annually.

As can be readily gleaned from this sort of thought experiment, these estimates rely upon a nested set of foundational assumptions, each touched upon by prior work from this group. There are surveys of subsets of emergency departments regarding “readiness”, which ask about items such as the presence of pediatric-sized airway devices and staff dedicated to the upkeep of various pediatric resources. These data, combined with salary estimates, are then used to generate the institutional costs of readiness. Finally, a separate body of work examining the odds ratios for poor outcomes at departments whose “readiness” falls in the lowest percentiles is extrapolated to determine the lives saved.

Each of these pieces of work, in isolation, is reasonable, but stacked together they form a bit of a house of cards: the imprecision is magnified as the estimates are combined. For example, how direct is the correlation between “readiness” based on certain equipment and pediatric survival, if the ED in question is a critical access hospital with low annual census? Is the cost of true clinical readiness just a part-time FTE of a nurse, or should it realistically involve the costs of skill upkeep for nurses and physicians with education or simulation?
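
To make the compounding concrete, here is a minimal Monte Carlo sketch – every range below is a hypothetical placeholder, not a figure from the paper – illustrating how modest uncertainty in each nested input balloons into a very wide range for the headline cost-per-life-saved:

```python
import random

random.seed(0)

def draw(low, high):
    """Sample one plausible value from a flat (uniform) range."""
    return random.uniform(low, high)

cost_per_life = []
for _ in range(10_000):
    cost_per_site = draw(80_000, 250_000)   # hypothetical readiness cost per ED (USD)
    n_sites = draw(3_000, 4_500)            # hypothetical number of EDs needing upgrades
    baseline_deaths = draw(1_000, 2_500)    # hypothetical annual pediatric deaths in scope
    mortality_benefit = draw(0.10, 0.40)    # hypothetical relative reduction from "readiness"

    total_cost = cost_per_site * n_sites
    lives_saved = baseline_deaths * mortality_benefit
    cost_per_life.append(total_cost / lives_saved)

cost_per_life.sort()
print(f"median cost per life saved: ${cost_per_life[5_000]:,.0f}")
print(f"90% interval: ${cost_per_life[500]:,.0f} to ${cost_per_life[9_500]:,.0f}")
```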

I suspect, overall, these data understate the costs and overstate the return on investment. That said, this is still critical work even just to describe the landscape and take a stab at the scope of funding required. Likely, the best next step would be to target specific profiles of institutions, and specific types of investment, where such investment is likely to have the highest yield – as a first step on the journey towards universal readiness.

“State and National Estimates of the Cost of Emergency Department Pediatric Readiness and Lives Saved”
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825748

Get the “Human-Out-Of-The-Way”

It is clear LLMs have an uncanny ability to find associations between salient features, and to subsequently use those associations to generate accurate probabilistic lists of medical diagnoses. On top of that, they can take those diagnoses and use the same probabilistic machinery to mimic the explanations seen in their training sets. A powerful tool – clearly an asset to patient care?

Testing its ability as a clinical augment, these researchers divvied up a collection of mostly internal medicine and emergency medicine attendings and residents. On one side, clinicians were asked to assess a set of clinical vignettes with only their usual reference tools. On the other side, clinicians were also allowed to access GPT-4. The assessment associated with the clinical vignettes asked participants to rank their top three potential diagnoses, to justify their thinking, and to make a plan for follow-up diagnostic evaluation.

At the end of the day – no advantage to GPT-4. Clinicians using standard reference materials scored a median of 74% per assessment while those with access to GPT-4 scored a median of 76%. Clinicians using the LLM were generally a few seconds faster.

So – perhaps a narrow win for clinical deployment of LLMs? A little faster and no worse? Lessons to be learned regarding how clinicians might integrate their use into care? A clue we need to develop better domain-specific LLM training for best results?

Tucked away in the text as almost an offhand remark, the authors also tested the LLM alone – and the LLM scored 92% without a “human-in-the-loop”. A crushing blow.

Curiously, much of the early discussion regarding mitigating the potential harms from LLMs relies upon “human-in-the-loop” vigilance – but instances may increasingly be found in which human involvement is deleterious. Implementation of various clinical augments might therefore require not just comparing “baseline performance” versus “human-in-the-loop”, but adding a further line of evaluation for “human-out-of-the-way”, where the augmentation tool is relatively unhindered.

“Large Language Model Influence on Diagnostic Reasoning”
https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2825395

The Hand that Feeds the Hand that Feeds the Hand that ….

Pharmaceutical development is all about the blockbuster drug. Many of our brightest minds are research scientists and bioinformaticians working to translate in vitro discoveries into improvements in the lives of humankind.

Many of our brightest minds are also working to ensure, even if their drug candidates are – say – a little flawed, there’s a clinical trial design, trial implementation, and post-approval marketing plan to maximize return to shareholders.

A critical link in this chain of survival is publication – the higher the impact of the journal, the better. One of the gatekeepers to publication remains peer review, a critical step in ensuring the integrity, transparency, and reproducibility of science. Naturally, this step would be independent and untainted by bias, a vigilant final guardian protecting the public.

It would … wouldn’t it?

This brief report from JAMA finds it would be absurd even to imagine such a fanciful state of affairs. Evaluating lists of peer reviewers from The BMJ, JAMA, The Lancet, and The New England Journal of Medicine from 2022, the authors analyzed 1,962 U.S.-based physician reviewers. The Open Payments database indicated 58.9% of these reviewers had received payments from industry, totaling USD$1.06 billion between 2020 and 2022. The vast majority – $1.01 billion – were payments supporting research activities, with a mere $60M going towards such things as consulting fees, speaker fees, and honoraria. Male reviewers and those in medical and surgical specialties, rather than primary care or hospital-based specialties, were the dominant recipients of said payments.

While these data are not meant to illuminate some sort of dark money ecosystem, it is clear the “peers” doing the reviews are playing at the same game. There is going to be an obvious bias towards allowing publication of content and spin consistent with the output the reviewers themselves would anticipate using in their own work. A receptive audience, if you would.

Just another happy reminder how so much of our medical practice is swept along in a current powered by many moneyed forces at work.

“Payments by Drug and Medical Device Manufacturers to US Peer Reviewers of Major Medical Journals”
https://jamanetwork.com/journals/jama/article-abstract/2824834

Publication Potpourri

Clearing the backlog of mildly interesting articles that will never get a full write-up – here’s a quick hit of the most interesting lucky 13!

“Lactated Ringer vs Normal Saline Solution During Sickle Cell Vaso-Occlusive Episodes”
A “target trial emulation” providing observational evidence supporting the superiority of lactated Ringer’s solution over normal saline for the resuscitation of patients admitted with sickle cell crises. However, only a small fraction of patients actually received LR, and those who did received smaller volumes overall. Frankly, it’s hard to separate these observations from the general concern that euvolemic SCD patients are simply receiving far too much fluid during admission.
https://pubmed.ncbi.nlm.nih.gov/39250114/

“Empathy and clarity in GPT-4-Generated Emergency Department Discharge Letters”
The cold, unfeeling computer generates better feels than emergency clinicians. Bland treacle, no doubt, but generic bland treacle will beat the terse output of a distracted human any day. The question now is how to combine the two to create a better human, where suitable.
https://www.medrxiv.org/content/10.1101/2024.10.07.24315034v1

“Mitigating Hallucinations in Large Language Models: A Comparative Study of RAG-enhanced vs. Human-Generated Medical Templates”
Study number n-million demonstrating retrieval-augmented generation – that is to say, grounding the LLM’s prompts with retrieved reference material – improves output and reduces “hallucinations”. In this specific instance, the LLM was effectively mimicking the output of BMJ Best Practice templates.
https://www.medrxiv.org/content/10.1101/2024.09.27.24314506v1

“Towards Democratization of Subspeciality Medical Expertise”
This is one of the AMIE demonstration projects, a Google DeepMind output in which an additional layer of “conversational” training has been superimposed upon the underlying model to improve responses to consultation-style questions. In this example, the conversational training is tuned to match the response style of genetic cardiology specialists from Stanford University – and the LLM’s content and style were arguably rated identically to those of the human specialists.
https://arxiv.org/abs/2410.03741

“The Limits of Clinician Vigilance as an AI Safety Bulwark”
A nice little commentary effectively articulating the elephant in the room: humans are terrible at “vigilance”. The present state of AI/LLM deployment in clinical medicine is replete with omissions and inaccuracies, and it’s not reasonable to simply trust clinicians to catch the mistakes. That said, the five suggested strategies to address vigilance seem … precarious.
https://jamanetwork.com/journals/jama/fullarticle/2816582

“Reviewer Experience Detecting and Judging Human Versus Artificial Intelligence Content: The Stroke Journal Essay Contest”
Who wins an essay competition on controversial topics in stroke and neurology – humans or LLMs? And can reviewers accurately guess which essays are human- versus LLM-generated? The answer, rather distressingly, is that reviewers mostly couldn’t distinguish between author types, and LLM composition quality was arguably higher than human.
https://pubmed.ncbi.nlm.nih.gov/39224979/

“Invasive Treatment Strategy for Older Patients with Myocardial Infarction”
The amusingly named “SENIOR-RITA” trial, in which elderly patients with NSTEMI were randomized to an invasive strategy versus a medical management strategy. While this may seem odd to those in the U.S., frailty and baseline life expectancy are typical considerations for acute care in other countries. In these frail elderly patients, the invasive strategy reduced downstream non-fatal MI, but had no effect on cardiovascular death or the overall composite outcome.
https://pubmed.ncbi.nlm.nih.gov/39225274/

“Thrombolysis After Dabigatran Reversal for Acute Ischemic Stroke: A National Registry-Based Study and Meta-Analysis”
Just another bit from our neurology colleagues patting themselves on the back for doing backflips in order to give everyone thrombolysis. Registry data paired with a garbage-in-garbage-out systematic review and meta-analysis just amplifies the biases prevalent in the underlying practice culture.
https://pubmed.ncbi.nlm.nih.gov/39255429/

“Tenecteplase vs Alteplase for Patients With Acute Ischemic Stroke: The ORIGINAL Randomized Clinical Trial”
This trial makes a claim to be ORIGINAL, but at this point there’s really no general question remaining as to whether tenecteplase is a valid alternative to alteplase. It is reasonable to test specific doses in various populations with a goal of minimizing harms, of course.
https://pubmed.ncbi.nlm.nih.gov/39264623/

“Technology-Supported Self-Triage Decision Making: A Mixed-Methods Study”
Can laypersons improve their self-triage decision-making with the use of technology? This little preprint tests the Ada Health “symptom checker” app against interaction with a ChatGPT LLM and favors use of the symptom checker. Hardly rigorous enough to discard the chatbot as a possible tool, but the LLM certainly needs more prompt engineering and/or domain-specific training than simply being used “out of the box”.
https://www.medrxiv.org/content/10.1101/2024.09.12.24313558v1

“Class I Recalls of Cardiovascular Devices Between 2013 and 2022: A Cross-Sectional Analysis”
A brief report looking at recalled cardiovascular devices is insufficient to make any broad conclusions, but certainly demonstrates the regulatory bar for approval is inadequate. Most did not require pre-market clinical testing, and those that did used surrogate or composite endpoints to support approval.
https://pubmed.ncbi.nlm.nih.gov/39284187/

“Effectiveness of Direct Admission Compared to Admission Through the Emergency Department: A Stepped-Wedge Cluster-Randomized Trial”
A bit of an odd question: whether a patient with a known requirement for admission needs to pass through the emergency department, or whether the patient can go straight to the ward. Unless a patient requires immediate resuscitation with the resources of an ED, it is very clearly appropriate for a patient to be directly admitted. That said, doing so requires new processes and practices – and this trial demonstrates such processes are feasible and safe (and almost certainly cheaper, avoiding ED billing!)
https://pubmed.ncbi.nlm.nih.gov/39301600/

“Restrictive vs Liberal Transfusion Strategy in Patients With Acute Brain Injury: The TRAIN Randomized Clinical Trial”
The optimal cut-off is not known, but in this trial the threshold for transfusion was 9 g/dL – and patients randomized to the liberal transfusion strategy did better. An injured brain does not like hypotension, and it does not like anemia.
https://pubmed.ncbi.nlm.nih.gov/39382241/

Let ChatGPT Guide Your Hand

This exploration of LLMs in the emergency department is somewhat novel in its conceptualization. While most demonstrations of generative AI applied to the ED involve summarization of records, digital scribing, or composing discharge letters, this one attempts clinical decision-support. That is to say, rather than attempting to streamline or deburden clinicians from some otherwise time-intensive task, the LLM here taps into its ability to act as a generalized prediction engine – and tries its hand at prescriptive recommendations.

Specifically, the LLM – here GPT-3.5T and GPT-4T – is asked:

  • Should this patient be admitted to the hospital?
  • Does this patient require radiologic investigations?
  • Does this patient require antibiotics?

Considering we’ve seen general LLMs perform admirably on various medical licensing examinations, ought not these tools be able to get the meat off the bone in real life?

Before even considering the results, there are multiple fundamental considerations relegating this published exploration to the realm of curiosity rather than insight:

  • This communication was submitted in October 2023 – meaning the LLMs used, while modern at the time, are arguably approaching obsolescence. Likewise, the prompting methods are simplistic and a bit anachronistic – evidence has since shown an advantage to carefully constructed retrieval-augmented instructions.
  • The LLM was fed solely physician clinical notes – specifically the “clinical history”, “examination”, and “assessment/plan”. The LLM was therefore generating responses based on, effectively, an isolated completed medical assessment of a patient. This method excludes other data present in the record (vital signs, laboratory results, etc.), while also relying upon finished human documentation for its “decision-support”.
  • The prompts – “should”/”does” – replicate the intent of the decision-support concept of the exploration, but not the retrospective nature of the content. Effectively, what ought to have been asked of the LLMs – and the clinician reviewers – was “did this patient get admitted to the hospital?” or “did this patient receive antibiotics?” It would be mildly interesting to shift the question away from a somewhat subjective value judgement to a bit of an intent inference exercise.
  • The clinician reviewers – one resident physician and one attending physician – did not much agree (73-83% agreement) on admission, radiology, and antibiotic determinations. It becomes very difficult to evaluate any sort of predictive or prescriptive intervention when the “gold standard” is so diaphanous; as the sketch following this list illustrates, a noisy reference standard caps the accuracy any model can appear to achieve. There is truly no accepted “gold standard” for these sorts of questions, as individual clinician admission rates and variations in practice are exceedingly wide – evidenced here by these two clinicians’ own individual accuracy of just 74-83%, on top of that poor agreement.
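
As flagged in that last item, a shaky reference standard by itself bounds what any model can appear to achieve. A minimal sketch, assuming binary decisions and independent errors (both simplifications), of how observed agreement is capped by reviewer accuracy:

```python
def observed_agreement(model_accuracy: float, reference_accuracy: float) -> float:
    """For a binary call with independent errors, the model and the reference
    standard 'agree' when both are right or both are wrong."""
    p, q = model_accuracy, reference_accuracy
    return p * q + (1 - p) * (1 - q)

# Even a hypothetically perfect model only *appears* ~78% accurate when scored
# against a reviewer who is themselves right 78% of the time.
for model_accuracy in (1.00, 0.90, 0.80):
    print(f"true model accuracy {model_accuracy:.0%} vs 78% reference -> "
          f"observed {observed_agreement(model_accuracy, 0.78):.0%}")
```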

Now, having merely scratched the surface of the methodology and translation iceberg, the results: unusable.

GPT-4T, as to be expected, outperformed GPT-3.5T. But, regardless of the LLM prompted, there were clear patterns of inadequacy. Each LLM was quite sensitive in its prescription of admission or radiologic evaluation – but at the extreme sacrifice of specificity, with “false positives” nearly equalling the “true positives” in some cases. The reverse was true for antibiotic prescription, with a substantial drop in sensitivity but improved specificity. For what it’s worth, of course, U.S. emergency departments are general cesspools of aggressive empiric antibiotic coverage, driven by CMS regulations – so it may in fact be the LLM displaying astute clinical judgement here. The “sepsis measure fallout gestapo” might disagree, however.
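
To put rough numbers on that trade-off, here is a quick sketch using purely hypothetical confusion-matrix counts (not figures from the paper) showing how sensitivity without specificity leaves nearly as many false positives as true positives – and vice versa:

```python
def summarize(label: str, tp: int, fp: int, fn: int, tn: int) -> None:
    """Print sensitivity, specificity, and positive predictive value."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    print(f"{label}: sensitivity {sensitivity:.0%}, specificity {specificity:.0%}, "
          f"PPV {ppv:.0%} ({fp} false positives vs {tp} true positives)")

# Hypothetical "recommend admission" pattern: sensitive, but unspecific.
summarize("admission  ", tp=90, fp=80, fn=10, tn=70)
# Hypothetical "recommend antibiotics" pattern: the reverse trade-off.
summarize("antibiotics", tp=40, fp=15, fn=60, tn=135)
```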

I can envision a version of this approach that is not entirely hopeless. The increasing use of LLM digital scribes is likely to improve the early data available to such predictive or prescriptive models. Other structured clinical data collected by electronic systems may be incorporated. Likewise, there are other clinical notes of potential value, including nursing and triage documentation. I hardly find this to be a dead-end idea at all, but the limitations of this exploration don’t shed much light except to direct future efforts.

“Evaluating the use of large language models to provide clinical recommendations in the Emergency Department”
https://www.nature.com/articles/s41467-024-52415-1

BIPAP IPAP: Higher is Better?

The cornerstone of treatment for severe exacerbations of chronic obstructive pulmonary disease remains non-invasive positive pressure ventilation. Typically, this involves bi-level positive pressure settings, preventing alveolar collapse while assisting with inspiration and gas exchange. This works – most of the time. When it doesn’t work – endotracheal intubation.

This trial, the HAPPEN trial, looks at a slightly different approach. If typical BIPAP settings aren’t working, why not just IPAP harder?

In this unblinded, randomized controlled trial, patients with acute exacerbations of COPD received traditional NIPPV with inspiratory pressures <18 cmH2O or “high-intensity” NIPPV with inspiratory pressures titrated up to 20-30 cmH2O. The equipoise for this trial comes from a body of literature in which the “high-intensity” paradigm has been used to improve respiratory physiology on an intermittent, outpatient basis. Extending these data to the acute setting, the authors aimed to use “high-intensity” NIPPV in an attempt to decrease rates of endotracheal intubation.

Per the authors’ report, the trial was a “success” – indeed, such a “success” it was stopped early for benefit. Among the ~150 patients in each arm, those randomized to “high-intensity” NIPPV (IPAP settings of ~25 cmH2O) had a lower rate of the primary outcome of “need for endotracheal intubation” – 4.8% versus 13.7% in the “low-intensity” cohort (IPAP settings of ~18 cmH2O). These observations were durable in both unadjusted and adjusted analyses, as well as across most pre-specified subgroups.

However, “need for endotracheal intubation” was not the same as “received endotracheal intubation” – the former was a composite of worsening acidosis, worsening clinical status, or respiratory arrest. In actuality, the proportion of patients actually intubated was virtually identical – 3.4% and 3.9% in the two groups. Crossover from “low-intensity” to “high-intensity” was permitted, however, and may have reduced the number of “low-intensity” patients requiring intubation, though the unblinded nature of the trial confounds interpretation of those crossover events. There were also more safety and adverse outcomes in the “high-intensity” cohort, with “abdominal distension” and intolerance of NIPPV the most frequent safety excesses, and “severe alkalosis” the most frequent “serious adverse event”.
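
Worked out as absolute differences from the rates quoted above, the gap between the composite endpoint and the patient-oriented outcome is stark:

```python
def arr_and_nnt(rate_a: float, rate_b: float) -> tuple[float, float]:
    """Absolute risk reduction and number needed to treat for two event rates."""
    arr = abs(rate_a - rate_b)
    return arr, 1 / arr

# Composite "need for endotracheal intubation": 13.7% vs 4.8%.
arr, nnt = arr_and_nnt(0.137, 0.048)
print(f"composite endpoint: absolute difference {arr:.1%}, NNT ~{nnt:.0f}")

# Actual endotracheal intubation: 3.9% vs 3.4%.
arr, nnt = arr_and_nnt(0.039, 0.034)
print(f"actual intubation:  absolute difference {arr:.1%}, NNT ~{nnt:.0f}")
```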

While this seems an interesting and plausible idea for treating COPD, this trial has almost zero applicability to the emergency department. Patients enrolled were effectively “subacute” COPD inpatients: their median length of illness was 6 days at enrollment, they had already been on “low-intensity” NIPPV for at least six hours, and at randomization they had a near-normal mean pH of 7.31 and a mean respiratory rate of 22. These were, effectively, stable-but-not-fully-improved inpatients at small risk for further deterioration, not the acute respiratory distress seen in the ED.

This is still an interesting idea, though, and there is likely equipoise to test NIPPV with higher inspiratory pressures in the ED. Tidal volumes, dyspnea scores, and accessory muscle use were all superior at higher IPAP settings, and may confer benefit in the acute setting. That said, I would not implement this practice without sufficient evidence from an ED setting, as the signals of IPAP intolerance may be as problematic in a patient-oriented sense as the perceived physiologic advantages.

“Effect of High-Intensity vs Low-Intensity Noninvasive Positive Pressure Ventilation on the Need for Endotracheal Intubation in Patients With an Acute Exacerbation of Chronic Obstructive Pulmonary Disease”
https://jamanetwork.com/journals/jama/article-abstract/2823763

All the Pregnant Men in the EHR

Electronic health records, data warehouses, and data “lakes” are treasured resources in this modern era of model training. Various applications of precision medicine, “digital twins”, and other predictive mimicries depend on having the cleanest, most-accurate data feasible.

One of these data sets is “All of Us”, maintained by the National Institutes of Health. Considering its wide use, the authors ask a very reasonable question: how accurate is the information contained within? Since it is not possible to individually verify the vast scope of clinical information applied to each person included in the data set, these authors chose what ought to be a fairly reliable surrogate: with what frequency do male and female persons included in the data set have sex-discordant diagnoses?

The authors term their measure the “incongruence rate”, reflecting the frequency of sex-specific diagnoses incongruent with the recorded biological sex. The authors iteratively refined their list and sample set, ultimately settling on 167 sex-specific conditions where there ought to be very little ambiguity – mostly those related to pregnancy and disorders of female genitalia.

Rather amazingly, their overall finding was an “incongruence rate” of 0.86% – meaning nearly 1 in 100 of these sex-specific diagnoses were found on a person of the incorrect biological sex. For example, of 4,200 patients coded with a finding of testicular hypofunction, 44 (1.05%) were female. Or, of 2,101 coded for a finding of prolapse of female genital organs, 21 (1%) were male. The authors also performed further analyses exploring whether cis- or transgender misidentification was affecting these findings, and actually note the incongruence rate rose to 0.96%.
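
Those cited examples reduce to simple proportions – a trivial sketch reproducing the arithmetic:

```python
def incongruence_rate(discordant: int, total: int) -> float:
    """Fraction of a sex-specific diagnosis coded on the 'wrong' biological sex."""
    return discordant / total

examples = {
    "testicular hypofunction coded in female patients": (44, 4_200),
    "female genital organ prolapse coded in male patients": (21, 2_101),
}
for label, (discordant, total) in examples.items():
    print(f"{label}: {incongruence_rate(discordant, total):.2%}")
```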

Specifics regarding limitations or flaws in this approach aside, the key insight is that of widespread inaccuracies within electronic health data – and systematic approaches to diagnostic incongruence may be useful methods for data cleansing.

“Navigating electronic health record accuracy by examination of sex incongruent conditions”
https://pubmed.ncbi.nlm.nih.gov/39254529/

The End of Respiratory Season Hell?

Every year, we have our peak of respiratory viruses – traditionally influenza, respiratory syncytial virus, and their accompanying lesser demons. These are each awful, of course, in their own way from a patient- and parent-oriented standpoint, but they’re also quite awful at the population level, overburdening limited pediatric and emergency department resources. RSV, in particular, is a vicious scourge of young and vulnerable infants.

The story told here – sponsored by Sanofi – is one of nirsevimab, a monoclonal antibody featured last year in the NEJM. Nirsevimab is the evolution of palivizumab, previously approved and used as a multi-injection prophylaxis scheme for the highest-risk infants. The generally established advantages of nirsevimab over palivizumab are higher and more durable levels of neutralizing antibodies, requiring only a single injection rather than a multi-dose course. Nirsevimab has been recommended by the Advisory Committee on Immunization Practices for infants in the U.S. since August 2023.

These data give us a wee look at what happens to a country that adopts such a practice of wide immunization with nirsevimab – Spain! These authors compare the burden of lower respiratory tract infections and bronchiolitis admissions at 15 pediatric emergency departments across the 2018 to 2024 seasons, with 2023-24 being the first season in which nirsevimab was in wide use. Most regions used a strategy in which nirsevimab was provided to newborns during RSV season, as well as to other young infants born prior to the onset of the season. The two “COVID seasons” of 2020-21 and 2021-22 were excluded from the comparisons.

Generally speaking, the administration of nirsevimab diminished lower respiratory tract infection presentations, bronchiolitis presentations, and bronchiolitis admissions by approximately 60% compared with prior years. These reductions in bronchiolitis had the net effect of decreasing all presentations to the ED by about 20%. I suspect virtually every emergency department and PICU out there would prefer this sort of experience each winter.
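
As a back-of-the-envelope check – the roughly one-third share of ED visits implied below is my inference, not a figure reported by the authors – a 60% drop in bronchiolitis only translates into a ~20% drop in total volume if bronchiolitis accounts for about a third of presentations:

```python
def overall_reduction(bronchiolitis_share: float, bronchiolitis_drop: float) -> float:
    """Reduction in total ED volume when only bronchiolitis visits fall."""
    return bronchiolitis_share * bronchiolitis_drop

# Hypothetical shares of total ED visits attributable to bronchiolitis:
for share in (0.20, 0.33, 0.40):
    print(f"bronchiolitis share {share:.0%} -> "
          f"overall ED reduction {overall_reduction(share, 0.60):.0%}")
```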

The catch: nirsevimab costs ~USD$500 per dose. The initial ACIP cost-effectiveness evaluations assumed nirsevimab would be priced at ~$300 per dose, at which point it was considered cost-effective. Obviously, $500 is more than $300 – and thus which infants should be offered nirsevimab becomes a robust debate with many inputs, assumptions, and remaining uncertainties. The promise is certainly out there, however, of dramatically improved respiratory virus seasons for those working in the emergency department.

“Nirsevimab and Acute Bronchiolitis Episodes in Pediatric Emergency Departments”
https://publications.aap.org/pediatrics/article/doi/10.1542/peds.2024-066584/199339/Nirsevimab-and-Acute-Bronchiolitis-Episodes-in

In: Dexamethasone, Out: Prednisone

Move over ketamine and TXA, there’s another medication gradually approaching do-it-all darling status in Emergency Medicine: dexamethasone.

Sore throats?

Croup?

Headaches?

Non-specific aches?

Well, yes to all of the above, in the appropriate clinical context –

But, most prominently, as featured in this brief report, for asthma – particularly childhood asthma.

It’s NHAMCS – so these are representative data transformed into rough weighted estimates – but between 2010 and 2021, dexamethasone use increased from 3.5% of asthma visits to 17.3%, and the rates are more than double that in children.

A heartening trend towards a simpler administration and adherence strategy – non-inferior overall, while it remains entirely reasonable to judiciously select higher-risk patients in whom prescribing prednisone/prednisolone instead might be preferable.

“Trends in dexamethasone treatment for asthma in U.S. emergency departments”
https://onlinelibrary.wiley.com/doi/full/10.1111/acem.14997

The Andexxa Showpiece

Every so often a masterclass performance arises in the medical literature. A performance transcending the boundaries of what was once thought possible. A shining exemplar of human achievement.

This is a trial, published in the New England Journal of Medicine, with the following features:

  • Conducted by an institute sponsored by pharma.
  • Designed by the first author, a consultant for pharma, and two employees of pharma.
  • Written by a medical writer employed by pharma.
  • Replete with authors reporting multiple financial conflicts of interest with pharma.
  • Substantially modified trial procedures and outcomes two and three years into the trial.
  • Introduced an interim stopping rule whose analysis was performed by an unblinded statistician affiliated with the funded institute.
  • Stopped the trial early based on the new interim stopping rule.
  • Used a surrogate composite primary endpoint.
  • Allowed the “usual care” arm to include patients who did not receive an active treatment comparator.
  • Permitted discrepancies in the baseline characteristics favoring the experimental arm.

And these are solely the reported mechanisms by which pharma has placed its hands on the scales of this trial. It ought to be quite clear these procedures were carefully designed to ensure the (financial) success of this trial, and its ultimate publication is virtually an advertorial for the product in question.

The culprit this go-around? AstraZeneca née Alexion née Portola, for Andexxa – better known as “andexanet alfa” (even though the FDA declined their drug naming for this label, properly known as “Coagulation Factor Xa [Recombinant], Inactivated-zhzo”). The trial is ANNEXA-I, which purports to be a comparison between Andexxa and prothrombin complex concentrates.

As alluded to above, this trial was not designed to permit Andexxa to fail. With Andexxa sales climbing and approaching $200M annually, it is obviously impermissible to allow a trial to offer a hint of doubt – especially considering Portola/Alexion/AstraZeneca have been investing in “expert guidelines” aimed at elevating Andexxa above PCCs as first-line treatment for Factor Xa-associated bleeding.

So – naturally, Andexxa “succeeds”. On the composite endpoint of “good hemostatic efficacy” – hematoma volume change <35%, NIHSS change <7 points, and no use of rescue therapy between 3 and 12 hours – Andexxa outperformed “usual care”, 67.0% to 53.1%, a reported difference of 13.4%. The limiting sub-endpoint within this composite was hematoma volume change <35%. And, as the composite favors Andexxa, the trial was stopped early – and the favorable press releases roll in. Ideally, this is the point at which our sponsors would like us to stop further analysis and critique.

Interestingly, the main paper presents an efficacy analysis consisting of 452 patients. However, between the initiation of the interim analysis and the cessation of trial procedures, the authors enrolled an additional 78 patients. The authors report findings from all 530 in their safety analysis, but exclude them from the primary efficacy analysis – consigning the full-cohort analysis to a supplementary appendix. There is no obvious reason to do so – other than the fact that the larger cohort demonstrates less favorable results for Andexxa, with the hemostatic efficacy composite dropping from 67.0% to 63.9%. As is frequently cautioned regarding stopping trials early, doing so inflates the confidence intervals, diminishes the precision of the effect size estimate, and cuts short the natural regression to the mean.

Then, there are the trial procedures. Prior to a protocol amendment excluding subdural hematomas, the Andexxa group included 13 patients with SDH, as compared with only 4 in “usual care”. Subdural hematomas, generally speaking, have far less sinister an outcome than intracerebral hemorrhage – an imbalance favoring the Andexxa cohort. Then, bizarrely, only 85% of the “usual care” cohort received anticoagulation reversal using PCCs. Very little data is included regarding these 60 patients receiving “non-PCC” care at the discretion of their treating clinicians. What sort of selection bias led clinicians to withhold an active treatment for ICH? Without concrete data, it is impossible to do more than speculate, but it seems logical to theorize these patients must have been disadvantaged by their lack of treatment.

Next, there are The Downsides. Treatment with Andexxa very clearly causes increased arterial thrombotic events. Ischemic strokes occurred in 6.5% of those treated with Andexxa, as compared to 1.5% receiving “usual care”. Myocardial infarctions occurred in 4.2% of those treated with Andexxa, as compared to 1.5% of those receiving “usual care”. A smaller excess of pulmonary embolism was seen in the “usual care” arm, however.

Lastly, there are the patient-oriented outcomes. Naturally, with a trial stopped early due to a composite surrogate, the authors are quick to mention the trial is underpowered to evaluate these endpoints. However, the overall outcomes of patients included in this trial are grim – and they are more grim for those treated with Andexxa. At 30 days, only 28% of patients treated with Andexxa achieved a modified Rankin scale of 0 to 3, compared with 31% in the “usual care” cohort. Similarly, 27.8% of patients treated with Andexxa had died at 30 days, as compared with 25.5% of those receiving “usual care”.
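
Translating the rates above into rough numbers needed to treat or harm (with the caveat, again, that the mortality comparison is underpowered):

```python
def number_needed(rate_a: float, rate_b: float) -> float:
    """Number needed to treat (or harm) implied by an absolute rate difference."""
    return 1 / abs(rate_a - rate_b)

print(f"'good hemostatic efficacy' (surrogate): NNT ~{number_needed(0.670, 0.531):.0f}")
print(f"ischemic stroke:                        NNH ~{number_needed(0.065, 0.015):.0f}")
print(f"myocardial infarction:                  NNH ~{number_needed(0.042, 0.015):.0f}")
print(f"30-day mortality (underpowered):        NNH ~{number_needed(0.278, 0.255):.0f}")
```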

So, there you have it – such a “success” story of a trial it needed to be stopped early, and we still have no clear evidence Andexxa ought to be favored over “usual care”. The authors merrily cite INTERACT1, the trial upon which the “hematoma growth” surrogate is “validated” – and they will rely on this heavily for marketing purposes. In the end, we have exactly what we ought to have expected from a trial designed to stand on its head to deliver for its product, and we as clinicians are ever-poorer for it.

“Andexanet for Factor Xa Inhibitor-Associated Acute Intracerebral Hemorrhage”