What Does a Sepsis Alert Gain You?

The Electronic Health Record is no longer simply that – a record of events and clinical documentation. Decision support has, for good or ill, morphed it into a digital nanny, a vehicle for all manner of burdensome nagging. Many systems have implemented a “sepsis alert”, typically based on vital signs collected at initial assessment. The very reasonable goal is early detection of sepsis and early initiation of appropriately directed therapy. The downside, unfortunately, is that such alerts are rarely true positives for severe sepsis in the broadest sense – alerts far outnumber the instances in which a change in clinical practice results in a change in outcome.

So, what to make of this:

This study describes the before-and-after performance of a quality improvement intervention to reduce missed diagnoses of sepsis, part of which was the introduction of a triage-based EHR alert. These alerts fired during initial assessment based on abnormal vital signs and the presence of high-risk features. The article describes baseline characteristics for a pre-intervention phase of 86,037 Emergency Department visits, followed by a post-intervention phase of 96,472 visits. During the post-intervention phase, there were 1,112 electronic sepsis alerts, 265 of which resulted in initiation of the sepsis protocol after attending physician consultation. The authors, generally, report fewer missed or delayed diagnoses during the post-intervention period.

But, the evidence underpinning conclusions from these data – as they relate to improvements in clinical care or outcomes, or even the magnitude of the process improvement highlighted in the tweet above – is fraught. The alert here is reported as having a sensitivity of 86.2%, and routine clinical practice picked up nearly all of the remaining cases that were alert-negative. The combined sensitivity is reported to be 99.4%. The specificity, then, appears to be excellent at 99.1% – but, for such an infrequent diagnosis, even using their most generous classification for true positives, the false alerts outnumbered the true alerts nearly 3 to 1.

And, that classification scheme is the crux of determining the value of this approach. The primary outcome was defined as either treatment on the ED sepsis protocol or pediatric ICU care for sepsis. Clearly, part of the primary outcome is directly contaminated by the intervention – an alert encouraging use of a protocol will increase initiation, regardless of appropriateness. This will not impact sensitivity, but will effectively increase specificity and directly inflate PPV.

Importantly, this led the authors to include a sensitivity analysis of their primary outcome, examining how overall performance changes if stricter rules for the primary outcome are entertained. These analyses evaluate the predictive value of the protocol when true positives are restricted to those eventually requiring vasoactive agents or pediatric ICU care – and, unsurprisingly, even this modest restriction produces a dramatic drop in PPV – down to 2.4% for the alert alone.

This number better matches the face validity we’re most familiar with for these simplistic alerts – the vast majority of those triggered have no chance of impacting clinical care and improving outcomes. It should further be recognized that the effect size of early recognition and intervention for sepsis is real, but quite small – and it becomes even smaller when the definition broadens to cases of lower severity. With nearly 100,000 ED visits in both the pre-intervention and post-intervention periods, there was no detectable effect on ICU admission or mortality. Finally, the authors focus on their “hit rate” of 1:4 in their discussion – but, I think it is more likely the number of alerts fired for each case of reduced morbidity or mortality is on the order of hundreds, or possibly thousands.
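To make that arithmetic concrete, here is a minimal sketch of how positive predictive value and alerts-fired-per-outcome-changed behave when the target condition is rare. The sensitivity and specificity are those reported for the alert; the prevalence figures and the fraction of true cases actually benefiting are purely illustrative assumptions, not the study’s data.

```python
# Illustrative sketch only: how PPV collapses at low prevalence.
# Sensitivity/specificity are the reported alert figures; the prevalence values
# and the "fraction benefiting" are assumptions for illustration.

def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value via Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

sens, spec = 0.862, 0.991            # alert performance reported in the paper
for prev in (0.01, 0.003, 0.001):    # assumed prevalence of "true" severe sepsis
    p = ppv(sens, spec, prev)
    print(f"prevalence {prev:.1%}: PPV {p:.1%}, ~{1 / p:.0f} alerts per true case")

# If only a small fraction of true cases have their outcome changed by earlier
# recognition, the alerts fired per improved outcome grow by that factor again.
assumed_fraction_benefiting = 0.01   # assumption: 1 in 100 true cases benefits
alerts_per_outcome = (1 / ppv(sens, spec, 0.003)) / assumed_fraction_benefiting
print(f"~{alerts_per_outcome:.0f} alerts per outcome changed (under these assumptions)")
```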

Ultimately, the reported and publicized magnitude of the improvement in clinical practice likely represents more smoke and mirrors than objective improvements in patient outcomes, and in the zero-sum game of ED time and resources, these sorts of alerts and protocols may represent important subtractions from the care of other patients.

“Improving Recognition of Pediatric Severe Sepsis in the Emergency Department: Contributions of a Vital Sign–Based Electronic Alert and Bedside Clinician Identification”

http://www.annemergmed.com/article/S0196-0644(17)30315-3/abstract

You’ve Got (Troponin) Mail

It’s tragic, of course, that no one in this generation will understand the epiphany of logging on to America Online and being greeted by its nearly synonymous “You’ve got mail!” But, we and future generations may bear witness to the advent of something almost as profoundly uplifting: text-message troponin results.

These authors conceived and describe a fairly simple intervention in which test results – in this case, troponin – were pushed to clinicians’ phones as text messages. In a pilot and cluster-randomized trial with 1,105 patients for final analysis, these authors find the median interval from troponin result to disposition decision was 94 minutes in a control group, as compared with 68 minutes in the intervention cohort. However, a smaller difference in median overall length of stay did not reach statistical significance.

Now, I like this idea – even though this is clearly not the study showing generalizable, definitive benefit. For many patient encounters, there is some readily identifiable bottleneck result of greatest importance for disposition. If a reasonable, curated list of these results is pushed to a mobile device, there is an obvious time savings compared with manually pulling those results from the electronic health record.
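As a sketch of the idea only – the paper does not describe its implementation, and every name below is hypothetical – the logic amounts to filtering a result feed against a curated list and pushing a notification:

```python
# Hypothetical sketch of a "push the bottleneck result" filter.
# Nothing here reflects the study's actual implementation; names are invented.

CURATED_TESTS = {"troponin", "lactate", "beta-hcg"}   # assumed curated list

def should_push(result: dict) -> bool:
    """Push only final results for tests on the curated list."""
    return result["test"].lower() in CURATED_TESTS and result["status"] == "final"

def notify(clinician_phone: str, result: dict) -> None:
    """Stand-in for a hypothetical SMS/push gateway call."""
    print(f"To {clinician_phone}: {result['test']} = {result['value']} "
          f"for patient {result['patient_id']}")

# Synthetic example of an incoming lab result message.
incoming = {"test": "Troponin", "status": "final",
            "value": "<0.01 ng/mL", "patient_id": "12345"}

if should_push(incoming):
    notify("+1-555-0100", incoming)
```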

In this study, however, the median LOS for these patients was over five hours – and their median LOS for all patients receiving at least one troponin was nearly 7.5 hours. The relative effect size, then, is really quite small. Next, there are always concerns relating to interruptions and unintended consequences on cognitive burden. Finally, it logically follows if this text message derives some of its beneficial effect by altering task priorities, then some other process in the Emergency Department is having its completion time increased.

I expect, if implemented in a typically efficient ED, the net result of any improvement might only be a few minutes saved across all encounter types – but multiplied across thousands of patient visits for chest pain, it’s still worth considering.

“Push-Alert Notification of Troponin Results to Physician Smartphones Reduces the Time to Discharge Emergency Department Patients: A Randomized Controlled Trial”
http://www.annemergmed.com/article/S0196-0644(17)30317-7/abstract

Correct, Endovascular Therapy Does Not Benefit All Patients

Unfortunately, that headline is the strongest takeaway available from these data.

Currently, endovascular therapy for stroke is recommended for all patients with a proximal arterial occlusion who can be treated within six hours. The much-ballyhooed “number needed to treat” for benefit is approximately five, and we have authors generating nonsensical literature with titles such as “Endovascular therapy for ischemic stroke: Save a minute—save a week” based on statistical calisthenics from this treatment effect.

But, anyone actually responsible for making decisions for these patients understands this is an average treatment effect. The profound improvements of a handful of patients with the most favorable treatment profiles obfuscate the limited benefit derived by the majority of those potentially eligible.

These authors have endeavored to apply a bit of precision medicine to the decision regarding endovascular intervention. Using ordinal logistic regression, they built a predictive model for good outcome (mRS score 0-2 at 90 days) from the MR CLEAN data, then used the IMS-III data as their validation cohort. The final model displayed a C-statistic of 0.69 for the ordinal model and 0.73 for good functional outcome – which is to say, the output is closer to a coin flip than an informative prediction for use in clinical practice.
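For orientation, here is a minimal sketch of how a binary-outcome version of such a model is fit and its C-statistic computed. The data, features, and coefficients below are synthetic stand-ins, not the MR CLEAN variables.

```python
# Minimal sketch: fit a logistic regression and report the C-statistic (AUC).
# All data and predictors are synthetic stand-ins, not the MR CLEAN variables.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500                                        # roughly the derivation cohort size
X = rng.normal(size=(n, 5))                    # stand-ins for age, NIHSS, time, etc.
logit = 0.8 * X[:, 0] - 0.5 * X[:, 1]          # assumed underlying signal
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))  # synthetic "good outcome" (mRS 0-2)

X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression().fit(X_dev, y_dev)

# C-statistic on held-out data; values near 0.7 are the "closer to a coin flip
# than clinically useful" territory discussed above.
c_stat = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(f"C-statistic: {c_stat:.2f}")
```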

More important, however, is whether the substrate for the model is anachronistic, limiting its generalizability to modern practice. Beyond MR CLEAN, subsequent trials have demonstrated the importance of underlying tissue viability, using either CT perfusion or MRI-based selection criteria, when making treatment decisions. Their model includes only a measure of collateral circulation on angiography, which is merely a surrogate for potential tissue viability. Furthermore, the MR CLEAN cohort comprises only 500 patients, and the IMS-III validation cohort only 260. This sample is far too small to properly develop a model for a set of patients as heterogeneous as those presenting with proximal cerebrovascular occlusion. Finally, the choice of logistic regression can be debated, simply from a modeling standpoint, given its assumptions about underlying linear relationships in the data.

I appreciate the attempt to improve outcomes prediction for individual patients, particularly for a resource-intensive therapy such as endovascular intervention in stroke. Unfortunately, I feel the fundamental limitations of their model invalidate its clinical utility.

“Selection of patients for intra-arterial treatment for acute ischaemic stroke: development and validation of a clinical decision tool in two randomised trials”
http://www.bmj.com/content/357/bmj.j1710

No Change in Ordering Despite Cost Information

Everyone hates the nanny state. When the electronic health record alerts and interrupts clinicians incessantly with decision-“support”, it results in all manner of deleterious unintended consequences. Passive, contextual decision-support has the advantage of avoiding this intrusiveness – but is it effective?

It probably depends on the application, but in this trial, it was not. This is the PRICE (Pragmatic Randomized Introduction of Cost data through the Electronic health record) trial, in which 75 inpatient laboratory tests were randomized to display of usual ordering, or ordering with contextual Medicare cost information. The hope, and study hypothesis, was that the availability of this cost information would exert a cultural pressure of sorts on clinicians to order fewer tests, particularly those with high costs.

Across three Philadelphia-area hospitals comprising 142,921 hospital admissions in a two-year study period, there were no meaningful differences in lab tests ordered per patient day in the intervention or the control. Looking at various subgroups of patients, it is also unlikely there were particularly advantageous effects in any specific population.

Interestingly, one piece of feedback the authors report is that residents suggested most of their routine lab test ordering resulted from admission order sets. “Routine” daily labs are set in motion at the time of admission, not as part of a daily assessment of need, and thus present a natural impediment to reducing low-value testing. However, the authors also note – and this is probably most accurate – that because the cost information was displayed ubiquitously, physicians likely became numb to the intervention. It is reasonable to expect substantially more selective cost information could have focused effects on an area of particularly high cost or low value.

“Effect of a Price Transparency Intervention in the Electronic Health Record on Clinician Ordering of Inpatient Laboratory Tests”

http://jamanetwork.com/journals/jamainternalmedicine/fullarticle/2619519

Oh, The Things We Can Predict!

Philip K. Dick presented us with a short story about the “precogs”, three mutants that foresaw all crime before it could occur. “The Minority Report” was written in 1956 – and, now, 60 years later we do indeed have all manner of digital tools to predict outcomes. However, I doubt Steven Spielberg will be adapting a predictive model for hospitalization for cinema.

This is a rather simple article looking at a single-center experience using multivariate logistic regression to predict hospitalization. It differs somewhat from the existing art in that it uses data available at 10, 60, and 120 minutes from arrival to the Emergency Department as the basis for its “progressive” modeling.

Based on 58,179 visits ending in discharge and 22,683 resulting in hospitalization, the specificity of their prediction method was 90% with a sensitivity of 96%, for an AUC of 0.97. Their work exceeds prior studies mostly on account of improved specificity, compared with the AUCs of a sample of other predictive models, generally between 0.85 and 0.89.

Of course, their model is of zero value to other institutions, as it is overfit not only to this specific data set, but also to the particular practice patterns of physicians in their hospital. Their results also could conceivably be improved, as they do not actually take any test results into account – only the presence of the order for them. That said, I think it is reasonable to expect similar performance from temporal models for predicting admission built on these earliest orders and entries in the electronic health record.
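A minimal sketch of the general approach – separate models trained on whatever is documented by each elapsed-time window – might look like the following. The features, labels, and performance are entirely synthetic assumptions, not the authors’ variables.

```python
# Sketch of "progressive" admission prediction: one model per elapsed-time window,
# each using only the data available by that point. Features are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 5000

# Synthetic feature blocks available at 10, 60, and 120 minutes after arrival.
triage_vitals = rng.normal(size=(n, 4))             # available at 10 minutes
early_orders = rng.binomial(1, 0.3, size=(n, 6))    # labs/imaging ordered by 60 min
later_orders = rng.binomial(1, 0.2, size=(n, 4))    # further orders by 120 min
admitted = rng.binomial(1, 0.28, size=n)            # synthetic admission label

windows = {
    "10 min": triage_vitals,
    "60 min": np.hstack([triage_vitals, early_orders]),
    "120 min": np.hstack([triage_vitals, early_orders, later_orders]),
}

for label, X in windows.items():
    model = LogisticRegression(max_iter=1000).fit(X[: n // 2], admitted[: n // 2])
    auc = roc_auc_score(admitted[n // 2 :], model.predict_proba(X[n // 2 :])[:, 1])
    # With these pure-noise features the AUC hovers near 0.5; real triage data
    # carries the signal that drives the reported 0.97.
    print(f"{label}: AUC {auc:.2f}")
```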

For hospitals interested in improving patient flow and anticipating disposition, there may be efficiencies to be developed from this sort of informatics solution.

“Progressive prediction of hospitalisation in the emergency department: uncovering hidden patterns to improve patient flow”
http://emj.bmj.com/content/early/2017/02/10/emermed-2014-203819

Can We Trust Our Computer ECG Overlords?

If your practice is like my practice, you see a lot of ECGs from triage. ECGs obtained for abdominal pain, dizziness, numbness, fatigue, rectal pain … and some, I assume, are for chest pain. Every one of these ECGs becomes an interruption for review, to ensure no concerning evolving syndrome is missed.

But, a great number of these ECGs are read as “Normal” by the computer – and, anecdotally, these reads are nearly universally correct. This raises the very reasonable question of whether a human need be involved at all.

This simple study tries to examine the real-world performance of computer ECG reading, specifically, the Marquette 12SL software. Over a 16-week convenience sample period, 855 triage ECGs were performed, 222 of which were reported as “Normal” by the computer software. These 222 ECGs were all reviewed by a cardiologist, and 13 were ultimately assigned some pathology – of which all were mild, non-specific abnormalities. Two Emergency Physicians also then reviewed these 13 ECGs to determine what, if any, actions might be taken if presented to them in a real-world context. One of these ECGs was determined by one EP to be sufficient to put the patient in the next available bed from triage, while the remainder required no acute triage intervention. Retrospectively, the patient judged to have an actionable ECG was discharged from the ED and had a normal stress test the next day.
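The negative predictive value arithmetic from those counts is straightforward; a quick calculation is below, with the caveat that the “approaches 99%” framing holds only if the single actionable ECG is counted as the miss.

```python
# NPV of a computer "Normal" read, from the counts reported in the study.
normal_reads = 222    # triage ECGs called "Normal" by the Marquette 12SL software
any_pathology = 13    # reads where the cardiologist assigned some (mild) abnormality
actionable = 1        # reads judged by an EP to warrant immediate action

# Counting any cardiologist-flagged abnormality as a miss:
print(f"NPV (any abnormality): {(normal_reads - any_pathology) / normal_reads:.1%}")
# Counting only the single actionable ECG as a miss:
print(f"NPV (actionable finding): {(normal_reads - actionable) / normal_reads:.1%}")
```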

The authors conclude the negative predictive value of a “Normal” computer read approaches 99%, which could potentially support changes in practice regarding immediate review of triage ECGs. While these findings have some limitations in generalizability regarding the specific ECG software and the relatively small sample, I think they’re on the right track. Interruptions in a multi-tasking setting lead to errors of task resumption, while the likelihood of significant, time-sensitive pathology being missed is quite low. I tend to agree this could be a reasonable quality improvement intervention with prospective monitoring.

“Safety of Computer Interpretation of Normal Triage Electrocardiograms”
https://www.ncbi.nlm.nih.gov/pubmed/27519772

The Machine Can Learn

A couple weeks ago I covered computerized diagnosis via symptom checkers, noting their imperfect accuracy – and grossly underperforming crowd-sourced physician knowledge. However, one area that continues to progress is the use of machine learning for outcomes prediction.

This paper describes advances in the use of “big data” for prediction of 30-day and 180-day readmissions for heart failure. The authors used an existing data set from the Telemonitoring to Improve Heart Failure Outcomes trial as substrate, and then applied several machine-learning models to the data with varying inputs.

There were 236 variables available in the data set for use in prediction, weighted and cleaned to account for missing data. Compared with the C statistic from logistic regression as their baseline comparator, the winner was pretty clearly Random Forests. With a baseline 30-day readmission rate of 17.1% and 180-day readmission of 48.9%, the C statistic for the logistic regression model predicting 30-day readmission was 0.533 – basically no predictive skill. The Random Forest model, however, achieved a C statistic of 0.628 by training on the 180-day data set.
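For context, here is a minimal sketch of the kind of head-to-head comparison described, in which a random forest can pick up non-linear structure that a plain logistic regression misses. The data are synthetic, not the Tele-HF variables, and the non-linear signal is an assumption for illustration.

```python
# Sketch: compare C statistics of logistic regression vs. random forest on
# synthetic data with a non-linear signal (not the Tele-HF data set).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 20))
# Assumed non-linear relationship: an interaction plus a threshold effect.
risk = 0.9 * (X[:, 0] * X[:, 1]) + 1.2 * (X[:, 2] > 1.0)
y = rng.binomial(1, 1 / (1 + np.exp(-risk)))        # synthetic "readmission" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, model in [
    ("logistic regression", LogisticRegression(max_iter=1000)),
    ("random forest", RandomForestClassifier(n_estimators=200, random_state=0)),
]:
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name}: C statistic {auc:.3f}")
```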

So, it’s reasonable to suggest there are complex and heterogeneous data for which machine learning methods are superior to traditional models. These are, unfortunately, pretty terrible C statistics, and almost certainly of very limited use for informing clinical care. As with most decision-support algorithms, I would also be curious to see a comparison with a hypothetical C statistic for clinician gestalt. However, for some clinical problems with a wide variety of influential factors, these sorts of models will likely become increasingly prevalent.

“Analysis of Machine Learning Techniques for Heart Failure Readmissions”
http://circoutcomes.ahajournals.org/content/early/2016/11/08/CIRCOUTCOMES.116.003039

The Mechanical Doctor Turk

Automated diagnostic machines consisting of symptom checklists have been evaluated in medicine before. The results were bleak:  symptom-checkers put the correct diagnosis first only 34% of the time, and had the correct diagnosis in the top three only 51% of the time.

However, when these authors published their prior study, they presented those findings in a vacuum – despite the symptom-checkers’ poor performance, how did they compare against human operators? In this short research letter, the authors compare symptom-checker performance against clinicians contributing to a sort of crowdsourced medical diagnosis system.

And, at least for a while longer, the human machine is superior to the machine machine. Humans reading the same vignettes placed the correct diagnosis first 72.1% of the time, and in the top three 84.3% of the time.

With time and further natural language processing and machine learning methods, I expect automated diagnosis engines to catch up with humans – but we’re not there yet!

“Comparison of Physician and Computer Diagnostic Accuracy.”
https://www.ncbi.nlm.nih.gov/pubmed/27723877

Finding the Holes in CPOE

Our digital overlords are increasingly pervasive in medicine. In many respects, the advances of computerized provider order-entry are profoundly useful: some otherwise complex orders are facilitated, serious drug-interactions can be checked, along with a small cadre of other benefits. But, we’ve all encountered its limitations, as well.

This is a qualitative descriptive study of medication errors occurring despite the presence of CPOE. This prospective FDA-sponsored project identified 2,522 medication errors across six hospitals, 1,308 of which were related to CPOE. These errors fell into two main categories: CPOE failed to prevent the error (86.9%) and CPOE facilitated the error (13.1%).

CPOE-facilitated errors are the most obvious. For example, these include instances in which an order set was out-of-date and a non-formulary medication order resulted in delayed care for a patient; interface issues resulting in mis-clicks or misreads; or instances in which CPOE content was simply erroneous.

More interesting, however, are the “failed to prevent the error” issues – which involve things like dose-checking and interaction-checking failures. The issue here is not specifically the CPOE, but that providers have become so dependent upon CPOE as a reliable safety mechanism that we’ve ceded agency to the machine. We are bombarded by so many nonsensical alerts that we’ve begun to operate under an assumption that any order failing to anger our digital nannies must be accurate. These will undoubtedly prove to be the most challenging errors to stamp out, particularly as further cognitive processes are offloaded to automated systems.

“Computerized prescriber order entry– related patient safety reports: analysis of 2522 medication errors”
http://jamia.oxfordjournals.org/content/early/2016/09/27/jamia.ocw125

Stumbling Around Risks and Benefits

Practicing clinicians contain multitudes: the vastness of critical medical knowledge applicable to the nearly infinite permutations of individual patients. Lost in the shuffle, however, is apparently a grasp of the basic fundamentals necessary for shared decision-making: the risks, benefits, and harms of many common treatments.

This simple research letter describes a survey distributed to a convenience sample of residents and attending physicians at two academic medical centers. Physicians were asked to estimate the incidence of a variety of effects from common treatments, both positive and negative. A sample question and result:

[Figure: sample survey question with physicians’ treatment effect estimates]
The green responses are those which fell into the correct range for the question. As you can see, in these two questions, hardly any physician surveyed guessed correctly.  This same pattern is repeated for the remaining questions – involving peptic ulcer prevention, cancer screening, and bleeding complications on aspirin and anticoagulants.

Obviously, only a quarter of participants were attending physicians – though no gross differences in performance were observed between various levels of experience. Then, some of the ranges are narrow with small magnitudes of effect between the “correct” and “incorrect” answers. Regardless, however, the general conclusion of this survey – that we’re not well-equipped to communicate many of the most common treatment effects – is probably valid.

“Physician Understanding and Ability to Communicate Harms and Benefits of Common Medical Treatments”
http://www.ncbi.nlm.nih.gov/pubmed/27571226