Irreproducible Research & p-Values

I circulated this write-up from Nature last week on Twitter, but was again reminded by Axel Ellrodt of its importance when initiating research – particularly in the context of studies endlessly repeated, or slightly altered, until positive (read: Big Pharma).

The gist of this write-up – beyond the fact the p-value was never really intended as a test of scientific validity and significance – is the important notion that a universal cut-off of 0.05 is inappropriate.  Essentially, if you’re familiar with Bayes’ Theorem and the foundation of evidence-based medicine, you understand the concept that, if a disease is highly unlikely, the rule-in test ought to be tremendously accurate.  Likewise, if a scientific hypothesis is unlikely, or the estimated effect size for a treatment is small, the strength of evidence required to confirm a positive result needs to exceed what the traditional one-in-twenty (0.05) cut-off provides – that is, the threshold for significance ought to be more stringent.  The p-value, then, functions akin to the likelihood ratio – providing not a true dichotomous positive/negative verdict, but simply a further adjustment of the chance a result reflects a real, replicable effect.
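
As a rough illustration of that Bayesian framing – a minimal sketch, with the 80% power, 0.05 alpha, and prior plausibilities below chosen purely for the sake of the arithmetic, not drawn from the Nature piece:

```python
# Sketch: how often does a "p < 0.05" result reflect a true effect?
# Assumes the usual simplification: a fraction `prior` of tested hypotheses
# are actually true, studies have a fixed `power`, and alpha is the cut-off.
# These example numbers are illustrative only.

def ppv_of_significant_result(prior, power=0.8, alpha=0.05):
    """Probability a statistically significant result is a true positive."""
    true_positives = prior * power          # real effects that reach significance
    false_positives = (1 - prior) * alpha   # null effects that reach significance anyway
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.01):  # from coin-flip plausibility down to a long shot
    print(f"prior {prior:>5.0%} -> P(true effect | p < 0.05) = "
          f"{ppv_of_significant_result(prior):.0%}")

# prior 50% -> ~94%; prior 10% -> ~64%; prior 1% -> ~14%
# i.e., for an unlikely hypothesis, p < 0.05 alone leaves the result more
# likely false than true -- hence the argument for a stricter threshold.
```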

This ties back into the original purpose of the p-value in research – to identify topics worth further investigation, not to conclusively confirm true effects.  It also leads very naturally into the recurrent phenomenon, across both clinical and basic science research, of irreproducible results.  And, indeed, Nature’s entire segment on the challenges of reproducibility in research is an excellent read for any developing investigator.

“Scientific method: Statistical errors”
http://www.nature.com/news/scientific-method-statistical-errors-1.14700

“Challenges in Irreproducible Research”
http://www.nature.com/nature/focus/reproducibility/index.html

3 thoughts on “Irreproducible Research & p-Values”

  1. Do you think that the issue of low pre-test probability can really be addressed by merely lowering the target p-value? It seems like many of the issues of bias and so forth may be somewhat independent of effect size (in other words, with a sufficiently wonky study, there's no threshold for significance that's safe).

  2. Clearly – if study design is fundamentally flawed, no measure of statistical significance compensates for asking a question in the wrong fashion. But, as we see throughout the medical literature, folks are saving $$$ by cutting down on sample size through the use of composite endpoints to try and hit the 0.05 number … explicitly weakening their studies in pursuit of a threshold that simply does not offer the same degree of outcomes certainty in every context.

  3. That's fair. But it does seem one can make a good argument that individual studies should set a consistent bar to make comparisons easier, and then it's the job of us as readers — and particularly the job of review authors and guideline committees — to perform that whole Bayesian analysis. Then at the very least you can set your thresholds depending on your own credulity, because that correction is clearly a somewhat arbitrary one. (I would agree, however, that when DESIGNING a study, the authors should consider the pre-test probability, because ideally it should be sufficiently powered and controlled that a positive result would have the potential to successfully change our practice. Too often nowadays we have the situation of a study which is technically positive, but simply not adequate to impress anybody, due to the initial implausibility of the premise.)

    By that logic, of course, researchers might as well not establish any a priori threshold for significance at all. We'd just understand that certain results are not very definitive and others more so, but without any magic number to shoot for. (After all, we already understand that results with a p of .06 versus a p of .04 are not vastly different, despite only one being technically "significant.") That also might reduce the motivation to gerrymander studies to slip under the rope…
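
As a rough footnote to that last point: the Sellke–Berger bound (−e·p·ln p) is one common calibration of how much evidence a p-value can possibly carry against the null – my choice for illustration here, not something cited in the post or comments – and it shows just how little separates 0.04 from 0.06.

```python
import math

def min_bayes_factor(p):
    """Sellke-Berger lower bound on the Bayes factor (null vs. alternative)
    implied by a p-value; valid for p < 1/e. Smaller = stronger evidence."""
    return -math.e * p * math.log(p)

for p in (0.04, 0.05, 0.06):
    bf = min_bayes_factor(p)
    print(f"p = {p:.2f}: best-case Bayes factor {bf:.2f} "
          f"(odds against the null improve by at most {1/bf:.1f}x)")

# p = 0.04 -> ~2.9x, p = 0.05 -> ~2.5x, p = 0.06 -> ~2.2x:
# neighbouring p-values carry similar evidential weight, which is the
# point about .04 vs. .06 not being vastly different.
```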
