
P-Values: To Own or Not to Own?

1 April 2019

Karen Kafadar

The debate about the value of hypothesis testing and the over-reliance on p-values as a cornerstone of statistical methodology started well over a century ago, and it continues today. Many researchers, including statisticians, have commented on their use—and their abuse. Building on the presentations at the 2017 Symposium on Statistical Inference, the ASA published the March 2019 issue of The American Statistician devoted entirely to this topic. (If you haven’t done so already, I encourage you to read this issue. NPR, Nature, and many others commented on it the day the issue appeared.) The messages in the articles from that issue (all online) are not surprising to us: The “0.05 threshold” for p-values is arbitrary, and the notion of “p < 0.05” as “statistically significant” hardly makes sense in many (much less all) situations. Perhaps what is, or should be, surprising to us is where statisticians were when the “abuse” started to take hold.

Stephen Stigler notes that this connection between p-values of 0.05 and “statistical significance” started well before Fisher: “Even in the 19th century, we find people such as Francis Edgeworth taking values ‘like’ 5%—namely 1%, 3.25%, or 7%—as a criterion for how firm evidence should be before considering a matter seriously” (CHANCE 21:4, 2008).

This sentence raises the central issue. How firm should evidence be “before considering a matter seriously”? The answer is one we statisticians have given frequently to our clients: “It depends.” (Statisticians can be accused of using that phrase excessively.) How big is the study, how many inquiries do you plan to make of the data, how many analyses do you plan to run, what other data might bear on this study, what are the risks of false claims, …? In short, the answer requires us to think. (What a concept.)

Many years ago, I met a wonderful lady named Edith Flaster, a biostatistician from Columbia University. Throughout her life, Edith approached problems—in statistics and elsewhere—with sensible and practical solutions. Professionally, Edith had learned much from giants like Cuthbert Daniel and Fred Wood, who came through Columbia on numerous occasions. One evening, she recalled the old days of computing on mainframes, when every department had a computer budget and analyses cost real money. “Consequently,” she said, “you had to think very carefully before you burned your computer budget on an analysis; you wanted to be sure the analysis made sense before you ran it. Today, computing is cheap, so people run hundreds of analyses, without even thinking before they run them. I don’t care if you think before you run the analysis or after—but somewhere along the line you have to think.” Calculating p-values does not relieve us of our duty to remind our collaborators that we still have to think. And the more p-values we calculate, the more we have to think.
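Edith's point about unchecked analyses has a simple quantitative face. The short simulation below is an illustrative sketch I am adding, not part of this column: it draws data with no real effects, runs a hypothetical battery of 20 two-sample t-tests per study, and counts how often at least one of them falls below 0.05 purely by chance.

```python
# Illustrative sketch only: how often does "p < 0.05" appear by chance
# when many analyses are run on data containing no real effects?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies = 1000   # hypothetical number of simulated studies
n_tests = 20       # analyses run per study, all on pure noise
n_obs = 30         # observations per group

studies_with_false_alarm = 0
for _ in range(n_studies):
    p_values = []
    for _ in range(n_tests):
        x = rng.normal(size=n_obs)   # group 1: no true effect
        y = rng.normal(size=n_obs)   # group 2: no true effect
        p_values.append(stats.ttest_ind(x, y).pvalue)
    if min(p_values) < 0.05:
        studies_with_false_alarm += 1

# With 20 independent tests per study, roughly 1 - 0.95**20, or about 64%,
# of studies will show at least one "significant" result by chance alone.
print(f"Studies with at least one p < 0.05: {studies_with_false_alarm / n_studies:.1%}")
```

The arithmetic behind Edith's advice is that simple: every additional analysis buys noise another chance to cross whatever threshold is in use, which is exactly why more p-values demand more thinking, not less.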

Many of us would agree that, if we were to remove all thresholds for deciding when to take a result seriously, we may find ourselves back in the days of the Wild West. (Some may fear we are already there, given the proliferation of journals and analyses they contain.) We, unlike a few journal editors, recognize that adherence to a fixed p-value in all situations is not the antidote. And it is not a substitute for thinking. How many times has your collaborator insisted you include “(p < 0.05)” in the paper you are writing, “because the journal requires it”? Regrettably, stating the p-value (to several decimal places no less, as if anyone would believe them) has become a requirement for many journals.

On the other hand, we need some sort of structure. We agree that the fixed threshold of “p < 0.05,” and its identification with the term “statistical significance,” is not sensible. (Even Sir Ronald, who receives “credit” (or “blame”) for popularizing the 5% threshold, was reported to have said he’d be more likely to trust a result where p < 0.05 in 10 experiments than a result where p < 0.005 in a single experiment.) But if we advise scientists to dismiss any notion of thinking in advance about a level beyond which we take a result seriously, our profession may run the risk of being dismissed altogether—especially when our clients can go to “data scientists,” who won’t bother them with p-values at all—or, in fact, with any firm statistical foundations for their “scientific findings.”
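Fisher's reported remark can be made concrete with a small calculation. The sketch below is an illustration I am adding under stated assumptions (the ten p-values of 0.049 are hypothetical, and the calculation uses Fisher's standard method for combining independent p-values, not anything prescribed in this column): ten experiments that each just clear the 0.05 line combine to evidence far stronger than a single experiment at p = 0.005.

```python
# Illustrative sketch: combined evidence from ten modest p-values
# versus one small p-value, using Fisher's method of combining p-values.
import numpy as np
from scipy import stats

ten_modest = [0.049] * 10   # hypothetical: ten independent experiments, each just under 0.05
one_small = [0.005]         # hypothetical: a single experiment with p = 0.005

for label, pvals in [("ten experiments at p = 0.049", ten_modest),
                     ("one experiment at p = 0.005", one_small)]:
    # Fisher's statistic: -2 * sum(log p_i) follows a chi-square with 2k df under the null
    chi2_stat = -2 * np.sum(np.log(pvals))
    combined_p = stats.chi2.sf(chi2_stat, df=2 * len(pvals))
    print(f"{label}: combined p = {combined_p:.1e}")

# scipy also provides this directly:
# stats.combine_pvalues(ten_modest, method="fisher")
```

Under independence, the ten modest results combine to a p-value about three orders of magnitude below 0.005, which is the quantitative content of Fisher's preference for replication over a single small p-value.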

The real question is, where were we statisticians, and where have we been, when our collaborators and journal editors set this limit as a criterion for publication? Many of us were conducting research that has allowed our profession to flourish. That’s been terrific. But the well-intentioned editors of scientific journals either ignored any notion of “thresholds for evidence” or insisted on an “algorithm” or “golden rule”—like “p < 0.05.” As we’ve reminded our colleagues in other professions, algorithms don’t always lead us to “truth.” Nonetheless, the structure of an algorithm can be useful in getting us to think.

Stigler ends his article in CHANCE with a thoughtful sentiment:

One may look to Fisher’s table for the F-distribution and his use of percentage points as leading to subsequent abuses by others. Or, one may consider the formatting of his tables as a brilliant stroke of simplification that opened the arcane domain of statistical calculation to a world of experimenters and research workers who would begin to bring a statistical measure to their data analyses. There is some truth in both views, but they are inextricably related, and I tend to give more attention to the latter, while blaming Fisher’s descendants for the former.

Alas, we are the descendants. We must take responsibility for the situation in which we find ourselves today (and during the past decades) regarding the use—and abuse—of our well-researched statistical methodology. And we must also, therefore, take responsibility for trying to change it.

I fervently hope the articles in the special issue of The American Statistician will not be viewed as a call to dismiss an area of our profession that has served, and continues to serve, us and science so well. Rather, I hope the articles will inspire us to encourage our colleagues to think about the data analysis process and to speak up to editors who, in their desire to bring structure to the inference process, may have gone just a little overboard. If anything, the continued controversy about p-values and statistical significance reminds us that our job as statisticians is far from done and that we are needed more than ever in this era of “data science” that embraces algorithms (with appealing names) and shuns complicated statistical inference. As noted in the last two columns, the debate reminds us to do the following:

  • Showcase all our talents—logical thinking, identification of process steps, design of relevant data collection, analysis and inference, characterization of uncertainty, clear results
  • Seize opportunities to create the demands for our talents—and then meet the demands with hard thinking
  • Be prepared to use our skills to present reasonable approaches to solving problems and encourage hard thinking, rather than blind adherence to fixed thresholds.

Please share your experiences—and your successes—in our mission to bring “sound thinking” to your collaborators. I look forward to hearing about them!


One Comment

  • Ming Ji said:

I wonder if we can find out which journal first started to require p < 0.05 for publishing papers, and why other journals followed this practice.