Sunday, April 28, 2019

Why Statistical Significance is Killing Science

In 2016, the American Statistical Association1 released a statement warning against the misuse of statistical significance in interpreting scientific research. A commentary recently published in the journal Nature2 goes further, calling on the research community to abandon the concept of statistical significance altogether. According to the article,3 it was endorsed by more than 800 statisticians and scientists from around the world before publication. Why are so many researchers concerned about the P-value in statistical analysis?
In 2014, George Cobb, a professor emeritus of mathematics and statistics, posed two questions to members of an American Statistical Association discussion forum.4 First, he asked why so many colleges and grad schools teach P = 0.05; the answer was that this is still the value the scientific community uses. Then he asked why the scientific community uses this particular P-value; the answer was that this is what is taught in college and grad school.
In other words, circular logic sustains the continued use of an arbitrary cutoff of P = 0.05. Additionally, researchers and manufacturers can alter the perception of statistical significance, making a treatment’s benefit over the control group appear larger or smaller simply by choosing to report relative rather than absolute risk.
However, since many readers are not statisticians, it’s helpful to first understand the mathematical basis behind P-values and confidence intervals, and how absolute and relative risk may be easily manipulated.

Probability Frameworks Define How Researchers Present Numbers

At the beginning of a study, researchers define a hypothesis, a proposed explanation made on the basis of limited evidence, which they hope the research will either support or refute. Once the data are gathered, researchers employ statisticians to analyze the information and determine whether the experiment supports their hypothesis.
The world of statistics is all about probability, which is simply how likely it is that something will or will not happen, based on the data. Data collected from samples are used in science to infer whether what happens in the sample would likely happen in the entire population.5
For instance, if you wanted to find the average height of men around the world, you couldn’t measure every man’s height, so researchers would estimate it by gathering samples from subpopulations and inferring the average from those. These numbers are then evaluated within a probability framework. Medical research6 may be analyzed under a Bayesian framework7 or under the more common frequentist framework.
Under a Bayesian framework, researchers treat probability as a general measure of belief, so there is no problem assigning probabilities to nonrepeatable events or to hypotheses.
The frequentist framework defines probability as the long-run frequency of occurrence in repeatable random events. In other words, frequentists do not attach probabilities to hypotheses or to any fixed but unknown values.8
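To make the contrast concrete, here is a minimal sketch (assuming Python with SciPy is available; the data, 60 “successes” out of 100 trials, are purely illustrative and not from any study cited here) showing how the same result is summarized under each framework:

# Minimal sketch contrasting the two frameworks, assuming Python with SciPy.
# The data (60 "successes" out of 100 trials) are purely illustrative.
from scipy import stats

successes, n = 60, 100
p_hat = successes / n

# Frequentist view: the true proportion is a fixed, unknown number, so we
# report a point estimate and a 95 percent confidence interval around it.
se = (p_hat * (1 - p_hat) / n) ** 0.5
z = stats.norm.ppf(0.975)                      # about 1.96 for 95 percent
ci_low, ci_high = p_hat - z * se, p_hat + z * se
print(f"Frequentist: estimate {p_hat:.2f}, 95% CI {ci_low:.2f} to {ci_high:.2f}")

# Bayesian view: the proportion itself receives a probability distribution.
# With a flat Beta(1, 1) prior, the posterior is Beta(1 + 60, 1 + 40).
posterior = stats.beta(1 + successes, 1 + (n - successes))
cred_low, cred_high = posterior.interval(0.95)
print(f"Bayesian: 95% credible interval {cred_low:.2f} to {cred_high:.2f}")

# A Bayesian can also state a probability for a hypothesis directly,
# e.g., the probability that the true proportion exceeds 0.5.
print(f"P(true proportion > 0.5) = {1 - posterior.cdf(0.5):.2f}")

The frequentist interval says nothing about the probability of a hypothesis; the Bayesian posterior does, which is exactly the distinction described above.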
Within these frameworks, the P-value is determined. The researcher first defines a null hypothesis, stating there is no difference or no change between the control group and the experimental group.9 The alternative hypothesis is the opposite of the null hypothesis, stating there is a difference.

What’s Behind the Numbers?

The P-value is often described simply as the probability that the null hypothesis is true; strictly speaking, it is the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. Under the simple reading, P = 0.25 is taken to mean a 25 percent probability of no change between the experimental group and the control group.10 In the medical field,11 the accepted cutoff is P = 0.05, the threshold below which a result is considered statistically significant.
When the P-value falls below 0.05, or 5 percent, researchers say they are 95 percent confident that the difference between the two observations reflects a real effect rather than random variation, and the null hypothesis is rejected.12
Researchers look for a small P-value, typically less than 0.05, as evidence the null hypothesis may be rejected. P-values close to the cutoff are considered marginal and could reasonably be read either way.13
Since perfectly random samples cannot be obtained, and definitive conclusions are difficult to draw without them, the P-value is used as a way of accounting for the uncertainty that sampling introduces.14
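To make the mechanics concrete, here is a minimal sketch (assuming Python with SciPy; the measurements are made up, not data from any study cited here) of how a P-value for a two-group comparison is computed and checked against the 0.05 cutoff:

# Minimal sketch of a null-hypothesis test, assuming Python with SciPy.
# The two groups below are illustrative numbers, not data from any study.
from scipy import stats

control      = [5.1, 4.9, 5.3, 5.0, 4.8, 5.2, 5.1, 4.7]
experimental = [5.6, 5.4, 5.9, 5.2, 5.7, 5.5, 5.8, 5.3]

# Null hypothesis: no difference in means between the groups.
# Alternative hypothesis: the means differ.
t_stat, p_value = stats.ttest_ind(experimental, control)

print(f"t = {t_stat:.2f}, P = {p_value:.4f}")
if p_value < 0.05:
    print("Below the conventional 0.05 cutoff: labeled 'statistically significant'")
else:
    print("Above the cutoff: the data are compatible with no difference")

# Note: P is the probability of data at least this extreme *if* the null
# hypothesis were true; it is not the probability that the null is true.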
Closely related to the P-value are the confidence interval and confidence level. Imagine you’re trying to find out how many people in Ohio took a two-week vacation in the past year. You could ask every resident in the state, but to save time and money you could sample a smaller group, and the answer would be an estimate.15 Each time you repeat the survey, the results may be slightly different.
When using this type of estimate, researchers use a confidence interval, a range of values above and below the finding within which the actual value is likely to fall. If the margin of error is 4 percentage points and 47 percent of the sample took a two-week vacation, researchers estimate that, had they asked the entire relevant population, between 43 percent and 51 percent would have taken a two-week vacation.
The confidence level expresses how often, across repeated surveys, an interval built this way would contain the true population value. At a confidence level of 95 percent, the researcher is 95 percent confident that the true figure lies between 43 percent and 51 percent.16
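As a rough sketch of the arithmetic (assuming Python with SciPy; the sample size of 600 is an assumption, since the example above does not state one), the 47 percent estimate and the roughly 4-point margin of error follow from the standard formula for a proportion:

# Sketch of the 95 percent confidence interval behind the vacation example,
# assuming Python with SciPy. The sample size (600) is an assumption; the
# example above does not state one. With n = 600 the margin of error works
# out to roughly the 4 points used in the text.
from scipy import stats

p_hat = 0.47          # 47% of the sample took a two-week vacation
n = 600               # assumed number of people surveyed

se = (p_hat * (1 - p_hat) / n) ** 0.5
z = stats.norm.ppf(0.975)              # 95 percent confidence level
margin = z * se

low, high = p_hat - margin, p_hat + margin
print(f"Margin of error: {margin:.1%}")          # about 4 percentage points
print(f"95% CI: {low:.1%} to {high:.1%}")        # roughly 43% to 51%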

Scientists Rebelling Against Statistical Significance

Kenneth Rothman, professor of epidemiology and medicine at Boston University, took to Twitter with a copy of a letter to the JAMA editor after it was rejected from the medical journal.17 In the letter, signed by Rothman and two of his colleagues from Boston University, they outline their agreement with the American Statistical Association statement, stating,18 “Scientific conclusions and business or policy decisions should not be based only on whether a P-value passes a specific threshold.”
William M. Briggs, PhD, author and statistician, writes that all statisticians have felt the stinging disappointment of clients when P-values do not fit the client’s expectations, despite explanations that statistical significance has no bearing on real life and that there may be better methods of evaluating an experiment’s success.19
After receiving emails from other statisticians outlining their reasons for maintaining the status quo of using P-values to judge the value of a study, while ignoring the arguments he lays out, Briggs goes on to say:20
A popular thrust is to say smart people wouldn’t use something dumb, like P-values. To which I respond smart people do lots of dumb things. And voting doesn’t give truth.

Numbers May Not Accurately Represent Results

The Nature commentary delves into the reasons why P-values, confidence intervals and confidence levels are not accurate representations of whether a study has proven or disproven its hypothesis. The authors urge researchers to:21
[N]ever conclude there is ‘no difference’ or ‘no association’ just because a P value is larger than a threshold such as 0.05 or, equivalently, because a confidence interval includes zero. Neither should we conclude that two studies conflict because one had a statistically significant result and the other did not. These errors waste research efforts and misinform policy decisions.
The authors compare two studies analyzing the effects of anti-inflammatory drugs. Although both studies observed exactly the same risk ratio of 1.2, the study with more precise measurements produced a statistically significant result while the other did not. The authors wrote:22
It is ludicrous to conclude that the statistically non-significant results showed ‘no association,’ when the interval estimate included serious risk increases; it is equally absurd to claim these results were in contrast with the earlier results showing an identical observed effect. Yet these common practices show how reliance on thresholds of statistical significance can mislead us.
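A rough numerical sketch (assuming Python with NumPy; the interval widths below are hypothetical, since only the shared risk ratio of 1.2 comes from the commentary) shows how precision alone can flip the “significant” label:

# Sketch of how two studies with the same observed risk ratio (1.2) can be
# labeled differently, assuming Python with NumPy. The standard errors are
# hypothetical; only the shared risk ratio of 1.2 comes from the article.
import numpy as np

risk_ratio = 1.2
log_rr = np.log(risk_ratio)

# Study A: more precise (smaller standard error on the log scale).
# Study B: less precise (larger standard error).
for name, se_log in [("Study A (precise)", 0.07), ("Study B (imprecise)", 0.15)]:
    low = np.exp(log_rr - 1.96 * se_log)
    high = np.exp(log_rr + 1.96 * se_log)
    significant = low > 1.0     # interval excludes a risk ratio of 1 (no effect)
    label = "statistically significant" if significant else "not significant"
    print(f"{name}: RR = 1.2, 95% CI {low:.2f} to {high:.2f}, {label}")

# Same observed effect, different labels: exactly the threshold problem
# the authors describe.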
The authors call for the entire concept of statistical significance to be abandoned and urge researchers to embrace uncertainty. Scientists should describe the practical implications of the full range of values compatible with the data, and the limits of the data, rather than relying on a null hypothesis test and claiming no association whenever a threshold is not met.23
They believe this shift will eliminate bad practices and may usher in better ones. Rather than leaning on a single test of significance, they hope scientists will include more detailed methods sections and present their estimates explicitly, discussing the upper and lower limits of their confidence intervals.

Relative Risk or Absolute Risk?

George Canning was a British statesman who served briefly as prime minister in 1827.24 He was quoted in the Dictionary of Thoughts, published in 1908, as saying, “I can prove anything by statistics except the truth.”25
As you read research or media stories, the risk associated with a particular action is usually expressed as relative risk or absolute risk. Unfortunately, the type of risk may not be identified. For instance, you may hear a particular action will reduce the risk of prostate cancer by 65 percent.
Unless you know if this refers to absolute risk or relative risk, it’s difficult to determine how much this action would affect you. Relative risk is a number used to compare the risk between two different groups, often an experimental group and a control group. The absolute risk is a number that stands on its own and does not require comparison.26
For instance, imagine there were a clinical trial to evaluate a new medication researchers hypothesized would prevent prostate cancer, and 200 men signed up for the trial. The researchers split the group into two, with 100 men receiving a placebo and 100 men receiving the experimental drug.
In the control group, two men developed prostate cancer. In the treatment group only one man developed prostate cancer. When the two groups are compared, the researchers find there is a 50 percent reduction in prostate cancer when they talk about relative risk. This is because one developed it in the treatment group and two developed it in the control group.
Since one is half of two, there is a 50 percent reduction in the development of the disease. This number can sound really good and potentially encourage someone to take a medication with significant side effects if they believe it can cut their risk of prostate cancer in half.
The absolute numbers, however, tell a different story. In the control group, 98 men never developed cancer; in the treatment group, 99 men never developed cancer. Put another way, the risk of developing prostate cancer was 2 percent in the control group, since 2 out of 100 got cancer, while in the treatment group the risk was 1 percent.
This means the absolute risk of developing prostate cancer was 1 percent with the medication, compared to 2 percent without it. The difference, the absolute risk reduction, is not 50 percent but 1 percentage point (2 percent minus 1 percent). Knowing this, taking the drug may not seem worth it.
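The arithmetic above is short enough to check in a few lines; here is a sketch in Python using the exact counts from the example (2 of 100 control men and 1 of 100 treated men developed cancer):

# Relative versus absolute risk, using the counts from the example above.
control_events, control_n = 2, 100
treated_events, treated_n = 1, 100

control_risk = control_events / control_n      # 0.02, i.e., 2 percent
treated_risk = treated_events / treated_n      # 0.01, i.e., 1 percent

relative_risk_reduction = (control_risk - treated_risk) / control_risk
absolute_risk_reduction = control_risk - treated_risk

print(f"Relative risk reduction: {relative_risk_reduction:.0%}")   # 50%
print(f"Absolute risk reduction: {absolute_risk_reduction:.0%}")   # 1 percentage point

# The same trial can honestly be reported as "cuts risk in half" or as
# "lowers risk by one percentage point"; the framing drives the impression.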

Note: This article was reprinted with the author’s permission. It was originally published on Dr. Mercola’s website at www.mercola.com.