A Simple but Comprehensive Explanation of Statistical Tests, Power, and More - Part B

Martinqiu
4 min read · Sep 18, 2021

In my previous article, which discusses type 1 vs. type 2 errors, I asked a question:

In the COVID testing example, which type of error bears the more severe consequences?

Obviously, when the null hypothesis is that the person is not infected, the type 2 error bears the more severe consequences: an infected person is mistakenly diagnosed as healthy and sent away without treatment. In a situation where the type 2 error is more severe, we need to decrease the probability of making it. A test that keeps the probability of a type 2 error low is said to have high power.

High-power tests don’t come free. The cost is that we inevitably get many false positive cases. In other words, a test cannot reduce both type 1 and type 2 errors at the same time; there is a trade-off between them. Specifically, a powerful test reduces type 2 errors but increases type 1 errors, and a “lenient” test does the opposite. The power of a test is built into the test when statisticians or econometricians develop it; we cannot change that. But statisticians or econometricians usually develop multiple tests for similar tasks, so we have the freedom to select a test whose power suits our specific needs.
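To make the trade-off concrete, here is a minimal simulation sketch (my own illustration, not a test from the article): it assumes a one-sided z-test for a mean shift and shows that a stricter significance threshold (smaller alpha) lowers the type 1 error rate but raises the type 2 error rate.

```python
import numpy as np
from scipy import stats

# Hypothetical setup: test H0: mu = 0 against H1: mu > 0 with n = 30
# observations, unit variance, and a true effect of mu = 0.5 under H1.
rng = np.random.default_rng(0)
n, true_mu, sims = 30, 0.5, 10_000

for alpha in (0.10, 0.05, 0.01):
    crit = stats.norm.ppf(1 - alpha)                         # critical z value
    # Type 1 error rate: simulate data under H0 (mu = 0)
    z_null = rng.normal(0.0, 1.0, (sims, n)).mean(axis=1) * np.sqrt(n)
    type1 = np.mean(z_null > crit)
    # Type 2 error rate: simulate data under H1 (mu = 0.5)
    z_alt = rng.normal(true_mu, 1.0, (sims, n)).mean(axis=1) * np.sqrt(n)
    type2 = np.mean(z_alt <= crit)
    print(f"alpha={alpha:.2f}  type1~{type1:.3f}  type2~{type2:.3f}  power~{1 - type2:.3f}")
```

Tightening alpha from 0.10 to 0.01 makes the test stricter about rejecting the null, so false alarms (type 1 errors) become rarer while misses (type 2 errors) become more common.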

In some situations, the consequence of a type 1 error is more severe. For example, a criminal trial can be regarded as a test (we use evidence instead of statistical calculation). The null hypothesis is that the person is innocent; the alternative hypothesis is that the person is guilty. A type 1 error means an innocent person is convicted (a mistake made by rejecting the correct null hypothesis of innocence). In this situation, we need to be more lenient.

Now let’s talk about predictive analytics.

In predictive analytics with two outcomes (a.k.a. dichotomous classification), say A and B, we also make two kinds of mistakes: labeling A as B, and vice versa. To examine those two mistakes, we can borrow the concepts from statistical testing. If we replace the two outcomes, A and B, with negative and positive, respectively, and set up the null hypothesis as being negative, then the analysis of prediction mistakes is almost the same as analyzing type 1 and type 2 errors in statistical tests.

Formally, a dichotomous classification prediction yields a two-by-two outcome table with four numbers (see the example below). The two numbers in the diagonal cells (45 and 90) are the counts of correct predictions. The other two (10 and 20) are the counts of incorrect predictions.
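Here is the outcome table, with rows for the actual class and columns for the predicted class:

                      Predicted negative    Predicted positive
    Actual negative           45                    20
    Actual positive           10                    90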

We first check out the 20 cases wrongly predicted as positive. Applying our stats test concepts, with the null being negative, a type 1 error means we wrongly challenge the null (by predicting a true negative case to be positive). Therefore, out of 65 real negative cases (45 + 20), these 20 cases are the result of type 1 errors. The chance of making a type 1 error in this prediction is 20/65 ≈ 31%. This value is called the False Positive Rate (FPR) in predictive analytics, and it sounds like a p-value to me (some people may disagree).

We then check out the 10 positive cases wrongly predicted as negative. Out of 100 real positive cases (90 + 10), these 10 cases result from type 2 errors. The chance someone is wrongly predicted (tested) negative is 10/100 = 10%. That value is called the False Negative Rate (FNR) in predictive analytics.
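As a quick sketch of the arithmetic, using the counts from the table above:

```python
# Counts from the two-by-two outcome table
tn, fp = 45, 20    # actual negatives: 45 predicted negative, 20 predicted positive
fn, tp = 10, 90    # actual positives: 10 predicted negative, 90 predicted positive

fpr = fp / (fp + tn)   # the prediction's "type 1 error rate": 20 / 65 ≈ 0.31
fnr = fn / (fn + tp)   # the prediction's "type 2 error rate": 10 / 100 = 0.10

print(f"FPR = {fpr:.1%}, FNR = {fnr:.1%}")   # FPR = 30.8%, FNR = 10.0%
```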

To apply our stats test concept of a powerful test: a more “powerful” predictive algorithm reduces the FNR to make sure every positive case is correctly predicted. Hence, the most “powerful” predictive algorithm will simply predict every case positive. The FNR drops to zero, but the FPR skyrockets to 100% (all 65 true negative cases are predicted wrong; see the sketch below). Hence, the trade-off between FNR and FPR in predictive analytics is exactly the same as that between type 1 and type 2 errors in statistical tests.
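Continuing the same sketch, here is the extreme case in which the algorithm predicts every case positive:

```python
# Predict all 165 cases positive
tn, fp = 0, 65     # all 65 actual negatives become false positives
fn, tp = 0, 100    # all 100 actual positives are caught

fnr = fn / (fn + tp)   # 0 / 100 = 0%   -> no type 2 errors at all
fpr = fp / (fp + tn)   # 65 / 65 = 100% -> every true negative is a type 1 error

print(f"FPR = {fpr:.0%}, FNR = {fnr:.0%}")   # FPR = 100%, FNR = 0%
```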

Data scientists don’t use type 1 or type 2 errors, p-values, or even FPR/FNR to evaluate the performance of a predictive algorithm. They use a set of four basic metrics: accuracy, sensitivity, precision, and specificity. Sensitivity goes by another name, recall, which confused some of my students, who listed sensitivity and recall as the two most important metrics in a quiz. They are the same thing.
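One way to see that sensitivity and recall are literally the same number: both are TP / (TP + FN). Here is a small sanity-check sketch using scikit-learn (a library the article does not reference), rebuilt from the counts in the table above:

```python
import numpy as np
from sklearn.metrics import recall_score

# Rebuild label vectors that match the outcome table (0 = negative, 1 = positive)
y_true = np.array([0] * 65 + [1] * 100)
y_pred = np.array([0] * 45 + [1] * 20 + [0] * 10 + [1] * 90)

sensitivity = 90 / (90 + 10)              # TP / (TP + FN) = 0.90
recall = recall_score(y_true, y_pred)     # same formula under a different name

print(sensitivity, recall)                # 0.9 0.9
```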

That’s enough reading for now. I will finish the discussion of the four metrics for predictive dichotomous classification in the next article.

--

Martinqiu

I am a marketing professor and I teach BDMA (big data and marketing analytics) at Lazaridis School of Business, Wilfrid Laurier University, in Waterloo, Canada.