A Simple but Comprehensive Explanation of Statistical Tests with Type 1 vs. Type 2 Errors, P-values, Power, Sensitivity, Precision, and More (Part A)

Martinqiu
5 min read · Sep 17, 2021


One reason for writing this article is the recent popularity of A/B tests in business analytics. I was involved in designing several A/B testing projects for companies, and the evaluation of A/B tests critically depends on statistical tests. Another reason is the use of sensitivity, recall, and specificity in COVID-19-related discussions around me. Some of those uses were wrong, and I want to explain why.

Why do we need statistical tests? The simple reason is that in complicated situations, our intuition cannot tell differences apart reliably. We need math to help. I illustrate this point via an example, beginning with an introduction to the null hypothesis.

In every mainstream statistical test, there is a null hypothesis that we intend to challenge. If we are interested in whether offering free WiFi in a grocery store will increase sales, the null hypothesis states that sales are the same whether or not free WiFi is provided. If you are OK with this statement, you can walk away. But if you disagree, you put forward an opposing argument, called the alternative hypothesis, stating that offering free WiFi will increase sales, as I intend to show in the example. Formally, we have

Null (H0): Offering free WiFi does not increase sales

Alternative (H1): Offering free WiFi increases sales

Then we design the comparison and collect data on sales. We, of course, hope the data will challenge the null and show that providing free WiFi increases sales. To implement the comparison, over a number of days we offer free WiFi on some days and not on others. Or we choose a time block within each day to provide free WiFi. The key is to make the offering of free WiFi random, not associated with any pattern (always offering free WiFi in the morning, or always on weekends, should be avoided). This is a matter of experiment design, and I don't elaborate on it much in this article.
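For concreteness, here is a minimal sketch of one way to randomize the treatment. The 14-day window and the two condition labels are my own illustrative assumptions, not the actual schedule used in the example.

```python
# A minimal sketch of randomizing the free-WiFi treatment across days.
# The 14-day window and condition labels are illustrative assumptions.
import random

random.seed(42)  # fix the seed so the assignment is reproducible

days = [f"Day {i}" for i in range(1, 15)]                # hypothetical 14-day window
conditions = ["free WiFi", "no WiFi"] * (len(days) // 2)  # balanced: 7 days of each
random.shuffle(conditions)                                # randomize, so WiFi is not tied to weekdays or weekends

for day, condition in zip(days, conditions):
    print(day, "->", condition)
```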

Assume we implement free WiFi or not within the same day, balancing the total duration of the free-WiFi and no-WiFi periods in each day. Finally, we have data that look like the table below.

Back to the question: does free WiFi bring more sales than no WiFi? Do the data challenge the null and support the alternative hypothesis? Checking the data, we find some days when offering free WiFi brings more sales and some days when it brings fewer (e.g., Days 2, 3, and 6). So simply comparing the sales numbers day by day won't provide a decisive answer. We need a statistical test to answer this question.
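To make the test concrete, here is a minimal sketch using SciPy. The sales figures are invented for illustration (the article's actual table is an image), and a paired, one-sided t-test is just one natural choice given the within-day design described above; it is a sketch, not necessarily the exact test one would settle on.

```python
# Compare daily sales under free WiFi vs. no WiFi with a paired t-test.
# The numbers below are made up for illustration.
from scipy import stats

wifi_sales    = [520, 480, 495, 560, 530, 475, 510]   # daily sales during free-WiFi blocks
no_wifi_sales = [500, 490, 505, 540, 500, 480, 470]   # daily sales during no-WiFi blocks

# Paired (per-day) comparison; one-sided because H1 says WiFi *increases* sales.
t_stat, p_value = stats.ttest_rel(wifi_sales, no_wifi_sales, alternative="greater")
print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")
# If the p-value falls below the 5% cut-off discussed next, we reject the null.
```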

What’s the type 1 error? Type 1 error is the situation you are fooled by chance to challenge the null. You thought you found something meaningful (e.g., free WiFi increases sales), but indeed, you don’t. It’s just you getting lucky (or unlucky) that your data happen to challenge the null hypothesis. If you collect new data and run tests again, you will realize the finding is “fake,” and you do not really reject the null.

The probability of a Type 1 error (i.e., claiming a fake finding) is called the p-value (a very loose definition, but approximately correct). Nobody likes fake findings, but the chance of a fake finding is always there, so the only thing people can do is minimize that chance. How do we know whether this chance is large or small? We first use statistics to compute a p-value for the test conducted on the specific data we collected. All stats packages calculate p-values for you automatically. No worries. P-values are data dependent: if you run the same test on a different data set, the stats package will calculate a different p-value. So if you are unlucky enough to get a bad data set, the p-value can be misleading.

Next, we compare the p-value with a small cut-off value. This cut-off value is usually set at 5% (no idea who chose this value or why). That's why we require p-values to be less than 5% to build confidence in challenging the null hypothesis, and the smaller the p-value, the better. 5% means you get unlucky and declare a fake finding once in every 20 times, which is acceptable. If the p-value is greater than 5%, we won't be able to challenge the null; even if you still believe the finding, other people are unlikely to accept it. As a side note, in some disciplines people use the alpha (α) level to name the cut-off value (5% is the most commonly used; 1% is not rare).
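Here is a small simulation of what the 5% cut-off means in practice: if the null is literally true (free WiFi has no effect at all), about 1 test in 20 still produces a p-value below 5%. The sample sizes and normal distributions are assumptions made purely for illustration.

```python
# Simulate many experiments in which the null is true and count how often
# we would still "discover" an effect at the 5% cut-off.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments = 10_000
false_rejections = 0

for _ in range(n_experiments):
    # Both conditions are drawn from the same distribution: the null is true.
    a = rng.normal(loc=500, scale=30, size=14)
    b = rng.normal(loc=500, scale=30, size=14)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_rejections += 1   # a "fake finding" -- a Type 1 error

print(false_rejections / n_experiments)  # close to 0.05, i.e., about 1 in 20
```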

Let’s look at another example. There are two decks of cards. Deck A has 20 aces and four twos. Deck B has 20 twos and four aces.

[Images: Deck A and Deck B]

I put one deck (all cards face down) in front of you and ask you to draw only one card to tell me which deck it is, A or B.

If you randomly draw an ace from the deck and guess that the deck is Deck A (most people will, including me), there is still a chance (1/6) that the deck is B.

Formally put, if we frame the task above as a statistical test, the null hypothesis is that the deck is B, and the alternative is that the deck is A. A Type 1 error is the mistake of concluding the deck is A when it is in fact B.

The probability of this error, 1/6 ≈ 16.7%, is the α of the test. Since 16.7% is well above the 5% cut-off, we cannot assert with great confidence that the deck is A.
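The 1/6 figure works out directly: under the null (the deck is B), 4 of the 24 cards are aces, so the chance of drawing an ace, and therefore wrongly declaring "Deck A," is 4/24.

```python
# The Type 1 error probability in the deck example, worked out explicitly.
aces_in_deck_B, cards_in_deck_B = 4, 24    # Deck B: 20 twos + 4 aces
p_type1 = aces_in_deck_B / cards_in_deck_B
print(f"{p_type1:.3f}")                    # 0.167 -> about 16.7%, well above the 5% cut-off
```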

Accordingly, a Type 2 error is the mistake of failing to challenge a null hypothesis that should be challenged. By now you will realize that the contents of Type 1 and Type 2 errors depend entirely on which opposing statement is used as the null hypothesis and which as the alternative. We usually specify the null hypothesis as the status quo: no change, no difference, normal, etc. In medical science and related disciplines, a negative result is the null and a positive result is the alternative. That's why we need to avoid positive people in the Year of COVID.

For example, if we test whether someone is infected with the virus or not, the null is that the person is not infected, and the alternative is that the person is infected. A Type 1 error occurs when the person is not infected but the test result is positive (a false positive). A Type 2 error occurs when the person is infected but the result is negative (a false negative).
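As a small sketch, the mapping from test outcomes to the two error types can be written down directly; the function name and inputs below are hypothetical, just to make the labels explicit.

```python
# Label one test outcome relative to the null hypothesis "not infected".
def outcome(infected: bool, test_positive: bool) -> str:
    if not infected and test_positive:
        return "Type 1 error (false positive)"
    if infected and not test_positive:
        return "Type 2 error (false negative)"
    return "correct result"

print(outcome(infected=False, test_positive=True))   # Type 1 error (false positive)
print(outcome(infected=True,  test_positive=False))  # Type 2 error (false negative)
```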

That's enough reading for now. Before I conclude this article, here are two questions to ponder. In the COVID example above, which type of error bears more severe consequences? What is the implication for designing COVID-19 tests? I will return with answers.


Martinqiu

I am a marketing professor and I teach BDMA (big data and marketing analytics) at Lazaridis School of Business, Wilfrid Laurier University, in Waterloo, Canada.