Originally published on my personal blog.
When you’re starting your machine learning journey, you’ll come across the null hypothesis and the p-value. At a certain point in your journey, it becomes quite important to know what these mean so that you can make meaningful decisions while designing your machine learning models. So in this post, I’ll try to explain what these two things mean.
Now, if you don’t have a background in statistics, the definitions of null hypothesis and p-value will make no sense to you. It’s just gibberish going way over your head. That’s what happened to me the first few times I tried to understand them. It took me a good couple of days to get an idea of what they mean. I could still be wrong in my understanding to this very day. And I’m sure that you guys will have more knowledge about this than me and will correct me in the comments. So looking forward to that.
To understand this, let’s have a look at some real world data. I looked at the Air Quality Index (AQI) for the city I live in (which is Bangalore) just now and it says the current AQI is 162, which is like 62 units over the “satisfactory” quality index.
For those of you who don’t know, if the AQI is within the 0–50 range, the air is good, and anything more than that but within 100 is moderate. Anything beyond that is unhealthy.
Now, I’ll refer to this website, or any other website for that matter, and tell you that the AQI for my city is 162, which is unhealthy. In this statement, I’m stating something and pointing to a source for both the value and why it is unhealthy. So now, that statement becomes the null hypothesis. In other words, the null hypothesis is the default claim that we take to be true until the data gives us a good reason to doubt it. The null hypothesis is represented as H0 (pronounced “H-naught”).
People with enough experience in statistics would be the first to point out that the statement I chose as the example could be broken down into two null hypotheses, and that’s true. Those two would be:
- The Air Quality Index for Bangalore right now is 162.
- Air Quality Index of more than 150 is unhealthy.
To keep this example simple, we’ll consider the first statement, that the AQI is 162.
Null Hypothesis -> The Air Quality Index for Bangalore on 8th November 2019 at 6 PM is 162.
This statement could be wrong. The people who published this value did a bunch of tests with their tools and came up with this value at the end of the test. But you might have your own tools to measure the quality of the air, or you may feel the air looks too clean to have such a high value. So you want to contest or dispute this statement, and you are ready to prove that the AQI is much less than 162 for that date, time, and place. So your “Alternative Hypothesis” becomes:
Alternative Hypothesis -> The Air Quality Index for Bangalore on 8th November 2019 at 6 PM is less than 162.
One thing to note here is that a null hypothesis is usually the statement that the scientists want to prove wrong, but they start their research by assuming that it is true. Real world examples of a null hypothesis would be something like this:
- The average income for men in the tech industry is the same as the average income for women in the tech industry.
- There is no correlation between frustration and aggression.
As you can see, the scientists assume these statements to be true, or facts, and then start their research or tests to prove them wrong. Similar to what happens in a court of law, these statements are considered to be true unless proven wrong. When the process to prove them wrong starts, the scientists will form another statement which is the opposite of the null hypothesis, and that new statement becomes the alternative hypothesis. So, the alternative hypotheses for the statements above would be:
- The average income for men in the tech industry is NOT the same as the average income for women in the tech industry.
- There is a correlation between frustration and aggression.
Alternative hypotheses are represented as H1 or HA.
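If it helps to see the examples from this post written down compactly, here is a rough sketch in LaTeX notation. The symbols are my own assumptions and not from any particular textbook: μ stands for the true value (or true average) we are asking about, and ρ stands for the true correlation.

```latex
\begin{aligned}
\text{AQI:}\quad         & H_0: \mu = 162                             && H_1: \mu < 162 \\
\text{Income:}\quad      & H_0: \mu_{\text{men}} = \mu_{\text{women}} && H_1: \mu_{\text{men}} \neq \mu_{\text{women}} \\
\text{Correlation:}\quad & H_0: \rho = 0                              && H_1: \rho \neq 0
\end{aligned}
```

Notice that the AQI alternative only looks in one direction (less than 162), while the other two only say "not equal", so a difference in either direction would count against the null hypothesis.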
I hope you understood this very confusing concept. Keeping the null hypothesis in mind, we’ll move on to the p-value.
We can define the p-value as follows:
In statistics, the p-value is the probability of obtaining results at least as extreme as the observed results of a test, assuming that the null hypothesis is correct.
Well, that is just one of the definitions of the p-value. It is comparatively easy to understand the p-value after you understand what the null hypothesis is. The p-value is not the probability that the null hypothesis is true. It is the probability that, in a world where the null hypothesis is true, you would still see data at least as far from what the null hypothesis predicts as the data you actually collected. So if we consider the average income in the tech industry example from the last section, suppose you survey some men and women and find a gap between their average incomes. The p-value is the probability of seeing a gap at least that large purely by chance if the true average incomes are the same. We calculate the probability for the case where the income is the same (and not for the case where the income is different) because we start out by assuming that the null hypothesis is true.
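To make the income example a bit more concrete, here is a minimal sketch of one way to estimate a p-value, called a permutation test. This is just a technique I picked for illustration, and the salary numbers are completely made up. The idea: if the null hypothesis is true, the "men" and "women" labels don't matter, so we shuffle the labels many times and count how often the shuffled gap is at least as large as the gap we actually observed.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """Estimate a two-sided p-value for H0: both groups have the same mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed_gap = abs(mean(group_a) - mean(group_b))

    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under H0 the labels are interchangeable, so shuffle them.
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if gap >= observed_gap:
            at_least_as_extreme += 1

    # The +1 counts the observed arrangement itself, so the estimate is never 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Hypothetical yearly incomes (in thousands), invented for this example.
men = [72, 85, 90, 78, 95, 88, 79, 91]
women = [70, 82, 86, 75, 93, 84, 77, 89]

print("estimated p-value:", permutation_p_value(men, women))
```

A small value here would mean that a gap as big as the one in the data rarely shows up when the labels are shuffled, which is exactly the "unlikely if the null hypothesis is true" idea.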
One of the most commonly used thresholds for the p-value is 0.05. This threshold is called the significance level. If the calculated p-value turns out to be less than 0.05, the null hypothesis is rejected in favour of the alternative hypothesis. And if the value is greater than 0.05, we fail to reject the null hypothesis, which is not the same as proving it true. Let me elaborate a bit on that.
Remember that the p-value is the probability of getting results at least as extreme as ours if the null hypothesis is true, and in our example, the threshold for that probability is 0.05. So if the calculated p-value is less than 0.05, it means that results like ours would be very unlikely if the null hypothesis were true, so we reject the null hypothesis. And if the p-value is more than 0.05, then results like ours are not that surprising under the null hypothesis, so we don’t have enough evidence to reject it. That does not prove that the null hypothesis is true. It only means we couldn’t show it to be false.
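Going back to the AQI example, here is a rough sketch of how the whole decision could look in code. Everything specific here is an assumption on my part: the readings are invented, and I’m assuming the measurement noise is normally distributed with a known spread of 10 AQI units (the `sigma` parameter), which lets us use a simple one-sided z-test from Python’s standard library. In real life you usually don’t know that spread and would reach for a t-test instead.

```python
from math import sqrt
from statistics import NormalDist, mean

CLAIMED_AQI = 162  # the value stated by the null hypothesis
ALPHA = 0.05       # significance level (the threshold for the p-value)


def one_sided_p_value(readings, claimed, sigma):
    """P-value for H1: true AQI < claimed, assuming normal noise with known sigma."""
    standard_error = sigma / sqrt(len(readings))
    z = (mean(readings) - claimed) / standard_error
    # Probability of a sample mean this low or lower if H0 is true.
    return NormalDist().cdf(z)


def decide(p_value, alpha=ALPHA):
    """Reject H0 only when the p-value falls below the threshold."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Hypothetical readings from my own (imaginary) air quality sensor.
my_readings = [150, 158, 149, 161, 155, 152]

p = one_sided_p_value(my_readings, CLAIMED_AQI, sigma=10)
print("p-value:", round(p, 4), "->", decide(p))
```

Note that `decide` never returns "H0 is true". The two possible outcomes are rejecting the null hypothesis or failing to reject it.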
Again, I hope you understood that. I don’t really know how to explain these things without confusing you or getting confused myself. But I tried my best. If you have better and easier examples, please leave them below as comments.