August 10, 2019 · 2 mins read

Lesson 10 - Correlation vs Covariance

Machine learning is, to a large extent, about the relations between numbers. This problem is not limited to machine learning, however: it is also a problem in mathematics, and more precisely in statistics. In this post I want to give you a quick introduction to statistics by explaining covariance and the correlation coefficient.

This is the tenth post of my fast.ai journey. Read all posts here.

The Problem

We will use a very simple dummy dataset in this post. Below is a table with the salaries of fictional people and the prices of their houses.

[Image: table of salaries and house prices]

We want to describe the relation between these numbers, the salary and house price. In a typical machine learning approach, we could use linear regression to fit a function. However, because this post is about statistics, we will use something else.

Covariance

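Covariance is a number that indicates how two variables vary together: its sign tells you whether they tend to move in the same or in opposite directions, and its size gives a unit-dependent impression of how “strongly” they are related.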

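The formula for the (sample) covariance is:

$$\text{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$

where $\bar{x}$ is the mean of $x$, $\bar{y}$ is the mean of $y$, and $n$ is the number of data points.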

The covariance of this dataset would be:

A positive covariance means that the higher $x$ (salary) is, the higher $y$ (house price) tends to be. Because the covariance depends on the units and the scale of the data, it cannot be used to compare relations between different datasets. That is the reason the correlation coefficient is used more often.
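To make the unit dependence concrete, here is a minimal sketch in plain Python. It assumes the sample version of the covariance (dividing by $n - 1$). The numbers are hypothetical and are not the table from this post, and the function names are made up for illustration.

```python
def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    x_bar, y_bar = mean(xs), mean(ys)
    products = ((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


# Hypothetical data, NOT the table from the post.
salaries = [30_000, 45_000, 60_000, 80_000]          # in dollars
house_prices = [150_000, 210_000, 300_000, 390_000]  # in dollars

cov_dollars = sample_covariance(salaries, house_prices)

# Same salaries, but expressed in thousands of dollars.
salaries_k = [s / 1000 for s in salaries]
cov_thousands = sample_covariance(salaries_k, house_prices)

print(cov_dollars > 0)              # positive: both tend to rise together
print(cov_dollars / cov_thousands)  # the change of unit rescales the covariance
```

Nothing about the relation itself changed between the two calls, only the unit of the salary, yet the covariance shrinks by a factor of about 1000. That is why a raw covariance value is hard to interpret on its own.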

Correlation coefficient

The correlation coefficient is a normalized measure of how strongly $x$ and $y$ are related. If $r$ is the correlation coefficient for $x$ and $y$, then $-1 \le r \le 1$. A correlation coefficient of $1$ means there is a perfect positive linear relation, and $-1$ stands for a perfect negative linear relation. $0$ means there is no linear relation at all. The formula for correlation is given as:

$$r = \frac{\text{cov}(x, y)}{S_x S_y}$$

where $S_x$ and $S_y$ are the standard deviations for $x$ and $y$ respectively.

The correlation coefficient for this dataset is close to $1$, which means there is a very strong positive linear relation between the salary and house price.
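The sketch below shows one way to compute the correlation coefficient as the covariance divided by the product of the two standard deviations. It again assumes the sample versions (dividing by $n - 1$) and uses hypothetical numbers rather than the table from this post. The helper names are illustrative only.

```python
import math


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    products = ((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


def sample_std(values):
    """Sample standard deviation (square root of the sample variance)."""
    return math.sqrt(sample_covariance(values, values))


def correlation(xs, ys):
    """Correlation coefficient: cov(x, y) / (S_x * S_y).

    Assumes neither list is constant; otherwise a standard deviation
    is zero and the division fails.
    """
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))


# Hypothetical data, NOT the table from the post.
salaries = [30_000, 45_000, 60_000, 80_000]
house_prices = [150_000, 210_000, 300_000, 390_000]

r = correlation(salaries, house_prices)

# Changing the unit of the salary (dollars -> thousands of dollars)
# leaves the correlation coefficient unchanged, up to rounding.
r_thousands = correlation([s / 1000 for s in salaries], house_prices)

print(r, r_thousands)
```

Dividing by the standard deviations cancels out the units, so the result always lies between $-1$ and $1$ and can be compared across datasets.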

Conclusion

Covariance and correlation are both measures of the extent to which two columns are related. When it comes to choosing one over the other, correlation is usually preferred because it does not depend on the units or the scale of the data.

A huge thanks to Sam Miserendino for proofreading this post!