Is a person's height related to their weight? Correlation doubts.
Hello! Today I’m going to write about the correlation method, its types, and how to calculate it in Excel and R.
Correlation is a statistical measure of the degree to which two variables are related to each other. Its coefficient takes a value between -1 and +1, where -1 and +1 mean a perfect negative or positive correlation and 0 means no linear relationship. It is important to state at the beginning that correlation doesn't measure causation; it merely shows whether two variables are associated with one another.
The easiest example for explaining correlation is the height and weight of a person, so I'll use that example later on to give you a visual idea of correlation.
TYPES OF CORRELATION
Normally, we have two types of correlation — positive and negative.
A positive correlation means that when one variable increases, the other one increases as well. A common example is height and weight. As a person grows in height, it is normal that their weight increases as well. But that doesn't mean that an increase in height causes the increase in weight. There are more variables "behind the scenes" of that relationship. For example, nutrition and sports also influence a person's weight.
A negative correlation means that when one variable increases, the other one decreases. A common example is outside temperature and heating bills. As the outside temperature rises, the heating bills decrease because you need less heating for your apartment.
Correlation is usually measured by a correlation coefficient which, as I wrote earlier, goes from -1 to +1. Negative values show a negative correlation, and positive values show a positive correlation. As a rule of thumb, I consider a correlation worth looking at when the coefficient is between 0.4 and 1 (or between -0.4 and -1), but that depends on the case and the variables I'm looking at.
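To make the two directions concrete, here is a minimal R sketch. All the numbers are made up for illustration (they are not real measurements), and the variable names are my own assumptions.

```r
# Made-up illustrative values, not real measurements
height_cm <- c(150, 160, 165, 172, 180, 188)
weight_kg <- c(52, 58, 63, 70, 79, 90)

outside_temp_c <- c(-5, 0, 5, 10, 15, 20)
heating_bill   <- c(140, 120, 95, 70, 40, 15)

# cor() uses Pearson's coefficient by default
r_positive <- cor(height_cm, weight_kg)
r_negative <- cor(outside_temp_c, heating_bill)

print(r_positive)  # positive: both variables rise together
print(r_negative)  # negative: the bill falls as the temperature rises
```

The sign of the coefficient tells you the direction of the relationship, and how close it is to 1 or -1 tells you how strong it is.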
TYPES OF CORRELATION COEFFICIENTS AND HOW TO CALCULATE THEM IN EXCEL AND R
The two most frequently used correlation coefficients are Pearson's and Spearman's.
If your data is normally distributed and the relationship is roughly linear, you can use Pearson's correlation coefficient. If your data isn't normally distributed or has many outliers, use Spearman's correlation coefficient instead, as it works on the ranks of the values and doesn't assume normally distributed data.
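The difference is easiest to see on data with an outlier. The sketch below uses made-up numbers where y always rises with x, but the last value is extreme. It also shows that Spearman's coefficient is the same as Pearson's coefficient computed on ranks.

```r
# Made-up data: y rises with x, but the last point is an extreme outlier
x <- 1:10
y <- c(2, 4, 5, 7, 8, 10, 11, 13, 14, 100)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

# Spearman's coefficient is Pearson's coefficient applied to the ranks
spearman_by_hand <- cor(rank(x), rank(y))

print(pearson)           # about 0.63, pulled down by the outlier
print(spearman)          # 1, because y never decreases as x increases
print(spearman_by_hand)  # same value as spearman
```

Pearson's coefficient is dragged down by the single extreme point, while Spearman's only looks at the order of the values, so the outlier doesn't affect it.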
In R, you have a function named cor().
You pass the x and y variables to the function, and you can choose which method to use with the method argument: "pearson" (the default), "kendall", or "spearman". The most common ones are Pearson's and Spearman's, whereas the Kendall tau coefficient isn't used that much, so I won't write about it just now.
In Excel, you have a formula named =CORREL(), which returns the correlation between two data sets. You also have the =PEARSON() formula. Both return Pearson's correlation coefficient.
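Here is a minimal sketch of how the call looks. The data frame `people` and its values are hypothetical, made up only to have something to pass into cor().

```r
# Hypothetical data frame; column names and values are made up
people <- data.frame(
  height_cm = c(158, 163, 170, 175, 181, 190),
  weight_kg = c(55, 61, 66, 74, 80, 92)
)

# When method is not given, cor() uses "pearson"
r_default  <- cor(people$height_cm, people$weight_kg)
r_pearson  <- cor(people$height_cm, people$weight_kg, method = "pearson")
r_spearman <- cor(people$height_cm, people$weight_kg, method = "spearman")
r_kendall  <- cor(people$height_cm, people$weight_kg, method = "kendall")

# Passing a whole numeric data frame returns a correlation matrix
cor_matrix <- cor(people)
print(cor_matrix)
```

If you leave out the method argument you get Pearson's coefficient, so the first two calls return the same value.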
EXAMPLE OF CORRELATION
As I wrote earlier, I'm going to use the example of the height and weight of a person to show you how correlation works. I'm going to use Pearson's coefficient and assume that the variables are normally distributed.
Let's see what R calculates.
As you can see, Pearson's coefficient for the variables height and weight is 0.98, which is very high. You can see on the plot that the relationship between height and weight is almost linear (you can draw a straight line that connects, or passes very close to, those points).
In Excel, we get the same coefficient, using both formulas for PEARSON and CORREL.
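If you want to try a similar example yourself, here is a sketch that uses R's built-in `women` dataset (average heights in inches and weights in pounds of 15 American women) as a stand-in. This is an assumption on my part for illustration, not the data behind the 0.98 above, so the coefficient you get will differ slightly.

```r
# Built-in dataset shipped with R: height (in) and weight (lbs)
data(women)

# Pearson's coefficient between height and weight
r <- cor(women$height, women$weight, method = "pearson")
print(r)

# Scatter plot with a fitted straight line to check linearity by eye
plot(women$height, women$weight,
     xlab = "Height (in)", ylab = "Weight (lbs)",
     main = "Height vs. weight")
abline(lm(weight ~ height, data = women))
```

The points in the plot lie very close to the fitted line, which is what a coefficient near +1 looks like.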
Correlation is very important for regression analysis, especially linear regression. That is because, for a linear regression model to fit well, the relationship between the two variables needs to be linear: you should be able to draw a straight line that goes through, or very close to, all the dots on the plot. A dot (observation) that lies far from that line may be an outlier and needs to be handled.
More about that next time, when we'll tackle regression analysis. I hope this short tutorial about correlation has taught you something.