9  Correlation

Learning objectives

When we have finished this chapter, we should be able to:
  • Explain the concept of correlation between two numeric variables.
  • Understand the most commonly used correlation coefficients, Pearson’s \(r\) and Spearman’s \(r_{s}\).

9.1 What is correlation?

Correlation is a statistical method used to assess a possible association between two numeric variables. There are several statistics that we can use to quantify correlation depending on the underlying relation of the data. In this chapter, we’ll learn about two correlation coefficients:

  • Pearson’s \(r\)

  • Spearman’s \(r_{s}\)

Pearson’s coefficient measures linear correlation, while Spearman’s coefficient compares the ranks of the data and measures monotonic associations.

 

9.2 Linear correlation (Pearson’s \(r\) coefficient)

Graphical display with a scatter plot

The most useful graph for displaying the association between two numeric variables is a scatter plot. Figure 9.1 shows the association between systolic blood pressure (sbp) and diastolic blood pressure (dbp) in 96 patients with carotid artery disease, aged 42-89, prior to surgery. (Note that sbp and dbp can be plotted on either axis.)

Example-Association between systolic and diastolic blood pressure

Figure 9.1: Scatter plot of the association between systolic blood pressure (sbp) and diastolic blood pressure (dbp) in 96 patients with carotid artery disease, aged 42-89, prior to surgery.

From the scatter plot, there appears to be a linear association between sbp and dbp, with higher values of dbp being associated with higher values of sbp. How can we summarize this association simply? We could calculate Pearson’s correlation coefficient, \(r\), which is a measure of the linear association between two numeric variables. Pearson’s correlation coefficient is based on the sum of products about the mean of the two variables, so we shall start by considering the properties of the sum of products.

Figure 9.2: Scatter plot with axes through the mean point.

Figure 9.2 shows the scatter diagram of Figure 9.1 with two new blue axes drawn through the mean point. The distances of the points from these axes represent the deviations from the mean.

Positive product: In the top right section of Figure 9.2, the deviations from the mean of both variables, dbp and sbp, are positive. Hence, their products will be positive. In the bottom left section, the deviations from the mean of the two variables will both be negative. Again, their product will be positive.

Negative product: In the top left section of Figure 9.2, the deviation of dbp from its mean will be negative, and the deviation of sbp from its mean will be positive. The product of these will be negative. In the bottom right section, the product will again be negative.

So in Figure 9.2 most of these products will be positive, and their sum will be positive. We say that there is a positive correlation between the two variables; as one increases so does the other. If one variable decreased as the other increased, we would have a scatter diagram where most of the points lay in the top left and bottom right sections. In this case the sum of the products would be negative and there would be a negative correlation between the variables. When the two variables are not related, we have a scatter diagram with roughly the same number of points in each of the sections. In this case, there are as many positive as negative products, and the sum is zero. There is zero correlation or no correlation. The variables are said to be uncorrelated.
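The sign argument above can be checked numerically: for positively associated data, the sum of products of deviations about the means comes out positive. A minimal sketch in Python (the data here are simulated, not the chapter’s blood-pressure dataset):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated positively associated pair (illustrative values only)
x = rng.normal(130, 15, 200)           # an sbp-like variable
y = 0.5 * x + rng.normal(0, 5, 200)    # rises with x on average

# Sum of products of deviations about the means
sum_products = np.sum((x - x.mean()) * (y - y.mean()))
print(sum_products > 0)  # True: a positive association gives a positive sum
```

Flipping the sign of the slope in the simulation would put most points in the top-left and bottom-right sections, making the sum negative.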

 

Pearson’s \({r}\) correlation coefficient

Pearson’s correlation coefficient, \(r\), can be calculated for any dataset with two numeric variables. However, before we calculate Pearson’s \(r\) coefficient, we should make sure that the following assumptions are met:

Assumptions for Pearson’s \(r\) coefficient
  1. The variables are observed on a random sample of individuals (each individual should have a pair of values).
  2. There is a linear association between the two variables.
  3. For valid hypothesis testing and calculation of confidence intervals, both variables should have an approximately normal distribution.
  4. Absence of outliers in the data set.

 

Characteristics of Pearson’s correlation coefficient \(r\)

Formula

Given a set of \({n}\) pairs of observations \((x_{1},y_{1}),\ldots ,(x_{n},y_{n})\) with means \(\bar{x}\) and \(\bar{y}\) respectively, \(r\) is defined as:

\[r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n(y_i - \bar{y})^2}} \tag{9.1}\]

The \(r\) statistic shows the direction and measures the strength of the linear association between the variables.
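Equation 9.1 translates directly into code. A sketch in Python with NumPy (the six value pairs are made up for illustration), checked against NumPy’s built-in `corrcoef`:

```python
import numpy as np

def pearson_r(x, y):
    """Pearson's r via Equation 9.1: sum of products about the means."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2))

# Illustrative paired values (hypothetical sbp/dbp-like readings)
x = [118, 125, 132, 140, 152, 161]
y = [72, 78, 80, 85, 90, 95]
print(np.isclose(pearson_r(x, y), np.corrcoef(x, y)[0, 1]))  # True
```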

Range of values

The correlation coefficient is a dimensionless quantity that takes values in the range -1 to +1.

 

Direction of the association

A negative correlation coefficient indicates that one variable decreases in value as the other variable increases (and vice versa), a zero value indicates that no association exists between the two variables, and a positive coefficient indicates that both variables increase (or decrease) in value together.

Figure 9.3: The direction of association can be (a) negative, (b) no association, or (c) positive.

 

Magnitude of the association

The magnitude of association can be anywhere between -1 and +1. The stronger the correlation, the closer the correlation coefficient comes to ±1 (Figure 9.4). A correlation coefficient of -1 or +1 indicates a perfect negative or positive association, respectively (Figure 9.4 c and f).

Figure 9.4: The stronger the correlation, the closer the correlation coefficient comes to ±1.

 

Interpretation of the association

Table 9.1 demonstrates how to interpret the strength of an association.

Table 9.1: Interpretation of the values of the sample estimate of the correlation coefficient {#tbl-correlation}

Value of r              Strength of association
\(|r| \geq 0.8\)        very strong association
\(0.6 \leq |r| < 0.8\)  strong association
\(0.4 \leq |r| < 0.6\)  moderate association
\(0.2 \leq |r| < 0.4\)  weak association
\(|r| < 0.2\)           very weak association
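These thresholds are easy to encode. A hypothetical helper (the name `interpret_r` is made up) following the table above:

```python
def interpret_r(r):
    """Map |r| to the strength labels of the interpretation table."""
    a = abs(r)
    if a >= 0.8:
        return "very strong"
    elif a >= 0.6:
        return "strong"
    elif a >= 0.4:
        return "moderate"
    elif a >= 0.2:
        return "weak"
    else:
        return "very weak"

print(interpret_r(0.62))  # strong
```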

 

Anscombe’s Quartet

Anscombe’s quartet comprises four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it, and the effect of outliers and other influential observations on statistical properties.

Figure 9.5: Anscombe’s quartet. All datasets have a Pearson’s correlation of r = 0.82.

Though all four datasets have a Pearson’s correlation of \(r = 0.82\), they look very different when plotted. Graph I is a standard linear association for which Pearson’s correlation is suitable. Graph II appears to be a non-linear association, so a non-parametric analysis would be appropriate. Graph III again shows a linear association (approaching r = 1) in which an outlier has lowered the correlation coefficient. Graph IV shows no association between the two variables (X, Y), but an outlier has inflated the correlation.
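The quartet is small enough to verify directly. A sketch in Python with NumPy (values transcribed from Anscombe’s 1973 datasets):

```python
import numpy as np

# Anscombe's quartet; datasets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# All four correlations round to the same value despite the different shapes
for x, y in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    print(round(np.corrcoef(x, y)[0, 1], 2))  # 0.82 each time
```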

 

Using the data in Figure 9.1 and Equation 9.1, we find Pearson’s correlation coefficient r = 0.62. However, correlation does not mean causation.

Correlation is not causation

An observed association is not necessarily a causal one; it may be caused by other factors. Correlation indicates only association; it does not establish a cause-effect relationship.

As an example, suppose we observe that people who daily drink more than four cups of coffee have a decreased chance of developing skin cancer. This does not necessarily mean that coffee confers resistance to cancer; one alternative explanation would be that people who drink a lot of coffee work indoors for long hours and thus have little exposure to the sun, a known risk. If this is the case, then the number of hours spent outdoors is a confounding variable—a cause common to both observations. In such a situation, a direct causal link cannot be inferred; the association merely suggests a hypothesis, such as a common cause, but does not offer proof. In addition, when many variables in complex systems are studied, spurious associations can arise. Thus, association does not imply causation (Altman and Krzywinski 2015).

 

Hypothesis Testing for Pearson’s \(r\) correlation coefficient

Step 1: Determine the appropriate null hypothesis and alternative hypothesis
  • The null hypothesis, \(H_{0}\), states that the population correlation, ρ, is zero (\(ρ = 0\)): there is no association between dbp and sbp.

  • The alternative hypothesis, \(H_{1}\), states that the population correlation, ρ, is not zero (\(ρ \neq 0\)): there is an association between dbp and sbp.

Step 2: Set the level of significance, α

We set the value α=0.05 for the level of significance (type I error).

Step 3: Identify the appropriate test statistic and check the assumptions. Calculate the test statistic.

To test whether ρ is significantly different from zero, \(ρ \neq 0\), we calculate the test statistic:

\[t = \frac{r}{SE_{r}}=\frac{r}{\sqrt{(1-r^2)/(n-2)}} \tag{9.2}\]

where n is the sample size and \(SE_{r}=\sqrt{ \frac{(1-r^2)}{(n-2)}}\).

For the data in our example, the number of observations is n = 96, r = 0.62, and \(SE_{r}=\sqrt{ \frac{(1-0.62^2)}{(96-2)}}= \sqrt{ \frac{(1-0.3844)}{94}} = \sqrt{\frac{0.6156}{94}}= 0.081\).

According to Equation 9.2:

\[t = \frac{r}{SE_{r}}= \frac{0.62}{0.081}= 7.65\]

Step 4: Decide whether or not the result is statistically significant

When we perform the test, we get a value for the t-statistic (here t= 7.65) that we compare with the t-distribution with n-2 degrees of freedom (here df=94). Using a statistical calculator for the t-distribution (such as the distrACTION module in Jamovi), we can compute the probability \(Pr(T \geq 7.65)\). The p-value for a two-tailed test is then \(2 \cdot Pr(T \geq 7.65)\). In our example, the p-value < 0.001, which is less than α=0.05, so we reject \(H_{0}\).

Note that the significance of a correlation also depends upon the sample size. If the sample size is large, even a weak correlation may be significant, and for a small sample size, even a strong association may not be significant.
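Steps 3 and 4 can be reproduced from the summary figures r = 0.62 and n = 96 alone. A sketch in Python, using SciPy’s t-distribution for the p-value (small rounding differences from the hand calculation are expected):

```python
import math
from scipy import stats

r, n = 0.62, 96

# Test statistic from Equation 9.2
se_r = math.sqrt((1 - r**2) / (n - 2))
t = r / se_r                           # ≈ 7.65 (the text uses the rounded SE)

# Two-tailed p-value from the t-distribution with n-2 degrees of freedom
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(round(t, 2), p < 0.001)
```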

To find a 95% confidence interval for ρ, we use Fisher’s z transformation to get a quantity \(Z_{r}\) that has an approximately normal distribution. Fisher’s z transformation of the sample correlation coefficient r is:

\[ Z_{r}= \frac{1}{2} \ln \frac{1+r}{1-r} \tag{9.3}\]

The 95% CI of the \(Z_{r}\) is:

\[ 95\%CI= z_{r} \ \pm 1.96 \cdot SE_{z_{r}}= z_{r} \ \pm \frac{1.96}{\sqrt{n-3}}=[z_{r_{L}}, z_{r_{U}}] \tag{9.4}\]

where \(SE_{z_{r}}=\frac{1}{\sqrt{n-3}}\) and \(z_{r_{L}}, z_{r_{U}}\) are the lower and upper limits of the 95% CI of \(Z_{r}\), respectively.

Finally, we invert the confidence limits of \(Z_{r}\); then the lower and upper limits of the 95%CI of ρ are:

\[ ρ_{L}= \frac{e^{2 \cdot z_{r_{L}}}-1}{e^{2 \cdot z_{r_{L}}}+1}= 0.48 \tag{9.5}\] \[ ρ_{U}= \frac{e^{2 \cdot z_{r_{U}}}-1}{e^{2 \cdot z_{r_{U}}}+1}= 0.73 \tag{9.6}\]

The 95% CI calculated from Equation 9.5 and Equation 9.6 is 0.48 to 0.73, so there is quite a wide range of plausible correlation values associated with these data. Additionally, note that the 95% CI of ρ is asymmetric.
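The calculation in Equations 9.3-9.6 can be checked numerically; note that \(\tanh\) inverts Fisher’s z transformation, which is exactly the back-transformation written out in Equations 9.5 and 9.6. A sketch in Python:

```python
import math

r, n = 0.62, 96

# Fisher's z transformation (Equation 9.3)
z_r = 0.5 * math.log((1 + r) / (1 - r))

# 95% CI on the transformed scale (Equation 9.4)
se = 1 / math.sqrt(n - 3)
z_lo, z_hi = z_r - 1.96 * se, z_r + 1.96 * se

# Back-transform the limits to the r scale (Equations 9.5 and 9.6)
rho_lo, rho_hi = math.tanh(z_lo), math.tanh(z_hi)
print(round(rho_lo, 2), round(rho_hi, 2))  # 0.48 0.73
```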

Step 5: Interpret the results

There is evidence of a strong positive linear association between dbp and sbp (r= 0.62, 95% CI: 0.48 to 0.73, p < 0.001).

 

9.3 Rank correlation (Spearman’s \(r_{s}\) coefficient)

The basic idea of Spearman’s rank correlation is that the ranks of X and Y are obtained by first separately ordering their values from small to large and then computing the correlation between the two sets of ranks. The strength of correlation is denoted by the coefficient of rank correlation, named Spearman’s rank correlation coefficient, \(r_{s}\).

Assumptions for Spearman’s \(r_{s}\) coefficient
  1. The variables are observed on a random sample of individuals (each individual should have a pair of values).

  2. There is a monotonic association between the two variables (Figure 9.6 a and b)

In a monotonic association, the variables tend to move in the same relative direction, but not necessarily at a constant rate. So all linear associations are monotonic, but the opposite is not always true, because we can also have monotonic non-linear associations.

Figure 9.6: The association can be (a) linear monotonic (b) monotonic non-linear, or (c) non-monotonic.
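The distinction matters in practice: for a monotonic but non-linear association, Spearman’s coefficient reaches a perfect 1 while Pearson’s does not. A sketch in Python with SciPy, using y = x³ as the monotonic curve (an illustrative choice, not from the chapter’s data):

```python
from scipy import stats

# Monotonic but non-linear: y always increases with x, at a changing rate
x = list(range(1, 11))
y = [v**3 for v in x]

r, _ = stats.pearsonr(x, y)      # linear correlation: high, but below 1
rs, _ = stats.spearmanr(x, y)    # rank correlation: a perfect 1
print(round(rs, 2), r < 1)
```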

 

Characteristics of Spearman’s rank correlation coefficient \(r_{s}\)

Formula

Suppose a set of \({n}\) pairs of observations \((x_{1},y_{1}),\ldots ,(x_{n},y_{n})\). Let \(x_{i}\) and \(y_{i}\) be arranged in ascending order, and the ranks of \(x_{i}\) and \(y_{i}\) in their respective order be denoted by \(R_{x_{i}}\) and \(R_{y_{i}}\), respectively. Spearman’s rank correlation coefficient of the sample is defined as:

\[r_{s} = \frac{\sum_{i=1}^n (R_{x_i} - \bar{R_{x}})(R_{y_i} - \bar{R_{y}})}{\sqrt{\sum_{i=1}^n (R_{x_i} - \bar{R_{x}})^2 \sum_{i=1}^n(R_{y_i} - \bar{R_{y}})^2}} \tag{9.7}\]

where \(\bar{R_{x}}= \frac{1}{n} \cdot \sum_{i=1}^n R_{x_i}\) and \(\bar{R_{y}}= \frac{1}{n} \cdot \sum_{i=1}^n R_{y_i}\)
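Equation 9.7 is simply Pearson’s r applied to the ranks, which can be verified in code. A sketch in Python with NumPy and SciPy (the sample is simulated for illustration):

```python
import numpy as np
from scipy import stats

# Simulated paired sample (illustrative only)
rng = np.random.default_rng(7)
x = rng.normal(size=30)
y = x + rng.normal(scale=0.5, size=30)

# Equation 9.7 in two steps: rank each variable, then apply Pearson's r
rx = stats.rankdata(x)
ry = stats.rankdata(y)
r_on_ranks = np.corrcoef(rx, ry)[0, 1]

# SciPy's spearmanr computes the same quantity directly
rs, _ = stats.spearmanr(x, y)
print(np.isclose(r_on_ranks, rs))  # True
```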

Range of values

The interpretation of Spearman’s rank correlation coefficient \(r_{s}\) is similar to that of the Pearson correlation coefficient \(r\), and \(r_{s}\) takes values from −1 to 1.

The closer \(r_{s}\) is to 0, the weaker is the correlation; \(r_{s}=1\) indicates a perfect correlation of ranks, \(r_{s}=-1\) indicates a perfect negative correlation of ranks, and \(r_{s}=0\) indicates no monotonic correlation between ranks.

Using the data in Figure 9.1 and Equation 9.7, we find Spearman’s correlation coefficient \(r_{s}= 0.65\).