- The Pearson Correlation Coefficient
Definition: The Pearson correlation measures
the degree and direction of the linear relationship
between two variables.
- The Pearson correlation coefficient is by far
the most common measure of correlation.
- Notation: The Pearson correlation is denoted
by the letter r.
We begin by introducing the formulas for the Pearson correlation.
- Conceptual Formula
Conceptually, the Pearson Correlation is
the ratio of the variation shared by X and
Y to the variation of X and Y separately.
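The conceptual formula is:

    r = (degree to which X and Y vary together) / (degree to which X and Y vary separately)

Stated in statistical terminology:

    r = (covariability of X and Y) / (variability of X and Y separately)
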
When there is a perfect linear relationship,
every change in the X variable is accompanied
by a corresponding change in the Y variable.
In this case, all variation in X is shared
with Y, so the ratio given above has a magnitude
of 1.00: r=+1.00 for a perfect positive relationship
and r=-1.00 for a perfect negative one.
At the other extreme, when there is no linear
relationship between X and Y, then the numerator
is zero, so r=0.00.
- Sum of Products of Deviations
--- The measure of shared variability:
To calculate the Pearson correlation, it
is necessary to introduce one new concept:
The sum of the products of corresponding
deviation scores for two variables.
The sum of squares is a
similar concept we've already seen.
The sum of the squares of the deviation
scores for a variable is used to measure
the amount of variability of a single variable.
The sum of products, which is used
to measure the variability shared between
two variables, is defined as:

    SP = Σ (X - M_X)(Y - M_Y)

where M_X and M_Y are the means of the X scores and the Y scores.
Note that the name is short for the sum
of the products of corresponding deviation
scores for two variables.
To calculate the SP, you first determine
the deviation scores for each X and for each
Y, then you calculate the products of each
pair of deviation scores, and then (last)
you sum the products.
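
The three steps just described (deviation scores, products of each pair, then the sum) can be sketched in Python. This is a minimal illustration rather than the course's own software; the function name `sum_of_products` and the five pairs of scores are assumptions made up for the example.

```python
def sum_of_products(xs, ys):
    """Sum of products of corresponding deviation scores (SP)."""
    if len(xs) != len(ys):
        raise ValueError("X and Y must contain the same number of scores")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Step 1: deviation scores. Step 2: product of each pair. Step 3: sum.
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


# Hypothetical scores for five individuals (not data from these notes).
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 5]

sp = sum_of_products(x_scores, y_scores)
print("SP =", sp)
```

Unlike a sum of squares, the SP can be negative: it is negative when above-average X scores tend to go with below-average Y scores.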
- The Algebraic Formula:
As noted above, conceptually the Pearson
correlation coefficient is the ratio of the
joint covariability of X and Y, to the variability
of X and Y separately. The formula uses the
SP as the measure of covariability, and the
square root of the product of the SS for X
and the SS for Y as the measure of separate
variability. That is:

    r = SP / sqrt(SS_X * SS_Y)

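
As a rough sketch of the algebraic formula, the following Python divides the SP by the square root of the product of the two sums of squares. The helper names and the example scores are hypothetical, chosen only for illustration.

```python
import math


def deviations(values):
    """Deviation scores: each score minus the mean of its variable."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def pearson_r(xs, ys):
    """Pearson r computed as SP / sqrt(SS_X * SS_Y)."""
    dx, dy = deviations(xs), deviations(ys)
    sp = sum(a * b for a, b in zip(dx, dy))   # covariability of X and Y
    ss_x = sum(a * a for a in dx)             # variability of X alone
    ss_y = sum(b * b for b in dy)             # variability of Y alone
    return sp / math.sqrt(ss_x * ss_y)


# Hypothetical scores for five individuals (not data from these notes).
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 5]

r = pearson_r(x_scores, y_scores)
print(f"r = {r:.3f}")
```

For these scores SP = 6, SS for X = 10 and SS for Y = 6, so r = 6 / sqrt(60), or about .77.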
- Z-Scores and Pearson Correlation:
If we have scores that are expressed as
standardized scores -- Z-Scores with a mean
of zero and a variance of one -- then the
formula for the Pearson Correlation becomes
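particularly simple. It is:

    r = Σ (z_X * z_Y) / n

where n is the number of pairs of scores, and the Z-scores are
computed with the population standard deviation (dividing by n),
so that their variance is exactly one.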
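
The Z-score version can be checked with a short Python sketch. It assumes the Z-scores are formed with the population standard deviation (`statistics.pstdev`, which divides by n), so that the average cross-product of Z-scores equals r. The function names and scores are hypothetical.

```python
import math
import statistics


def z_scores(values):
    """Standardize scores to mean 0 and (population) variance 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD: divides by n
    return [(v - mean) / sd for v in values]


def pearson_r_from_z(xs, ys):
    """Pearson r as the mean of the products of paired Z-scores."""
    zx, zy = z_scores(xs), z_scores(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)


# Hypothetical scores for five individuals (not data from these notes).
x_scores = [1, 2, 3, 4, 5]
y_scores = [2, 4, 5, 4, 5]

r_z = pearson_r_from_z(x_scores, y_scores)
print(f"r from Z-scores = {r_z:.3f}")
```

If the Z-scores were instead formed with the sample standard deviation (dividing by n-1), the sum of products would have to be divided by n-1 rather than n to give the same r.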
Understanding and Interpreting
the Pearson Correlation Coefficient
- Correlation is NOT Causation!
One of the most common errors made in interpreting
correlations is to assume that a correlation
necessarily implies a cause-and-effect relationship
between the two variables. Simply stated:
Correlation is NOT Causation!
- Correlation and Restricted Range
When a correlation is computed from scores
with a restricted range, the correlation coefficient
is typically lower than it would be if it were computed
from scores with an unrestricted range.
This happens when we look at the correlation
between SAT and GPA among students in this
class, since we are only seeing those students
who were admitted to UNC. Those with low SAT
scores (who presumably would have had very
low GPA scores) were not admitted. Thus, we
have a restricted range of observed SAT scores,
and a lower correlation.
Try this ViSta Correlation
Demonstration Applet for a demonstration
of the effect of restriction of range on the
value of the correlation.
The applet produces graphics like those described below.

Automobile Weight and Horsepower

Unrestricted Relationship:
The unrestricted relationship between
automobile weight and horsepower
is shown in the corresponding scatterplot.
The Pearson correlation is .92.

Restricting the Range:
If we lived in a country that restricted
cars to have no more than 100 Hp,
then the data would be cut off at that value.

Restricted Range:
What we would see as the relationship
in a country where the maximum horsepower
is 100 Hp would be based only on
the cars with 100 Hp or less.
In the scatterplot of those cars alone,
the correlation is only
.71, rather than .92.
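
The effect of restricting the range can also be seen in a small simulation. The sketch below uses only the Python standard library and simulated scores, not the automobile data, so it will not reproduce the .92 and .71 values. The population correlation of .9, the sample size, the seed, and the cutoff at zero are all arbitrary assumptions.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson r computed as SP / sqrt(SS_X * SS_Y)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)


rng = random.Random(0)   # fixed seed so the run is repeatable
rho = 0.9                # assumed population correlation
n = 2000

# Simulate pairs: Y is rho * X plus independent noise.
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for x in xs]

# Restrict the range: keep only the pairs whose X is at or below the cutoff.
cutoff = 0.0
kept = [(x, y) for x, y in zip(xs, ys) if x <= cutoff]
xs_restricted = [x for x, _ in kept]
ys_restricted = [y for _, y in kept]

r_full = pearson_r(xs, ys)
r_restricted = pearson_r(xs_restricted, ys_restricted)
print(f"full range:       r = {r_full:.2f}")
print(f"restricted range: r = {r_restricted:.2f}")
```

The restricted sample has less variability in X, while the scatter of Y around the line is unchanged, so the correlation drops.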
- Outliers (Outriders?)
An outlier (which G&W call, for some unknown
reason, an outrider) is an individual observation
that has very large values of X and Y relative
to all the other values of X and Y. For example,
in a scatterplot of the Market Value of
many companies plotted versus their Assets,
the fact that IBM is so large compared to
any other company completely obscures the
relationship of the two variables to each other.
Outliers and Pearson Correlation:
The correlation for these variables
is .68, which is spuriously high.
In fact, the correlation is reduced
to .48 when IBM is removed.
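
A toy example shows how a single extreme observation can inflate r. The numbers below are invented for illustration and are not the Market Value and Assets data, so the values differ from the .68 and .48 reported above.

```python
import math


def pearson_r(xs, ys):
    """Pearson r computed as SP / sqrt(SS_X * SS_Y)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ss_x = sum((x - mx) ** 2 for x in xs)
    ss_y = sum((y - my) ** 2 for y in ys)
    return sp / math.sqrt(ss_x * ss_y)


# Six hypothetical "ordinary" observations with a weak relationship.
x_typical = [1, 2, 3, 4, 5, 6]
y_typical = [3, 1, 4, 2, 5, 3]

# One hypothetical observation that is extreme on both X and Y.
outlier = (30, 30)

r_without = pearson_r(x_typical, y_typical)
r_with = pearson_r(x_typical + [outlier[0]], y_typical + [outlier[1]])

print(f"without the outlier: r = {r_without:.2f}")
print(f"with the outlier:    r = {r_with:.2f}")
```

Checking the scatterplot, and recomputing r with the suspect point removed, is a simple way to see how much a single observation is driving the result.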
- Leverage (Influence) Points
Some points in a scatterplot can have a
large influence on the value of the correlation
coefficient. These points may possibly be
outliers, but not all outliers are leverage
points, nor are all leverage points outliers.
The position of the regression line is
heavily influenced by observations that are
extreme in their value on the X-variable.
The correlation coefficient is heavily influenced
by observations that are extreme on either variable.
Try this ViSta Correlation
Demonstration Applet for a demonstration
of the effect of leverage points on the value
of the correlation.
- Correlation and Strength of Relationship
The Pearson correlation measures the degree
of relationship between two variables. It
is not, however, interpreted as a percentage.
On the other hand:
- The Coefficient of Determination:
- The Coefficient of Determination, which
is the squared correlation coefficient,
measures the proportion of variation
shared between the two variables, and so
can be expressed as a percentage.
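
As a quick arithmetic check, the sketch below squares the two SAT/GPA correlations reported later in these notes (.32 and .47). The function name is a hypothetical helper, not part of any package.

```python
def coefficient_of_determination(r):
    """Squared correlation: proportion of variation shared by X and Y."""
    return r ** 2


# Correlations reported in the hypothesis-testing example of these notes.
correlations = {"MathSAT/GPA": 0.32, "VerbalSAT/GPA": 0.47}

shared = {name: coefficient_of_determination(r)
          for name, r in correlations.items()}

for name, r_squared in shared.items():
    print(f"{name}: r squared = {r_squared:.4f} ({r_squared * 100:.0f}%)")
```

Squaring shows why r itself should not be read as a percentage: a correlation of .47 corresponds to about 22% shared variation, not 47%.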
Hypothesis Testing with the
Pearson Correlation Coefficient
The Pearson Correlation coefficient is usually
computed for sample data. Often we wish to
make inferences from the sample correlation to
a value for the correlation in the population.
We can use standard hypothesis testing techniques
to make this inference.
The basic question answered by the hypothesis
testing procedure for the Pearson correlation
coefficient is whether it is significantly different
from zero: i.e., whether or not a non-zero correlation
exists in the population.
Here are the four standard hypothesis testing
steps, augmented by a visualization step:
- State the hypotheses
The hypotheses concern whether or not there
exists a non-zero correlation in the population.
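We have a 2-tailed hypothesis:

    H0: ρ = 0  (there is no correlation in the population)
    H1: ρ ≠ 0  (there is a non-zero correlation in the population)

There are also two possible 1-tailed hypotheses.
Here's one of them:

    H0: ρ ≤ 0  (there is no positive correlation in the population)
    H1: ρ > 0  (there is a positive correlation in the population)

Here ρ (rho) denotes the correlation in the population.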
- Set the decision criterion
Choose an alpha level.
The df=n-2, where n is the number of pairs.
- Gather the data
Let's use the data gathered in class about
SAT-M and GPA. We can also use the SAT-V and
GPA correlation. We observe that:
- The MathSAT/GPA correlation is .32 --
10% of the variance in GPA is explained
- The VerbalSAT/GPA correlation is .47 --
22% of the variance in GPA is explained
- Remember that these correlations have
been attenuated by restriction of range.
- Significance depends on sample size: with a larger
sample, correlations of about this size would
exceed the critical value by a wider margin. However, significance
isn't everything, as the size of the correlation
tells us how strong the relationship is.
- Visualize the data
Here are the two scatterplots:
- Evaluate the Hypothesis
For df=39, alpha=.05, we can interpolate
to find the critical one-tailed r.
- For df = 35 the critical r = .275.
- For df = 40 the critical r = .257.
- For df = 39 the critical r
= .275 - (.275 - .257)*(4/5)
= .275 - .014
= .261
Note that we don't really need to interpolate,
since both observed correlations (.32 and .47)
are beyond the larger critical value of .275.
Therefore, for both relationships, we reject
the null hypothesis that there is not a positive
correlation in the population (in plain English,
we say these correlations are significantly
greater than zero). The two SAT scales DO
significantly predict your GPAs.
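
The interpolation and the decision rule can be sketched in a few lines of Python. The two tabled critical values and the two observed correlations are taken from these notes; the function name is hypothetical, and linear interpolation between table rows is only an approximation.

```python
def interpolate_critical_r(df, df_low, r_low, df_high, r_high):
    """Linearly interpolate a critical r between two tabled df values."""
    fraction = (df - df_low) / (df_high - df_low)
    return r_low + (r_high - r_low) * fraction


# Tabled one-tailed critical values for alpha = .05 (from these notes).
critical_r = interpolate_critical_r(39, 35, 0.275, 40, 0.257)

# Observed correlations (from these notes).
observed = {"MathSAT/GPA": 0.32, "VerbalSAT/GPA": 0.47}

# One-tailed test for a positive correlation: reject H0 when r exceeds
# the critical value.
decisions = {name: r > critical_r for name, r in observed.items()}

print(f"critical r for df = 39: {critical_r:.3f}")
for name, reject in decisions.items():
    print(f"{name}: {'reject H0' if reject else 'fail to reject H0'}")
```

Both observed correlations exceed the interpolated critical value of about .261, in agreement with the conclusion above.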
Pearson Correlation and ViSta
ViSta can compute and report Pearson correlations
(as well as Spearman, point-biserial, or phi correlations),
but it does not do significance testing for the
computed correlation coefficients.
The ViSta Applet demonstrates
that you can compute Pearson correlations in two ways.
Scatterplots of the relationship between
the two variables being correlated can be obtained
by asking for a Data Visualization when
only the two variables in question are selected.
You do this by:
- Clicking on the desired data icon to select it.
- Opening the Var window (use the List
Variables item of the Data menu).
- Selecting, in the Var window, the two
variables you want to show in the scatterplot.
- Using the Data menu's Visualize item.
This will give you a scatterplot like those
shown above. Note that it will not have a diagonal
regression line on it; that line is produced by the
regression procedure, as discussed in the regression notes.