Principal Components

Artificial Example

Forrest Young's Notes

Copyright © 1999 by Forrest W. Young.

  1. Creating the Artificial Data

In this artificial example, we use two variables, X and Y, and three linear combinations of them, called A, B, and C. The data are shown below:

The three linear combinations are:

A = X + Y

B = 5X + Y

C = -2X + Y

Using ViSta 5.4.4 or later, you can create these linear combinations with the following code.

  1. Create the dataobject containing X and Y. This example assumes it is called PCAExmpl, which is the name the code below refers to (pcaexmpl).
  2. Enter the code in the listener or in a LispEditor file and evaluate it. It will produce the dataobject shown above.
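(data "PCAExmpl2" :use pcaexmpl
      :variables '("X" "Y" "A" "B" "C")
      ;; Build the three linear combinations of X and Y.
      (let* ((A (+       X  Y))     ; A =   X + Y
             (B (+ (*  5 X) Y))     ; B =  5X + Y
             (C (+ (* -2 X) Y)))    ; C = -2X + Y
        ;; List the values in the same order as the :variables names.
        (list X Y A B C)))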
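
For readers without ViSta at hand, the same three columns can be sketched in plain Python. This is only an illustration of the arithmetic, not ViSta code: the function name `linear_combinations` is hypothetical, and the sample values are placeholders, not the data used in these notes.

```python
def linear_combinations(xs, ys):
    """Return the columns A = X + Y, B = 5X + Y and C = -2X + Y."""
    a = [x + y for x, y in zip(xs, ys)]
    b = [5 * x + y for x, y in zip(xs, ys)]
    c = [-2 * x + y for x, y in zip(xs, ys)]
    return a, b, c


# Placeholder values only; these are NOT the data from the notes.
xs = [1.0, 2.0, 3.0]
ys = [3.0, 4.0, 1.0]

a, b, c = linear_combinations(xs, ys)
for row in zip(xs, ys, a, b, c):
    print(row)
```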

The correlations between these five variables are shown in this report:

Note that the total variance of X and Y equals:

Total Variance = Var X + Var Y = 12.76 + 6.00 = 18.76

The graph below shows the raw X and Y values as points, and has the direction corresponding to the linear combination A = X + Y drawn on it. This direction is found by drawing a line through the means of the two variables (7.03, 6.00) and the point obtained by adding the coefficients of the linear combination to those means. For this linear combination the coefficients are 1 and 1, so the point is (8.03, 7.00).
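
The construction of that line can be sketched in a few lines of Python. The helper name `direction_line` is hypothetical; the means and coefficients are the ones quoted above.

```python
def direction_line(means, coefficients):
    """Return the two points that define the line for a linear combination.

    The first point is the mean of (X, Y); the second is the mean shifted
    by the coefficients of the combination.
    """
    mean_x, mean_y = means
    coef_x, coef_y = coefficients
    return (mean_x, mean_y), (mean_x + coef_x, mean_y + coef_y)


# A = X + Y has coefficients (1, 1); the means are 7.03 and 6.00.
start, end = direction_line((7.03, 6.00), (1, 1))
print(start, tuple(round(value, 2) for value in end))
```

The same helper gives the lines for B and C when called with the coefficients (5, 1) and (-2, 1).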

The amount of variance in X and Y which is accounted for by A, the arbitrary linear combination, is

RSQ(XA) * VAR(X) = .584 * 12.76 = 7.44
RSQ(YA) * VAR(Y) = .115 * 6.00 = .69
TOTAL = 7.44 + .69 = 8.13

Thus, this line, which doesn't look like it is anywhere near the point cloud formed by the variables X and Y, nevertheless accounts for

8.13/18.76 = 43%

of the variance in X and Y!
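
The arithmetic above can be checked with a short Python sketch. The function name `variance_accounted_for` is hypothetical, and the inputs are the rounded squared correlations and variances quoted in these notes, so the last digit may differ slightly from the figures above.

```python
def variance_accounted_for(r_squared, variances):
    """Return per-variable amounts, their total, and the share of total variance.

    Each variable contributes r^2 (with the linear combination) times its
    own variance.
    """
    per_variable = [r2 * var for r2, var in zip(r_squared, variances)]
    total = sum(per_variable)
    share = total / sum(variances)
    return per_variable, total, share


# RSQ(XA) = .584, RSQ(YA) = .115, Var X = 12.76, Var Y = 6.00
per_variable, total, share = variance_accounted_for([0.584, 0.115], [12.76, 6.00])
print([round(value, 2) for value in per_variable], round(total, 2), f"{share:.0%}")
```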

The Linear Combination B = 5X + Y

This linear combination is shown below:

This linear combination accounts for 89% of the variance of X and Y!

The Linear Combination C = -2X + Y

This linear combination is shown below:

This one accounts for 17.65 units of the 18.76 total units of variance, which is 94%.

The Principal Component (the Linear Combination that accounts for the maximum variance possible)

Once we have obtained the first principal component of X and Y using ViSta (as explained below), we can make this plot:

Clearly, to the eye this line does a much better job of fitting the point-cloud. However, if we look at the variance accounted for, it is a total of 17.6732 units of the total of 18.7566 units of variance, which is 94.22%.

While it looks like a much better fit, mathematically it has improved only very marginally over the 17.65 units of variance (94.10%) fit by the arbitrary linear combination C.
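
The four fits can be compared side by side with a Python sketch. It rests on assumptions that are not stated in these notes: the covariance matrix of X and Y is rebuilt from the first eigenvalue (17.6732), the total variance (18.7566), and the first-component coefficients reported later in these notes (.8388 and -.5444), with the second eigenvalue taken as the remainder and the second direction taken as orthogonal to the first. All function names are hypothetical. Because the inputs are rounded, the results should land close to, but not exactly on, the figures quoted here.

```python
def covariance_from_pca(eigenvalues, first_vector):
    """Rebuild a 2x2 covariance matrix from two eigenvalues and the first eigenvector."""
    l1, l2 = eigenvalues
    a, b = first_vector
    v1 = (a, b)
    v2 = (-b, a)  # assumed second direction: orthogonal to the first
    return [[l1 * v1[i] * v1[j] + l2 * v2[i] * v2[j] for j in range(2)]
            for i in range(2)]


def variance_accounted_for(weights, cov):
    """Sum over X and Y of r^2(variable, combination) * Var(variable)."""
    # Covariance of each original variable with the linear combination.
    covs = [sum(cov[j][k] * weights[k] for k in range(2)) for j in range(2)]
    # Variance of the linear combination itself.
    var_comb = sum(weights[j] * covs[j] for j in range(2))
    # r^2 * Var(variable) simplifies to cov^2 / Var(combination).
    return sum(c * c for c in covs) / var_comb


total_variance = 18.7566
first_eigenvalue = 17.6732
cov = covariance_from_pca(
    (first_eigenvalue, total_variance - first_eigenvalue), (0.8388, -0.5444)
)

combinations = {
    "A = X + Y": (1, 1),
    "B = 5X + Y": (5, 1),
    "C = -2X + Y": (-2, 1),
    "first principal component": (0.8388, -0.5444),
}
for name, weights in combinations.items():
    amount = variance_accounted_for(weights, cov)
    print(f"{name}: {amount:.2f} units, {amount / total_variance:.1%}")
```

Under these assumptions no choice of weights can exceed the first eigenvalue, which is why C, whose direction is close to the first principal component, comes so near to the maximum.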

Using the Scatterplot's Curves Button to show the Principal Components

The scatterplot has a "Curves" button, which, when clicked, shows a dialog box listing numerous curves that can be added. One of these is called the Linear Smoother, which adds the first principal component of the two variables shown in the plot. Another choice is Normal Contours, which adds ellipses representing the bivariate normal contours, with axes which are the first and second principal components. Both options are shown in the following plot:

The heavy line is the Linear Smoother. It is obscuring, for the most part, the first axis of the Normal Contours. The second axis is the second principal component.

The second axis is orthogonal to the first, but does not appear to be at right angles because the variables X and Y are not drawn in the same units in the figure. When the horizontal axis is stretched so that it has the same units as the vertical axis, the two component lines appear at right angles, as can be seen in the next figure:
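
The effect of unequal axis units can be illustrated with a small Python sketch. The function name `displayed_angle` and the scale factor 0.5 are hypothetical (the actual scaling of the figure is not known); the first direction uses the coefficients .8388 and -.5444 reported in these notes, and the second is assumed to be the direction orthogonal to it.

```python
import math


def displayed_angle(u, v, x_scale=1.0, y_scale=1.0):
    """Angle in degrees between directions u and v as drawn on screen.

    x_scale and y_scale are the screen lengths of one data unit on each axis.
    """
    ux, uy = u[0] * x_scale, u[1] * y_scale
    vx, vy = v[0] * x_scale, v[1] * y_scale
    cos_angle = (ux * vx + uy * vy) / (math.hypot(ux, uy) * math.hypot(vx, vy))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


pc1 = (0.8388, -0.5444)
pc2 = (0.5444, 0.8388)  # assumed: orthogonal to pc1

print(displayed_angle(pc1, pc2))               # equal units on both axes
print(displayed_angle(pc1, pc2, x_scale=0.5))  # horizontal axis compressed
```

With equal units the two directions are drawn at right angles; once one axis is compressed or stretched relative to the other, the drawn angle moves away from 90 degrees even though the components are still orthogonal in the data.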

Using PRNCMP to Compute the Principal Components

We can use PRNCMP to compute the principal components. When completed, it provides the following report:

We use the coefficients .8388 and -.5444 to plot the line representing the first principal component. The eigenvalues (E-values) give us the variance accounted for by each component.

The Visualization's Biplot is shown below. The biplot has axes that are the two principal components. The horizontal axis is the first PC: It shows the direction with the most variation in the point cloud. The red lines are the original X and Y variables. This plot shows a rotation of the original data to the principal component orientation.
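
The rotation described above can be sketched in Python. This is not how ViSta draws its biplot, which may scale the axes differently; it only shows the rotation of a centered point onto the two component directions. The function name `pc_scores` is hypothetical, the means (7.03, 6.00) and coefficients (.8388, -.5444) are taken from these notes, and the orientation of the second component is an assumption (its sign could equally be flipped).

```python
def pc_scores(point, means, first_vector):
    """Rotate one (X, Y) point into principal component coordinates."""
    a, b = first_vector
    # Center the point on the means of X and Y.
    dx = point[0] - means[0]
    dy = point[1] - means[1]
    # Project onto the first direction and onto the direction orthogonal to it.
    score1 = a * dx + b * dy
    score2 = -b * dx + a * dy
    return score1, score2


means = (7.03, 6.00)
first_vector = (0.8388, -0.5444)

# A point one unit along the first component lands on the horizontal axis.
on_axis = (means[0] + first_vector[0], means[1] + first_vector[1])
print(pc_scores(on_axis, means, first_vector))

# The mean itself maps to the origin of the rotated plot.
print(pc_scores(means, means, first_vector))
```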