In this markdown we show step by step how to reduce the dimensionality of data using principal components analysis via the singular value decomposition. The source can be found here. The grade data used is here.
Principal Components Analysis
Ryan G. McClarren
November 3, 2014
Assume you have a lot of correlated data: what to do?
A simple 4-dimensional example
First, let’s generate multivariate normal random samples from the 4-D space. We have to create a covariance matrix and make sure it is positive definite:
library(MASS)    # for mvrnorm
library(ggplot2)
library(GGally)  # for ggpairs
Ndims <- 4
Nsamples <- 25
Sigma <- matrix(runif(Ndims^2, min = 0.1, max = 10), Ndims, Ndims)
Sigma <- Sigma %*% t(Sigma)
samples <- data.frame(mvrnorm(n = Nsamples, runif(Ndims, min = -2, max = 2), Sigma))
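As a quick sanity check (this snippet is my addition), Sigma %*% t(Sigma) is symmetric by construction and, for random entries like these, almost surely positive definite, so the smallest eigenvalue should come back positive:
min(eigen(Sigma, symmetric = TRUE, only.values = TRUE)$values)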
Looking at the data with ggpairs shows that the dimensions are correlated with each other:
ggpairs(samples)
The next thing we do is compute the singular value decomposition of the data matrix, in this case, samples. It is useful to make the columns of the matrix have a zero mean to help interpret the singular values.
# create mean-0 columns
norm_samples <- samples
columnMeans <- numeric(Ndims)
for (i in 1:Ndims){
  columnMeans[i] <- mean(norm_samples[, i])
  norm_samples[, i] <- norm_samples[, i] - columnMeans[i]
}
#compute SVD
sv_decomp <- svd(norm_samples)
svDF <- data.frame(singular_values = sv_decomp$d, sv = 1:Ndims)
svDF$cumulativeFrac <- sv_decomp$d/sum(sv_decomp$d)  # each singular value's share of the total (not a running sum)
ggplot(svDF,aes(x=as.factor(sv),y=singular_values, label=paste(round(cumulativeFrac*100,2),"%"))) + geom_bar(stat="identity") + scale_x_discrete("Singular Value #") + scale_y_continuous("Singular Value") + geom_text(color="red", aes(y=(sv_decomp$d)*.5))
From the figure, we see that the first singular value accounts for 62.73% of the sum of the singular values. One way to interpret this is that 62.73% of the variability of the data is represented by a single principal component. The first two singular values together account for 85.22% of the sum. That is, about 85.22% of the total variability in the data is represented by two components.
The upshot of this is that if we replace the 4-D space with an appropriate 2-D space, we can capture most of the effect of the original four dimensions.
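For reference, the same percentages can be read off directly (a small sketch I have added, using the sv_decomp object from above); cumsum gives the running totals quoted in the text:
round(100 * sv_decomp$d / sum(sv_decomp$d), 2)          # each singular value's share
round(100 * cumsum(sv_decomp$d) / sum(sv_decomp$d), 2)  # cumulative share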
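Concretely, the 2-D representation is just the centered data projected onto the first two right singular vectors (equivalently, the first two columns of the score matrix U D). A minimal sketch, using the objects defined above:
# project the centered data onto the first two principal directions
scores2d <- as.matrix(norm_samples) %*% sv_decomp$v[, 1:2]
ggplot(data.frame(PC1 = scores2d[, 1], PC2 = scores2d[, 2]), aes(x = PC1, y = PC2)) + geom_point()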
One-PC Reconstruction
Reconstructing the data using only one principal component requires using the singular value decomposition formula with all of the singular values, besides the first, set to zero.
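In the usual SVD notation (this formula is my restatement, with X denoting the centered data matrix), the rank-one reconstruction is
$$\tilde{X}_1 = U\,\mathrm{diag}(\sigma_1, 0, \ldots, 0)\,V^{T} = \sigma_1 u_1 v_1^{T},$$
which is exactly what the code below builds before adding the column means back in.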
#one_var reconstruction
normData1 <- sv_decomp$u %*% diag(c(sv_decomp$d[1],rep(0,Ndims-1))) %*% t(sv_decomp$v)
origScale1 <- data.frame(normData1)
for (i in 1:Ndims){
origScale1[,i] <- origScale1[,i] + columnMeans[i]
}
plotDF <- cbind(samples,origScale1)
colnames(plotDF) <- c("O1","O2","O3","O4","X1","X2","X3","X4")
ggplot(plotDF,aes(x=O1,y=X1)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("1 Principal Component Reconstruction") + ggtitle(paste("Dimension 1, RMS Error:", round(sqrt(mean((plotDF$O1 - plotDF$X1)^2)),2)))
ggplot(plotDF,aes(x=O2,y=X2)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("1 Principal Component Reconstruction") + ggtitle(paste("Dimension 2, RMS Error:", round(sqrt(mean((plotDF$O2 - plotDF$X2)^2)),2)))
ggplot(plotDF,aes(x=O3,y=X3)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("1 Principal Component Reconstruction") + ggtitle(paste("Dimension 3, RMS Error:", round(sqrt(mean((plotDF$O3 - plotDF$X3)^2)),2)))
ggplot(plotDF,aes(x=O4,y=X4)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("1 Principal Component Reconstruction") + ggtitle(paste("Dimension 4, RMS Error:", round(sqrt(mean((plotDF$O4 - plotDF$X4)^2)),2)))
Two-PC Reconstruction
We can do much better using 2 principal components:
#two_var reconstruction
normData2 <- sv_decomp$u %*% diag(c(sv_decomp$d[1:2],rep(0,Ndims-2))) %*% t(sv_decomp$v)
origScale2 <- data.frame(normData2)
for (i in 1:Ndims){
origScale2[,i] <- origScale2[,i] + columnMeans[i]
}
plotDF <- cbind(samples,origScale2)
colnames(plotDF) <- c("O1","O2","O3","O4","X1","X2","X3","X4")
ggplot(plotDF,aes(x=O1,y=X1)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("2 Principal Component Reconstruction") + ggtitle(paste("Dimension 1, RMS Error:", round(sqrt(mean((plotDF$O1 - plotDF$X1)^2)),2)))
ggplot(plotDF,aes(x=O2,y=X2)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("2 Principal Component Reconstruction") + ggtitle(paste("Dimension 2, RMS Error:", round(sqrt(mean((plotDF$O2 - plotDF$X2)^2)),2)))
ggplot(plotDF,aes(x=O3,y=X3)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("2 Principal Component Reconstruction") + ggtitle(paste("Dimension 3, RMS Error:", round(sqrt(mean((plotDF$O3 - plotDF$X3)^2)),2)))
ggplot(plotDF,aes(x=O4,y=X4)) + geom_point() + geom_smooth(method="lm",se=F) + scale_x_continuous("Original Data") + scale_y_continuous("2 Principal Component Reconstruction") + ggtitle(paste("Dimension 4, RMS Error:", round(sqrt(mean((plotDF$O4 - plotDF$X4)^2)),2)))
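To quantify "much better" (my addition, assuming origScale1 and origScale2 from the chunks above), compare the overall RMS reconstruction error of the two approximations across all four dimensions:
sqrt(mean((as.matrix(samples) - as.matrix(origScale1))^2))  # one-PC reconstruction
sqrt(mean((as.matrix(samples) - as.matrix(origScale2))^2))  # two-PC reconstruction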
Correcting Collinearity with Principal Components
Given data that are nearly perfectly correlated, simple linear regression models can fall on their face. Here we create data that are nearly collinear, and then produce a dependent variable that follows the simple formula y = x1 + x2 + 3 plus standard normal noise:
dHW <- data.frame(x1=0,x2=0,y=0)
dHW[1:20,"x1"] <- rnorm(20)
dHW[1:20,"x2"] <- rnorm(20,mean=dHW$x1,sd=0.01)
dHW[1:20,"y"] <- rnorm(20,mean=dHW$x1 + dHW$x2 + 3)
ggplot(dHW, aes(x=x1, y = x2)) + geom_point()
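Numerically, the two predictors are nearly identical; a quick check (my addition) of their correlation and of the condition number of the design matrix makes the problem explicit:
cor(dHW$x1, dHW$x2)                                  # essentially 1
kappa(as.matrix(dHW[, c("x1", "x2")]), exact = TRUE) # very large condition number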
Running R’s standard linear model on this gives something pretty bad:
summary(simpMod <- lm(y ~ x1 + x2, data = dHW))
##
## Call:
## lm(formula = y ~ x1 + x2, data = dHW)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9474 -0.5991 0.0162 0.5517 1.7395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.830 0.214 13.22 2.3e-10 ***
## x1 16.372 28.057 0.58 0.57
## x2 -13.857 28.012 -0.49 0.63
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.941 on 17 degrees of freedom
## Multiple R-squared: 0.896, Adjusted R-squared: 0.884
## F-statistic: 73.3 on 2 and 17 DF, p-value: 4.38e-09
In other words, we get the fitted model y ≈ 2.83 + 16.4·x1 − 13.9·x2: the two slope estimates are huge, nearly cancel each other, and neither is statistically significant, even though the overall fit (R² ≈ 0.90) looks fine.
We can fix this with principal components. First, we compute the SVD of the predictor columns:
svdLin <- svd(dHW[,c("x1","x2")])
Then we multiply the original predictor columns by the matrix of right singular vectors, V:
print(svdLin$v)
## [,1] [,2]
## [1,] 0.7065 -0.7077
## [2,] 0.7077 0.7065
newData <- dHW
newData[,c("x1","x2")] <- as.matrix(dHW[,c("x1","x2")]) %*% svdLin$v
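A quick check (my addition): because the rotated columns are X V = U D, they are orthogonal by construction, so their cross-product matrix is diagonal, and the two new predictors no longer fight each other in the regression:
round(crossprod(as.matrix(newData[, c("x1", "x2")])), 6)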
Finally, we build the model on the rotated variables:
summary(pcMod <- lm(y ~ x1 + x2, data = newData))
##
## Call:
## lm(formula = y ~ x1 + x2, data = newData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.9474 -0.5991 0.0162 0.5517 1.7395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.830 0.214 13.22 2.3e-10 ***
## x1 1.761 0.146 12.09 8.9e-10 ***
## x2 -21.376 39.647 -0.54 0.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.941 on 17 degrees of freedom
## Multiple R-squared: 0.896, Adjusted R-squared: 0.884
## F-statistic: 73.3 on 2 and 17 DF, p-value: 4.38e-09
Now the coefficient on x1 (the first rotated variable) is estimated precisely and is highly significant, while x2 (the second rotated variable) remains noise. This is what we should expect: since x2 is essentially a copy of x1 in the original data, the true relationship y = x1 + x2 + 3 can be rewritten in terms of the rotated variables, where the first rotated variable is approximately (x1 + x2)/sqrt(2). The solution to this equation gives that the coefficient for the first rotated variable should be about sqrt(2) ≈ 1.41 and the coefficient for the second should be essentially zero, which is broadly what the fit above shows.
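A quick sanity check (my addition), comparing the fitted coefficient on the first rotated variable with the sqrt(2) value suggested above; with only 20 noisy observations the estimate will not match exactly:
coef(pcMod)["x1"]  # fitted coefficient on the first rotated variable
sqrt(2)            # value implied by y = x1 + x2 + 3 when x2 is essentially x1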
How should we measure success?
Consider the following data from a course.
grades <- read.csv("PostExam2Grades.csv")
summary(grades)
## Quiz.1 Exam.1 HW.1 HW.2
## Min. : 0.0 Min. :49.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.0 1st Qu.:68.5 1st Qu.:29.8 1st Qu.: 59.5
## Median :11.0 Median :74.5 Median :38.5 Median : 77.5
## Mean :10.0 Mean :75.4 Mean :35.0 Mean : 71.8
## 3rd Qu.:14.2 3rd Qu.:82.2 3rd Qu.:43.0 3rd Qu.: 90.0
## Max. :20.0 Max. :96.0 Max. :55.0 Max. :110.0
## HW.3 Quiz.2 Quiz.3 HW.4
## Min. : 0.0 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.: 67.5 1st Qu.: 7.50 1st Qu.: 5.00 1st Qu.:83.0
## Median : 89.0 Median :10.00 Median : 7.00 Median :90.0
## Mean : 74.6 Mean : 8.11 Mean : 6.47 Mean :80.9
## 3rd Qu.: 98.0 3rd Qu.:10.00 3rd Qu.:10.00 3rd Qu.:90.0
## Max. :100.0 Max. :10.00 Max. :10.00 Max. :95.0
## Exam.2
## Min. : 39.0
## 1st Qu.: 74.8
## Median : 86.5
## Mean : 84.5
## 3rd Qu.: 95.0
## Max. :115.0
ggplot(grades,aes(x=Exam.1,y=Exam.2, color = Quiz.1 + Quiz.2 + Quiz.3, size = HW.1 + HW.2 + HW.3 + HW.4)) + geom_point() + scale_colour_gradient2("Quiz Total",low="red",mid="orange",high="blue",midpoint=20) + scale_x_continuous("Exam 1 Score", lim=c(40,100)) + scale_y_continuous("Exam 2 Score") + scale_size(range=c(1,12), "Homework Total")
There are 9 total grades, grouped into 3 categories: homework, quizzes and exams. Let’s see how the different grades vary together:
ggpairs(grades)
Let’s throw a principal components analysis at this. We will use the built-in R function prcomp(), and scale the data because the grades are on different scales, e.g., quizzes are out of 10 and exams are out of 100.
print(gradePC <- prcomp(grades,scale=T))
## Standard deviations:
## [1] 2.0424 1.0808 1.0283 0.8683 0.7418 0.6328 0.6198 0.5367 0.4756
##
## Rotation:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Quiz.1 0.3215 -0.2183 0.44268 -0.07590 0.60799 -0.415984 0.1511
## Exam.1 0.3433 -0.3539 0.33747 -0.07707 0.04394 0.679980 -0.1107
## HW.1 0.3310 0.3706 0.42528 -0.18182 -0.25607 0.062867 -0.2030
## HW.2 0.3749 0.3767 -0.04686 -0.06509 -0.08394 -0.377252 -0.4672
## HW.3 0.3955 0.2741 -0.23545 0.14641 -0.08953 0.312574 0.1321
## Quiz.2 0.2937 -0.3609 -0.52117 0.16885 0.26490 0.006468 -0.5149
## Quiz.3 0.2620 -0.3657 -0.26255 -0.69879 -0.37759 -0.192117 0.2432
## HW.4 0.3617 0.3221 -0.30067 0.06354 0.29876 0.076804 0.5636
## Exam.2 0.2942 -0.3240 0.14388 0.63931 -0.49576 -0.278856 0.2148
## PC8 PC9
## Quiz.1 -0.16210 0.23550
## Exam.1 0.40003 -0.02194
## HW.1 -0.50752 -0.41335
## HW.2 0.56776 0.14532
## HW.3 -0.31754 0.68382
## Quiz.2 -0.29522 -0.24669
## Quiz.3 -0.02433 0.03436
## HW.4 0.20757 -0.46386
## Exam.2 0.04591 -0.08524
summary(gradePC)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 2.042 1.081 1.028 0.8683 0.7418 0.6328 0.6198 0.537
## Proportion of Variance 0.463 0.130 0.117 0.0838 0.0611 0.0445 0.0427 0.032
## Cumulative Proportion 0.463 0.593 0.711 0.7945 0.8557 0.9002 0.9429 0.975
## PC9
## Standard deviation 0.4756
## Proportion of Variance 0.0251
## Cumulative Proportion 1.0000
Based on the principal components analysis, there are effectively only 5-7 independent dimensions hidden in the 9 recorded grades: 5 components capture about 86% of the variance and 7 capture about 94%. We can look at the rotation matrix to interpret the data. For instance, PC1 is a measure of the overall performance of the student (it is a weighted sum of all the grades with positive weights): the higher this score, the better the student did overall.
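One way to check the "overall performance" reading of PC1 (a rough sketch I have added, using gradePC and grades from above): compare each student's PC1 score to a simple standardized average of all nine grades. If the interpretation holds, the two should be almost perfectly correlated (up to the arbitrary sign of a principal component).
pc1_score <- gradePC$x[, 1]           # each student's score on PC1
overall   <- rowMeans(scale(grades))  # standardized average of all nine grades
cor(pc1_score, overall)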
PC2 is negative for exams and quizzes and positive for homework. Students with a high PC2 score outperformed on homework relative to exams and quizzes; a negative PC2 score indicates they rose to the occasion on quizzes and exams instead. The presence of such a component indicates that a measurable portion of the class performed differently on homework than on quizzes and exams.
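The same kind of check works for PC2 (again my addition, not part of the original analysis): build a homework-minus-exam/quiz contrast from the standardized grades and correlate it with the PC2 scores; a strong positive correlation supports the interpretation above.
hw    <- c("HW.1", "HW.2", "HW.3", "HW.4")
other <- c("Exam.1", "Exam.2", "Quiz.1", "Quiz.2", "Quiz.3")
contrast <- rowMeans(scale(grades[, hw])) - rowMeans(scale(grades[, other]))
cor(gradePC$x[, 2], contrast)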
The other PCs can be interpreted similarly. The benefit of this kind of approach is that it makes the story the data are telling a bit clearer, because you have transformed to an uncorrelated basis; as we saw in the figures above, there are strong correlations in the raw data.