The purpose of this report is to analyze data from a particular type of physical activity to determine whether it is being performed in accordance with a qualitative standard. The project is based on the paper by Velloso, Bulling, et al., “Qualitative Activity Recognition of Weight Lifting Exercises”. Source data is available at http://groupware.les.inf.puc-rio.br/har
Two data sets were provided, a training set: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
and a test set: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The goal was to construct an efficient model for predicting how an exercise was being performed based on sensor data. The final model selected was able to classify activities with an out-of-sample error rate of less than 3%.
The analysis was done in the R programming language. I began by loading and initializing a number of R packages.
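The loading messages below correspond to calls roughly like the following (a sketch of the setup; the exact calls and the parallel worker count are assumptions):
require(caret)       # modeling, data splitting, and preprocessing (loads lattice and ggplot2)
require(doParallel)  # parallel backend for train() (loads foreach, iterators, and parallel)
require(corrplot)    # correlation matrix plots
require(FactoMineR)  # principal component analysis
# register a parallel backend so caret can train across multiple cores
registerDoParallel(cores = max(1, detectCores() - 1))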
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
## Loading required package: doParallel
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
## Loading required package: corrplot
## Loading required package: FactoMineR
Next I loaded the data and removed columns that were almost entirely NA or blank, along with the row index, user name, time-stamp, and window columns. I kept the classe variable, the target factor that the model is being trained to predict.
training <- read.csv("pml-training.csv", header=TRUE, na.strings=c("NA",""))
testing <- read.csv("pml-testing.csv", header=TRUE, na.strings=c("NA",""))
clean <- function(data) {
    # drop every column whose first value is NA; in this data set those
    # are the summary columns that are almost entirely NA or blank
    for (i in ncol(data):1) {
        if (is.na(data[1, i])) {
            data <- data[, -i]
        }
    }
    data
}
# clean the training data set
training <- clean(training)
training$classe <- as.factor(training$classe)   # target factor
training <- training[, -(1:7)]   # drop row index, user, time-stamp, and window columns
# clean the test data set in the same way
testing <- clean(testing)
testing <- testing[, -(1:7)]
testing <- testing[, -53]   # drop the problem_id column, leaving the same 52 predictors
This left 52 quantitative predictor variables and one outcome variable (classe).
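A quick check of the cleaned training set (illustrative; the row count is a property of the published training file):
dim(training)            # 19622 rows, 53 columns (52 predictors plus classe)
levels(training$classe)  # "A" "B" "C" "D" "E"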
Next I split the training data set into train and test subsets to create and evaluate my predictive model.
inTrain <- createDataPartition(training$classe, p=0.7, list=FALSE)
traintrain <- training[inTrain,]
traintest <- training[-inTrain,]
Rather than using all 52 variables, I sought to reduce the number of model variables. I began by looking at the correlation between variables. Using the corrplot package, I created a plot of the variable cross-correlations.
# look for highly correlated variables
train.scale <- scale(traintrain[1:(ncol(traintrain) - 1)], center = TRUE, scale = TRUE)
cortrain <- cor(train.scale)
# apply a correlation filter at 0.90: flag variables whose pairwise
# correlation with another variable exceeds 0.9, then remove them
highlyCor <- findCorrelation(cortrain, cutoff = 0.9)
trainFiltered.scale <- train.scale[, -highlyCor]
cortrainFilt <- cor(trainFiltered.scale)
corrplot(cortrainFilt, order = "hclust", tl.cex = 0.5)
From this plot it is clear that several variables are highly correlated, both positively and negatively. For example, ‘accel_dumbbell_z’ is strongly positively correlated with ‘yaw_dumbbell’, and ‘magnet_belt_x’ is strongly negatively correlated with ‘pitch_belt’. This suggests that an accurate model should be possible with a subset of the variables.
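These specific pairs can be checked directly in the correlation matrix computed above (illustrative; the exact values depend on the random partition):
cortrain["accel_dumbbell_z", "yaw_dumbbell"]   # strong positive correlation
cortrain["magnet_belt_x", "pitch_belt"]        # strong negative correlation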
As an additional test of variable influence, I used the FactoMineR package to perform a principal component analysis. This generates a factor map of the variables indicating how close they are to each other; the longer a variable’s vector, the more it contributes to the variance. This information is further summarized to give an indication of the relative contribution of each variable.
# PCA with the FactoMineR PCA function
par(mfrow = c(1, 2))
trainpca <- PCA(traintrain[, -53], scale.unit = TRUE, ncp = 5, graph = TRUE)
This analysis indicates that the 10 variables that most affect the outcome variable are (one way to extract such a ranking is sketched after the list):
roll_belt
pitch_belt
yaw_belt
total_accel_belt
gyros_belt_x
gyros_belt_y
gyros_belt_z
accel_belt_x
accel_belt_y
accel_belt_z
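One way to produce this kind of ranking from the FactoMineR result (a sketch; I am assuming the contribution to the first principal component is the relevant measure):
# rank variables by their contribution to the first principal component
head(sort(trainpca$var$contrib[, 1], decreasing = TRUE), 10)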
I used the caret package to model the data, preprocessing the data sets with PCA at a variance threshold of 90%. The preProcess function selected and normalized principal components for further analysis.
preProc <- preProcess(traintrain[,-53], method="pca", thresh=0.9)
preProc
##
## Call:
## preProcess.default(x = traintrain[, -53], method = "pca", thresh = 0.9)
##
## Created from 13737 samples and 52 variables
## Pre-processing: principal component signal extraction, scaled, centered
##
## PCA needed 18 components to capture 90 percent of the variance
trainPC <- predict(preProc, traintrain[, -53])
testPC <- predict(preProc, traintest[, -53])
The following plot shows the first two principal components colored by the classe variable. Although there are five clear clusters, they do not align directly with the classe levels.
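A plot of this kind can be generated from the principal component scores (a sketch; the plotting choices are my own assumptions):
ggplot(trainPC, aes(x = PC1, y = PC2, colour = traintrain$classe)) +
    geom_point(alpha = 0.4) +
    labs(x = "PC1", y = "PC2", colour = "classe")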
I then ran a random forest analysis on the training subset, using 5-fold cross-validation.
tc <- trainControl(method = "cv", number = 5)   # repeats applies only to method = "repeatedcv"
modFit <- train(traintrain$classe ~ ., method = "rf",
                trControl = tc, data = trainPC, importance = TRUE)
modFit$resample
## Accuracy Kappa Resample
## 1 0.9716 0.9641 Fold1
## 2 0.9644 0.9549 Fold2
## 3 0.9632 0.9535 Fold5
## 4 0.9621 0.9521 Fold4
## 5 0.9658 0.9567 Fold3
modFit$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 2.83%
## Confusion matrix:
## A B C D E class.error
## A 3864 13 20 7 2 0.01075
## B 49 2548 50 2 9 0.04138
## C 9 31 2322 32 2 0.03088
## D 6 4 99 2136 7 0.05151
## E 3 8 20 16 2478 0.01861
For the sake of rigor, I repeated this analysis multiple times with different preprocessing threshold values, which produced a range of out-of-sample error rates. These are summarized in the following table; a sketch of how such a sweep can be scripted follows the table.
Threshold (%) | PCA components | Error rate (%)
---|---|---
80 | 13 | 3.87
85 | 16 | 3.07
90 | 20 | 2.83
95 | 26 | 2.58
99 | 37 | 2.04
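The sweep can be reproduced with a loop along the following lines (a sketch rather than the exact code used; errorForThreshold is a hypothetical helper):
# hypothetical helper: refit the model for a given PCA variance threshold
errorForThreshold <- function(thresh) {
    pp  <- preProcess(traintrain[, -53], method = "pca", thresh = thresh)
    pcs <- predict(pp, traintrain[, -53])
    fit <- train(x = pcs, y = traintrain$classe, method = "rf",
                 trControl = trainControl(method = "cv", number = 5))
    # out-of-bag error estimate (as a percentage) from the final random forest
    100 * fit$finalModel$err.rate[fit$finalModel$ntree, "OOB"]
}
sapply(c(0.80, 0.85, 0.90, 0.95, 0.99), errorForThreshold)
The fold-count comparison summarized next follows the same pattern, with number varied instead of thresh.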
I also tried varying the number of cross-validation folds to see if I could reduce the out-of-sample error (all runs used a threshold value of 90%). As the following table shows, however, this had virtually no impact on model performance.
Folds | Error rate (%)
---|---
3 | 2.64
5 | 2.83
10 | 2.80
15 | 2.65
20 | 2.81
25 | 2.72
For model evaluation, I selected the model with a threshold value of 90% and applied it to the portion of the original training set that I had set aside for testing. Results were as follows.
pred <- predict(modFit, newdata=testPC)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
predright <- pred == traintest$classe
table(pred, traintest$classe)
##
## pred A B C D E
## A 1654 21 2 3 0
## B 9 1103 24 0 5
## C 10 12 987 34 7
## D 1 1 12 924 3
## E 0 2 1 3 1067
This indicates an out-of-sample error rate of 2.55%, slightly lower than the OOB estimate from the training subset.
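The figure can be checked directly from the objects computed above (150 misclassified out of 5885 held-out rows):
# fraction misclassified on the held-out 30% split, as a percentage (~2.55)
round(100 * (1 - mean(predright)), 2)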
Finally, I applied the model to the set of 20 test cases provided. The results were as follows.
testPC <- predict(preProc, testing)
pred <- predict(modFit, testPC)
pred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Based on submission to the online grading system, all values were correct.