Executive Summary

The purpose of this report is to analyze data from a particular type of physical activity to determine whether it is being performed in accordance with a qualitative standard. This project is based on the work of Velloso, E.; Bulling, A.; et al., entitled “Qualitative Activity Recognition of Weight Lifting Exercises”. Source data is available at http://groupware.les.inf.puc-rio.br/har

Two data sets were provided, a training set: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

and a test set: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The goal was to construct an efficient model for predicting how an exercise was being performed based on sensor data. The final model selected was able to classify activities with an out-of-sample error of less than 3%.

Exploratory Data Analysis

The analysis was done in the R programming language. I began by loading and initializing a number of R packages.
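The chunk that loads the packages was not echoed; a minimal sketch consistent with the loading messages below (registering the parallel backend is an assumption) is:

library(caret)          # also attaches lattice and ggplot2
library(doParallel)     # also attaches foreach, iterators and parallel
library(corrplot)
library(FactoMineR)
registerDoParallel(cores = detectCores())   # assumed: register a parallel backend for caret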

## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
## Loading required package: doParallel
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
## Loading required package: corrplot
## Loading required package: FactoMineR

Next I loaded the data and removed columns that were almost entirely NA or blank. I also removed the row index, user name, timestamp, and window columns, since they are not sensor measurements. I kept the classe variable, the target factor the model was being trained to predict.

training <- read.csv("pml-training.csv", header=TRUE, na.strings=c("NA",""))
testing <- read.csv("pml-testing.csv", header=TRUE, na.strings=c("NA",""))
# Drop every column whose first value is NA. In this data set those are the
# derived summary columns, which are NA or blank on almost every row.
clean <- function(data) {
        for(i in ncol(data):1) {
                if(is.na(data[1,i])) {
                        data <- data[,-i]
                }
        }
        data
}

# clean the training data set
training <- clean(training)
training$classe <- as.factor(training$classe)
training <- training[,-(1:7)]   # drop index, user name, timestamp and window columns

# clean the test data set
testing <- clean(testing)
testing <- testing[,-(1:7)]
testing <- testing[,-53]        # drop the problem_id column, leaving the 52 predictors

This resulted in 52 quantitative predictor variables plus the classe outcome variable.
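A quick sanity check on the cleaned training data (the row count below is what the standard pml-training.csv should give, so treat it as an assumption):

dim(training)   # expected: 19622 rows x 53 columns (52 predictors plus classe)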

Next I split the training data set into train and test subsets to create and evaluate my predictive model.

inTrain <- createDataPartition(training$classe, p=0.7, list=FALSE)
traintrain <- training[inTrain,]
traintest <- training[-inTrain,]

Principal Component Analysis

Rather than using all 52 variables, I sought to reduce the number of model variables. I began by looking at the correlation between the different variables. Using the corrplot package, I created a plot of variable cross-correlation.

# look for highly correlated variables
train.scale <- scale(traintrain[1:(ncol(traintrain)-1)], center=TRUE, scale=TRUE)
cortrain <- cor(train.scale)
highlyCor <- findCorrelation(cortrain, 0.9)
# apply a correlation filter at 0.90, then remove the variables
# with an absolute pairwise correlation above 0.9
trainFiltered.scale <- train.scale[,-highlyCor]
cortrainFilt <- cor(trainFiltered.scale)
corrplot(cortrainFilt, order = "hclust",tl.cex=0.5)

[Plot: cross-correlation of the 52 predictors (chunk corrplot)]

From this plot, it is clear that several variables are highly correlated, both positively and negatively. For example, ‘accel_dumbbell_z’ is strongly positively correlated with ‘yaw_dumbbell’. Similarly, ‘magnet_belt_x’ is strongly negatively correlated with ‘pitch_belt’. This indicates that an accurate model should be possible with a subset of the variables.
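These specific pairs can be checked directly in the correlation matrix computed above:

# correlations behind the two examples mentioned above
cortrain["accel_dumbbell_z", "yaw_dumbbell"]   # expected strongly positive
cortrain["magnet_belt_x", "pitch_belt"]        # expected strongly negative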

As an additional test of variable influence, I used the FactoMineR package to perform a principal component analysis. This generates a factor map of the variables indicating how close they are to each other; the longer a variable's vector, the more it contributes to the variance. This information is further summarized to provide an indication of the relative variance contribution of each variable.

# PCA with function PCA
par(mfrow = c(1, 2))
trainpca <- PCA(traintrain[-53], scale.unit=TRUE, ncp=5, graph=TRUE)

[Plot: PCA variables factor map (chunk FactoMineR)]

This analysis indicates that the 10 variables contributing most to the overall variance in the predictors are (one way to extract such a ranking is sketched after the list):

roll_belt
pitch_belt
yaw_belt
total_accel_belt
gyros_belt_x
gyros_belt_y
gyros_belt_z
accel_belt_x
accel_belt_y
accel_belt_z
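The FactoMineR result stores each variable's contribution to the principal components; a ranking along these lines could be extracted as follows (a sketch, not necessarily the exact call used to produce the list above):

# percentage contribution of each variable to the first principal component
head(sort(trainpca$var$contrib[, 1], decreasing = TRUE), 10)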

Model Development

I used the caret package to model the data. I preprocessed the data sets with PCA, using a variance threshold of 90%. The preProcess function centered and scaled the variables and selected the principal components used for further analysis.

preProc <- preProcess(traintrain[,-53], method="pca", thresh=0.9)
preProc
## 
## Call:
## preProcess.default(x = traintrain[, -53], method = "pca", thresh = 0.9)
## 
## Created from 13737 samples and 52 variables
## Pre-processing: principal component signal extraction, scaled, centered 
## 
## PCA needed 18 components to capture 90 percent of the variance
trainPC <- predict(preProc, traintrain[,-53])
testPC <- predict(preProc, traintest[-53])

The following plot shows the top 2 principal components colored by the classe variable. Despite the fact that there are 5 clear clusters, they do not align directly with the classe variable.
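The code for this plot was not echoed; a minimal sketch that produces this kind of plot with ggplot2 (loaded above as a dependency of caret) is:

ggplot(trainPC, aes(x = PC1, y = PC2, colour = traintrain$classe)) +
        geom_point(alpha = 0.5) +
        labs(colour = "classe")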

[Plot: first two principal components, colored by classe (chunk PCplot)]

I then ran a random forest analysis against the training subset. I included cross-validation with 5 folds.

tc <- trainControl(method = "cv", number=5, repeats=5)
modFit <- train(traintrain$classe ~ ., method="rf",
                trControl=tc, data=trainPC, importance=TRUE)
modFit$resample
##   Accuracy  Kappa Resample
## 1   0.9716 0.9641    Fold1
## 2   0.9644 0.9549    Fold2
## 3   0.9632 0.9535    Fold5
## 4   0.9621 0.9521    Fold4
## 5   0.9658 0.9567    Fold3
modFit$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 2.83%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3864   13   20    7    2     0.01075
## B   49 2548   50    2    9     0.04138
## C    9   31 2322   32    2     0.03088
## D    6    4   99 2136    7     0.05151
## E    3    8   20   16 2478     0.01861

For the sake of rigor, I repeated this analysis multiple times with different preprocessing threshold values. This produced a range of estimated out-of-sample error rates, summarized in the following table.

Threshold (%)   Components   Error Rate (%)
80              13           3.87
85              16           3.07
90              20           2.83
95              26           2.58
99              37           2.04
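The code for these runs was not shown; a minimal sketch of how they could be generated, reusing the trainControl object tc defined above, is:

for (th in c(0.80, 0.85, 0.90, 0.95, 0.99)) {
        pp  <- preProcess(traintrain[, -53], method = "pca", thresh = th)
        pcs <- predict(pp, traintrain[, -53])
        fit <- train(x = pcs, y = traintrain$classe, method = "rf",
                     trControl = tc, importance = TRUE)
        # report the number of retained components and the final
        # out-of-bag error estimate for this threshold
        print(c(threshold = th,
                components = pp$numComp,
                oob_error = tail(fit$finalModel$err.rate[, "OOB"], 1)))
}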

I also tried varying the number of folds and repetitions for cross-validation to see if I could reduce the out-of-sample error (all runs were done with a threshold value of 90%). As the following table shows, however, this had virtually no impact on model performance.

Folds   Error Rate (%)
3       2.64
5       2.83
10      2.80
15      2.65
20      2.81
25      2.72
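Only the number argument of trainControl needs to change for these runs; for example, for 10 folds (a sketch, reusing trainPC and traintrain from above):

tc10 <- trainControl(method = "cv", number = 10)
modFit10 <- train(x = trainPC, y = traintrain$classe, method = "rf",
                  trControl = tc10, importance = TRUE)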

Model Performance

For model evaluation, I selected the model with a threshold value of 90%. I applied it to the portion of the original training set that I had set aside for testing. Results were as follows.

pred <- predict(modFit, newdata=testPC)
## Loading required package: randomForest
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
predright <- pred == traintest$classe
table(pred, traintest$classe)
##     
## pred    A    B    C    D    E
##    A 1654   21    2    3    0
##    B    9 1103   24    0    5
##    C   10   12  987   34    7
##    D    1    1   12  924    3
##    E    0    2    1    3 1067

This indicates an error rate of 2.55%, which is actually slightly lower than the out-of-bag estimate from the training subset (2.83%).
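For reference, the error rate quoted above can be computed directly from the hold-out predictions:

# hold-out misclassification rate, approximately 0.0255 on this split
1 - sum(predright) / length(predright)
# a fuller summary of the same comparison
confusionMatrix(pred, traintest$classe)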

Finally, I applied the model to the set of 20 test cases provided. The results were as follows.

testPC <- predict(preProc, testing)
pred <- predict(modFit, testPC)
pred
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Based on submission to the online grading system, all values were correct.