Summary

We will build a machine learning model to predict human activity from wearable-sensor data. We will use the HAR dataset: http://groupware.les.inf.puc-rio.br/har. Downloadable data files: training and testing

Partitioning and exploration

all.data <- read.csv('pml-training.csv')
to.predict <- read.csv('pml-testing.csv')

library(caret)
set.seed(1234)
inTraining <- createDataPartition(all.data$classe, p = 0.8, list = FALSE)  # stratify the split on the outcome
training <- all.data[inTraining,]
testing <- all.data[-inTraining,]

summary(training)

The summary shows that in some columns almost all values (more than 95%) are the same (NA, empty, or ‘no’). For example:

summary(training[,c('new_window', 'skewness_roll_belt', 'amplitude_pitch_belt')])
##  new_window  skewness_roll_belt amplitude_pitch_belt
##  no :15381           :15381     Min.   : 0.000      
##  yes:  317   #DIV/0! :    8     1st Qu.: 1.000      
##              0.000000:    4     Median : 1.000      
##              0.003969:    1     Mean   : 2.114      
##              0.005334:    1     3rd Qu.: 2.000      
##              0.005478:    1     Max.   :11.000      
##              (Other) :  302     NA's   :15381
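
The filter applied in the next section boils down to one question per column: what fraction of rows share the column's single most common value? A minimal self-contained sketch of that idea (the function name and toy data are illustrative, not part of the analysis):

```r
# For each column, the fraction of rows sharing its single most common
# value (NA counted as a value of its own).
dominant.fraction <- function(df) {
  sapply(df, function(x) max(table(x, useNA = 'always')) / length(x))
}

# Toy example: 'a' is 95% 'no', 'b' has all-distinct values.
toy <- data.frame(a = c(rep('no', 95), rep('yes', 5)), b = 1:100)
dominant.fraction(toy)   # a: 0.95, b: 0.01
```

Columns whose dominant fraction exceeds 0.95 carry almost no information and can be dropped.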

Building a model

Let’s try the easy way first:

usefulness.threshold <- nrow(training) * 0.95
useful.columns <- apply(training, 2, function(x) {
      # keep a column only if its most common value (incl. NA)
      # covers less than 95% of the rows
      sort(table(x, useNA = 'always'), decreasing = TRUE)[1] < usefulness.threshold})
useful.columns[1:7] <- FALSE  # remove row index, user names and time data

training <- training[,useful.columns]

We’ll start with a random forest because it is a powerful, easy-to-apply, off-the-shelf method. To get some numbers quickly we’ll bypass caret’s time-consuming train and call randomForest directly (maybe we’ll get a model that is good enough). However, randomForest does not support NAs, so let’s check whether our data contains any.

table(c( complete.cases(all.data[,useful.columns]), complete.cases(to.predict[,useful.columns]) ))
## 
##  TRUE 
## 19642

All rows are complete (training, testing, and the data for new predictions), so we don’t need any pre-processing such as removing rows or imputing values. We can train the model:
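
Had the check found incomplete rows, one lightweight fallback (an assumption, not needed for this dataset) is na.roughfix from the randomForest package, which imputes column medians for numeric columns and modes for factors. A self-contained sketch on toy data:

```r
library(randomForest)

# Toy data with NAs: na.roughfix fills the numeric NA with the column
# median and the factor NA with the most frequent level.
with.nas <- data.frame(x = c(1, NA, 3), y = factor(c('a', 'a', NA)))
filled <- na.roughfix(with.nas)
filled$x[2]   # 2 (median of 1 and 3)
filled$y[3]   # 'a' (most frequent level)
```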

library(randomForest)
system.time(fit <- randomForest(classe ~ ., data=training))
fit
##    user  system elapsed 
##  67.570   1.110  77.922 
## 
## Call:
##  randomForest(formula = classe ~ ., data = training) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.41%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 4479    2    0    0    1 0.000669344
## B   12 3004    4    0    0 0.005298013
## C    0   10 2724    3    0 0.004749726
## D    0    0   24 2554    2 0.010077519
## E    0    0    1    6 2872 0.002431400

Random forest internally estimates the error rate on out-of-bag (OOB) samples, which serves a purpose similar to cross-validation. In our case the OOB estimate is 0.41%. We consider this model good enough, so we will not go back to the more time-consuming train with bootstrapping.
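
The OOB estimate comes for free with any randomForest fit: each tree is trained on a bootstrap sample, and the rows left out of that sample act as a built-in test set. A minimal self-contained sketch on R's built-in iris data (iris stands in for our training set here):

```r
library(randomForest)
set.seed(1234)

# Each tree sees only a bootstrap sample; the out-of-bag rows provide
# an internal error estimate, stored per-iteration in err.rate.
f <- randomForest(Species ~ ., data = iris, ntree = 100)
oob.error <- f$err.rate[f$ntree, 'OOB']  # final OOB error estimate
```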

Out of sample error

Let’s do a final check and see how well our model performs on the held-out testing set:

predicted.classe <- predict(fit, testing)
confusionMatrix(predicted.classe, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1098    4    0    0    0
##          B    0  773    4    0    0
##          C    0    0  681    9    1
##          D    0    0    0  627    2
##          E    0    0    0    0  725
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9949          
##                  95% CI : (0.9921, 0.9969)
##     No Information Rate : 0.2798          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9936          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9949   0.9942   0.9858   0.9959
## Specificity            0.9986   0.9987   0.9969   0.9994   1.0000
## Pos Pred Value         0.9964   0.9949   0.9855   0.9968   1.0000
## Neg Pred Value         1.0000   0.9987   0.9988   0.9973   0.9991
## Prevalence             0.2798   0.1980   0.1746   0.1621   0.1855
## Detection Rate         0.2798   0.1970   0.1735   0.1598   0.1848
## Detection Prevalence   0.2808   0.1980   0.1761   0.1603   0.1848
## Balanced Accuracy      0.9993   0.9968   0.9955   0.9926   0.9979

The out of sample error is 1 − 0.9949 = 0.51% (one minus the accuracy on the held-out testing set).
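
The figure is simply one minus the reported accuracy, which confusionMatrix exposes programmatically. A self-contained sketch on toy labels (the toy vectors are illustrative):

```r
library(caret)

# Toy predictions vs. reference labels: 3 of 4 correct.
pred <- factor(c('A', 'A', 'B', 'B'), levels = c('A', 'B'))
ref  <- factor(c('A', 'B', 'B', 'B'), levels = c('A', 'B'))
cm <- confusionMatrix(pred, ref)
err <- 1 - cm$overall['Accuracy']   # 0.25
```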

Predictions

predict(fit, to.predict)