We will build a machine learning model that predicts human activity from wearable-sensor data, using the HAR dataset: http://groupware.les.inf.puc-rio.br/har. Two data files are available for download: training and testing.
all.data <- read.csv('pml-training.csv')
to.predict <- read.csv('pml-testing.csv')
library(caret)
set.seed(1234)
inTraining <- createDataPartition(all.data$X, p = 0.8, list = FALSE)  # 80/20 split (X is just the row index)
training <- all.data[inTraining,]
testing <- all.data[-inTraining,]
summary(training)
The summary shows that in some columns more than 95% of the values are identical (NA, empty, or ‘no’). For example:
summary(training[,c('new_window', 'skewness_roll_belt', 'amplitude_pitch_belt')])
## new_window skewness_roll_belt amplitude_pitch_belt
## no :15381 :15381 Min. : 0.000
## yes: 317 #DIV/0! : 8 1st Qu.: 1.000
## 0.000000: 4 Median : 1.000
## 0.003969: 1 Mean : 2.114
## 0.005334: 1 3rd Qu.: 2.000
## 0.005478: 1 Max. :11.000
## (Other) : 302 NA's :15381
Let’s try the easy way first:
Columns where almost every value is the same probably carry little predictive information, so let’s remove them.
Let’s also assume all users are similar and focus only on the sensor readings, ignoring user names and the time series. This way we can build one model for everyone instead of an individual model per user.
usefulness.threshold <- nrow(training) * 0.95
useful.columns <- apply(training, 2, function(x){
sort(table(x, useNA='always'), decreasing=TRUE)[1] < usefulness.threshold})
useful.columns[1:7] <- FALSE  # drop the row index, user name and time/window columns
training <- training[,useful.columns]
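The filtering rule above can be sanity-checked on a tiny synthetic data frame (the column names here are invented for illustration). A column counts as useful only if its most frequent value, NAs included, covers less than 95% of the rows. caret’s `nearZeroVar` implements a similar idea.

```r
# Synthetic sanity check of the usefulness filter (made-up columns,
# not the HAR data).
set.seed(1)
toy <- data.frame(
  mostly.na   = c(rep(NA, 98), 1, 2),                   # 98% NA -> drop
  mostly.no   = c(rep("no", 97), "yes", "yes", "yes"),  # 97% "no" -> drop
  sensor.roll = rnorm(100)                              # varied -> keep
)
useful <- apply(toy, 2, function(x){
  # count of the single most frequent value (NAs counted too)
  sort(table(x, useNA='always'), decreasing=TRUE)[1] < nrow(toy) * 0.95})
useful
```

The first two columns come back `FALSE` (dominated by one value) and the sensor column `TRUE`, which is exactly the behaviour we rely on above.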
We’ll start with a random forest because it’s a powerful, easy-to-apply off-the-shelf method. To get some numbers quickly we’ll bypass caret’s time-consuming train and call randomForest directly (maybe that already gives a model that is good enough). However, randomForest doesn’t support NAs, so let’s check whether our data contains any.
table(c( complete.cases(all.data[,useful.columns]), complete.cases(to.predict[,useful.columns]) ))
##
## TRUE
## 19642
All rows are complete (training, testing and the data for new predictions), so we don’t need any pre-processing such as removing rows or imputing values. We can train the model:
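Had there been NAs, a simple fallback (short of dropping rows) is median imputation of numeric columns. A minimal base-R sketch on synthetic data (`impute.median` and the toy frame are invented for illustration); randomForest also ships `na.roughfix`, which does roughly this (median for numeric columns, mode for factors):

```r
# Hypothetical fallback: median-impute numeric columns.
# (Our actual data needed no imputation.)
impute.median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

toy <- data.frame(a = c(1, NA, 3), b = c(NA, 10, 20))
imputed <- impute.median(toy)
imputed$a  # NA replaced by median(c(1, 3)) = 2
```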
library(randomForest)
system.time(fit <- randomForest(classe ~ ., data=training))
fit
## user system elapsed
## 67.570 1.110 77.922
##
## Call:
## randomForest(formula = classe ~ ., data = training)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.41%
## Confusion matrix:
## A B C D E class.error
## A 4479 2 0 0 1 0.000669344
## B 12 3004 4 0 0 0.005298013
## C 0 10 2724 3 0 0.004749726
## D 0 0 24 2554 2 0.010077519
## E 0 0 1 6 2872 0.002431400
Random forest internally estimates its error rate on out-of-bag (OOB) samples, which plays the same role as cross-validation. In our case the OOB estimate is 0.41%. We consider this model good enough, so we won’t go back to the more time-consuming train with bootstrapping.
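The OOB mechanism can be illustrated with a toy sketch (synthetic labels and a trivial stand-in “tree”, not our actual model): each tree is fit on a bootstrap sample and votes only on the rows it never saw, and the OOB error is the fraction of rows misclassified by those votes.

```r
# Toy illustration of the out-of-bag (OOB) error estimate.
set.seed(42)
y <- factor(sample(c("A", "B"), 100, replace = TRUE, prob = c(0.7, 0.3)))

n.trees <- 200
votes <- matrix(0, nrow = length(y), ncol = nlevels(y),
                dimnames = list(NULL, levels(y)))

for (i in seq_len(n.trees)) {
  in.bag <- sample(length(y), replace = TRUE)   # bootstrap sample
  oob    <- setdiff(seq_along(y), in.bag)       # rows left out of the bag
  # Stand-in "tree": always predicts the bag's majority class
  pred <- names(which.max(table(y[in.bag])))
  votes[oob, pred] <- votes[oob, pred] + 1
}

# Each row is classified by majority vote of the trees that never saw it
oob.pred  <- levels(y)[max.col(votes, ties.method = "first")]
oob.error <- mean(oob.pred != y)
```

Because no tree ever votes on a row it was trained on, this estimate behaves like held-out error without a separate validation set.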
Let’s do a final check and see how well our model performs on completely new data:
predicted.classe <- predict(fit, testing)
confusionMatrix(predicted.classe, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1098 4 0 0 0
## B 0 773 4 0 0
## C 0 0 681 9 1
## D 0 0 0 627 2
## E 0 0 0 0 725
##
## Overall Statistics
##
## Accuracy : 0.9949
## 95% CI : (0.9921, 0.9969)
## No Information Rate : 0.2798
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9936
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9949 0.9942 0.9858 0.9959
## Specificity 0.9986 0.9987 0.9969 0.9994 1.0000
## Pos Pred Value 0.9964 0.9949 0.9855 0.9968 1.0000
## Neg Pred Value 1.0000 0.9987 0.9988 0.9973 0.9991
## Prevalence 0.2798 0.1980 0.1746 0.1621 0.1855
## Detection Rate 0.2798 0.1970 0.1735 0.1598 0.1848
## Detection Prevalence 0.2808 0.1980 0.1761 0.1603 0.1848
## Balanced Accuracy 0.9993 0.9968 0.9955 0.9926 0.9979
The out-of-sample error (one minus the accuracy on the held-out test set) is 0.51%.
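The 0.51% figure can be recomputed directly from the confusion-matrix counts printed above:

```r
# Rebuild the confusion matrix from the printed counts and derive the error.
cm <- matrix(c(1098,   4,   0,   0,   0,
                  0, 773,   4,   0,   0,
                  0,   0, 681,   9,   1,
                  0,   0,   0, 627,   2,
                  0,   0,   0,   0, 725),
             nrow = 5, byrow = TRUE,
             dimnames = list(prediction = LETTERS[1:5],
                             reference  = LETTERS[1:5]))
accuracy <- sum(diag(cm)) / sum(cm)
round(1 - accuracy, 4)  # out-of-sample error: 0.0051
```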
Finally, we predict the activity class for the new observations:
predict(fit, to.predict)