We will build a machine learning model that predicts human activity from wearable-sensor data, using the HAR dataset: http://groupware.les.inf.puc-rio.br/har. Two data files are available for download: training and testing.
all.data <- read.csv('pml-training.csv')
to.predict <- read.csv('pml-testing.csv')
library(caret)
set.seed(1234)
inTraining <- createDataPartition(all.data$X, p = 0.8, list = FALSE)  # 80/20 split (X is just the row index)
training <- all.data[inTraining,]
testing <- all.data[-inTraining,]
summary(training)
The summary shows that in some columns more than 95% of the values are identical (NA, empty, or ‘no’). For example:
summary(training[,c('new_window', 'skewness_roll_belt', 'amplitude_pitch_belt')])
## new_window skewness_roll_belt amplitude_pitch_belt
## no :15381 :15381 Min. : 0.000
## yes: 317 #DIV/0! : 8 1st Qu.: 1.000
## 0.000000: 4 Median : 1.000
## 0.003969: 1 Mean : 2.114
## 0.005334: 1 3rd Qu.: 2.000
## 0.005478: 1 Max. :11.000
## (Other) : 302 NA's :15381
Let’s try the easy way first:
Columns where almost every value is the same probably carry little predictive information, so let’s remove them.
Let’s also assume all users are similar and focus only on the sensor readings, ignoring user names and the time series. This way we can build one model for everyone instead of an individual model per user.
usefulness.threshold <- nrow(training) * 0.95
useful.columns <- apply(training, 2, function(x){
sort(table(x, useNA='always'), decreasing=TRUE)[1] < usefulness.threshold})
useful.columns[1:7] <- FALSE  # drop the row index, user name and time/window columns
training <- training[,useful.columns]
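The filtering rule above can be sanity-checked on a tiny synthetic data frame (the column names here are invented for illustration). A column counts as useful only if its most frequent value, NAs included, covers less than 95% of the rows. caret’s `nearZeroVar` implements a similar idea.

```r
# Synthetic sanity check of the usefulness filter (made-up columns,
# not the HAR data).
set.seed(1)
toy <- data.frame(
  mostly.na   = c(rep(NA, 98), 1, 2),                   # 98% NA -> drop
  mostly.no   = c(rep("no", 97), "yes", "yes", "yes"),  # 97% "no" -> drop
  sensor.roll = rnorm(100)                              # varied -> keep
)
useful <- apply(toy, 2, function(x){
  # count of the single most frequent value (NAs counted too)
  sort(table(x, useNA='always'), decreasing=TRUE)[1] < nrow(toy) * 0.95})
useful
```

The first two columns come back `FALSE` (dominated by one value) and the sensor column `TRUE`, which is exactly the behaviour we rely on above.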
We’ll start with a random forest because it’s a powerful, easy-to-apply off-the-shelf method. To get some numbers quickly we’ll bypass caret’s time-consuming train and call randomForest directly (maybe that already gives a model that is good enough). However, randomForest doesn’t support NAs, so let’s check whether our data contains any.
table(c( complete.cases(all.data[,useful.columns]), complete.cases(to.predict[,useful.columns]) ))
##
## TRUE
## 19642
All rows are complete (training, testing and the data for new predictions), so we don’t need any pre-processing such as removing rows or imputing values. We can train the model:
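Had there been NAs, a simple fallback (short of dropping rows) is median imputation of numeric columns. A minimal base-R sketch on synthetic data (`impute.median` and the toy frame are invented for illustration); randomForest also ships `na.roughfix`, which does roughly this (median for numeric columns, mode for factors):

```r
# Hypothetical fallback: median-impute numeric columns.
# (Our actual data needed no imputation.)
impute.median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

toy <- data.frame(a = c(1, NA, 3), b = c(NA, 10, 20))
imputed <- impute.median(toy)
imputed$a  # NA replaced by median(c(1, 3)) = 2
```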
library(randomForest)
system.time(fit <- randomForest(classe ~ ., data=training))
fit
## user system elapsed
## 67.570 1.110 77.922
##
## Call:
## randomForest(formula = classe ~ ., data = training)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.41%
## Confusion matrix:
## A B C D E class.error
## A 4479 2 0 0 1 0.000669344
## B 12 3004 4 0 0 0.005298013
## C 0 10 2724 3 0 0.004749726
## D 0 0 24 2554 2 0.010077519
## E 0 0 1 6 2872 0.002431400
Random forest internally estimates its error rate on out-of-bag (OOB) samples, which plays the same role as cross-validation. In our case the OOB estimate is 0.41%. We consider this model good enough, so we won’t go back to the more time-consuming train with bootstrapping.
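The OOB mechanism can be illustrated with a toy sketch (synthetic labels and a trivial stand-in “tree”, not our actual model): each tree is fit on a bootstrap sample and votes only on the rows it never saw, and the OOB error is the fraction of rows misclassified by those votes.

```r
# Toy illustration of the out-of-bag (OOB) error estimate.
set.seed(42)
y <- factor(sample(c("A", "B"), 100, replace = TRUE, prob = c(0.7, 0.3)))

n.trees <- 200
votes <- matrix(0, nrow = length(y), ncol = nlevels(y),
                dimnames = list(NULL, levels(y)))

for (i in seq_len(n.trees)) {
  in.bag <- sample(length(y), replace = TRUE)   # bootstrap sample
  oob    <- setdiff(seq_along(y), in.bag)       # rows left out of the bag
  # Stand-in "tree": always predicts the bag's majority class
  pred <- names(which.max(table(y[in.bag])))
  votes[oob, pred] <- votes[oob, pred] + 1
}

# Each row is classified by majority vote of the trees that never saw it
oob.pred  <- levels(y)[max.col(votes, ties.method = "first")]
oob.error <- mean(oob.pred != y)
```

Because no tree ever votes on a row it was trained on, this estimate behaves like held-out error without a separate validation set.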
Let’s do a final check and see how well our model performs on completely new data:
predicted.classe <- predict(fit, testing)
confusionMatrix(predicted.classe, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1098 4 0 0 0
## B 0 773 4 0 0
## C 0 0 681 9 1
## D 0 0 0 627 2
## E 0 0 0 0 725
##
## Overall Statistics
##
## Accuracy : 0.9949
## 95% CI : (0.9921, 0.9969)
## No Information Rate : 0.2798
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9936
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9949 0.9942 0.9858 0.9959
## Specificity 0.9986 0.9987 0.9969 0.9994 1.0000
## Pos Pred Value 0.9964 0.9949 0.9855 0.9968 1.0000
## Neg Pred Value 1.0000 0.9987 0.9988 0.9973 0.9991
## Prevalence 0.2798 0.1980 0.1746 0.1621 0.1855
## Detection Rate 0.2798 0.1970 0.1735 0.1598 0.1848
## Detection Prevalence 0.2808 0.1980 0.1761 0.1603 0.1848
## Balanced Accuracy 0.9993 0.9968 0.9955 0.9926 0.9979
The out-of-sample error (one minus the accuracy on the held-out test set) is 0.51%.
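The 0.51% figure can be recomputed directly from the confusion-matrix counts printed above:

```r
# Rebuild the confusion matrix from the printed counts and derive the error.
cm <- matrix(c(1098,   4,   0,   0,   0,
                  0, 773,   4,   0,   0,
                  0,   0, 681,   9,   1,
                  0,   0,   0, 627,   2,
                  0,   0,   0,   0, 725),
             nrow = 5, byrow = TRUE,
             dimnames = list(prediction = LETTERS[1:5],
                             reference  = LETTERS[1:5]))
accuracy <- sum(diag(cm)) / sum(cm)
round(1 - accuracy, 4)  # out-of-sample error: 0.0051
```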
Finally, we predict the activity class for the new observations:
predict(fit, to.predict)