MachineLearningAssignment/Prediction_Assignment.Rmd at master · hoogenm/MachineLearningAssignment · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
---
title: "Prediction Assignment Writeup"
author: "Marc van den Hoogen"
date: "June 15th - June 21th, 2015"
output:
  html_document:
    keep_md: yes
    number_sections: yes
    toc: yes
  word_document: default
pandoc_args:
- +RTS
- -K256m
- -RTS
---

# Introduction / summary

The goal of the 'prediction assignment' is to predict the manner in which (six) participants did a weight lifting
exercise.
For this, I downloaded a [training dataset](https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv)
and a [test dataset](https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv).
I created a model, used cross validation, calculated an expected out of sample error. In this write-up, I will also describe _how_ I built the model, and _why_ I made the choices that I did. I also used the model to predict the (20)
test cases.

Note: the document has between _approximately_ 1,435 words (counted from Word-version using OpenOffice) and 1,539 words (counted from the HTML-version by [this online tool](http://felix-cat.com/tools/wordcount/)).

# Getting and cleaning the data

## Getting the data
* First, I download the test data and the training data. The R code is available in the repository (in the compiled version, some code will be suppressed by `echo=F`).
```{r echo=F}
if (!file.exists('pml-training.csv')) {
    download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv',
                  destfile='pml-training.csv', method='curl')
    download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv',
                  destfile='pml-testing.csv', method='curl')
}
```
* I then read the data with `read.csv`. The raw _training_ data will be in a data-frame variable called `training`
and raw testing data in `testing`. In numeric variables, the value `#DIV/0!` was present; this value was treated
as `NA` (otherwise the numeric variables cannot be treated as numeric).
```{r}
training <- read.csv('pml-training.csv', na.strings=c("NA", "#DIV/0!"))
testing <- read.csv('pml-testing.csv', na.strings=c("NA", "#DIV/0!"))
```
## Timestamp variables
I removed all _timestamp_ variables: the testing data consists of (20) single rows on which to base 20 predictions, so we need to make our predictions on a single measurement 'moment' and therefore I did not treat the data as time series data.
```{r message=F}
library(dplyr)
training2 <- select(training, -contains('timestamp')) # Remove timestamp variables
```

## Treatment of the `new_window` variable
The `new_window` variable seems to have a special meaning: in the testing dataset, all rows have `new_window` equal
to `0`. We therefore do not use rows with `new_window == 0` to make a prediction model for the new data (in the
testing set). And we can drop the variable if we subset to cases with `new_window == 0`.
```{r}
training2 <- training2[training2$new_window == 'no',]
training2 <- select(training2, -contains('new_window'))
```

## Variables that may identify specific cases
The variables `X` and `user_name` do not contain any actual measurement of **how** a participant performed an exercise,
but may (help to) identify measurements, which may lead to overfitting if included. So I decided to drop those
variables.
```{r}
training2$X <- NULL
training2$user_name <- NULL
```

## Remove variables that only have NA-values left
Variables where all rows only have `NA` left as a value (so the number of NA's in that column is equal to the
number of cases), have nothing left to contribute to the model, so we can remove those.

```{r}
columns_with_NA_only <- apply(training2, 2,
                              function (c) sum(is.na(c)) == nrow(training2))
training2 <- training2[, !columns_with_NA_only]
```

# Building and evaluating the model

## Initial choice of Machine Learning algorithm

Because of *noise* in the sensor data, the authors of the paper that was referred to in the assignment, chose for Random Forest. Our situation is not different. So I also choose for _Random Forest_, which is also known to have good prediction accuracy (compared to e.g. a single tree approach).
When I made the decision to start with random forest, I thought I could always try another approach if needed.
However, we will see that RF did work very well and a revised strategy was unnecessary.

## Splitting the 'training' data

I decided to split the data into 75% to train the model and 25% to predict the out of sample error. I decided to use
75% training data (instead of a lower value of 60% e.g.) because:

1. the training set is not too small
2. `caret` will already use resampling internally (therefore I think the risk of overfitting the model is already mitigated)
3. I want the model trained more thoroughly to use it on the official `testing` set (that I have set aside).

I will use `set.seed` to make my work reproducible.

```{r message=F}
library(caret)
set.seed(13234)
inTrainingPart <- createDataPartition(training2$classe, p=0.75, list=F)
trainingPart <- training2[inTrainingPart,]
testingPart <- training2[-inTrainingPart,]
```

## Training the model
Package `caret` provides a training function that can be used with Random Forest and also provides sensible defaults. The package also provides multicore/parallel processing, and because of the size of the
dataset and the algorithm and defaults used, I will turn it on.

```{r cache=T,message=F}
library(doParallel)
cluster <- makeCluster(3) # Adapt as needed, depending on hardware!
registerDoParallel(cluster)

model <- train(classe ~ ., data=trainingPart, method='rf', verbose=T)
```

`caret` will try several models and select a best model automatically. Now that we have such a (best) model (which `caret` selected for us), we will get some accuracy/error indications to show how well (or poor) the model did:

```{r}
# Get accuracy of 'bestTune' model:
accuracy.of.best.model <- model$results$Accuracy[as.integer(row.names(model$bestTune))]

# Get OOB error rate of the final model:
#   (note: by looking it up for the number of trees used, which is in 'ntree')
errorRate.OOB.Reported <- model$finalModel$err.rate[model$finalModel$ntree,1]
```

Results:

* The accuracy of the chosen model as reported by the algorithm is `r round(accuracy.of.best.model, 6)`
* The OOB (out of bag) error rate as reported is `r round(errorRate.OOB.Reported, 6)`.

Based on the 'high' indication for accuracy and the 'low' indication for OOB error rate, I decided to use this model to
go to the next step, and calculate an expected _out of sample error rate_ using the 25% data we set aside (see below).

## Cross-validation

Note:
According to the lectures, care should be taken to avoid overfitting with Random Forest, so cross-validation
is important.

At the end of the [video lecture on 'Random Forest'](https://class.coursera.org/predmachlearn-015/lecture/47) (week 3)
though, Jeff Leek specifically states that the `train` function in `caret` *will handle that* for you. Breiman, a developer of Random Forest, explains in the (online) section ["The out-of-bag (oob) error estimate"](http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#ooberr) how the algorithm leaves a subset of cases out of each sample (about one-third), and how the model thus results in a unbiased estimate (proven in practice).

We will still also _explicitly_ validate our model accuracy, with the 25% of 'training data' that we have set aside and left untouched.

```{r message=F}
library(randomForest)
classesReference <- testingPart$classe
testingPart$classe <- NULL

classesPredicted <- predict(model$finalModel, newdata=testingPart)
confMatrix <- confusionMatrix(data=classesPredicted, reference=classesReference)
OutOfSampleErrorRate  <- 1 - confMatrix$overall['Accuracy']; names(OutOfSampleErrorRate) = 'ErrorRate'
confMatrix
```

**The expected out-of-sample error is calculated as `r round(OutOfSampleErrorRate, 4)`**.

The Kappa statistic value (based on the 'hold out' dataset), is `r round(confMatrix$overall['Kappa'],3)`.

# Predictions for the official (downloaded) test set

First, we will apply the same preprocessing to the dataset, which in this case consists mainly of removing columns (given the dataset and the use of Random Forest, no scaling or normalizing was needed). Then we will apply `predict()` on the model and the _newdata_. The result will be used for the second part of the assignment: the submission
of the predictions on the 20 test cases.

```{r}
testing2 <- select(testing, -contains('timestamp'))
# We already checked that dataset 'testing' has only rows with new_window == no,
# we will remove the column however
testing2 <- select(testing2, -contains('new_window'))
testing2$X <- NULL
testing2$user_name <- NULL

testing2 <- testing2[,!columns_with_NA_only]

# make sure the prediction order matches the problem IDs
testing2 <- arrange(testing2, problem_id)

predictions_on_pml_testing <- predict(model$finalModel, newdata=testing2)
predictions_on_pml_testing
```

To write out the final results to (individual) answer-files, I used the script suggested by the course assignment:

```{r}
pml_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    filename = paste0("problem_id_",i,".txt")
    write.table(x[i],file=filename,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}
pml_write_files(predictions_on_pml_testing)
```

The creation of the answer files finishes this writeup (and provides the files necessary for the 'submission' part of the assignment).