We have been using:
- Training data
Not representative of production.
We want to pretend like we are seeing new data:
- Hold back some data
When we train the model, we do so on some data. This is called training data.
Up to now, we have been using the same training data to measure our accuracy.
If we create a lookup table, our accuracy will be 100%. But this doesn’t generalise to new examples.
So instead we want to pretend like we have new examples and use that to test our model.
In other words, we hold back some data.
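The lookup-table argument can be made concrete with a small sketch. The toy dataset, the 70/30 split and the fallback guess for unseen inputs are all assumptions chosen for illustration; they are not taken from any particular model.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical toy data: the label is 1 when the feature is >= 50.
data = [(x, int(x >= 50)) for x in range(100)]
random.shuffle(data)
train, held_back = data[:70], data[70:]

# A "model" that simply memorises every training example.
lookup = {x: y for x, y in train}

def predict(x):
    # Inputs that were never seen fall back to a fixed guess of class 0.
    return lookup.get(x, 0)

def accuracy(rows):
    return sum(predict(x) == y for x, y in rows) / len(rows)

train_accuracy = accuracy(train)          # perfect by construction
held_back_accuracy = accuracy(held_back)  # only as good as the fallback guess
print(f"training accuracy:  {train_accuracy:.2f}")
print(f"held-back accuracy: {held_back_accuracy:.2f}")
```

The training score says nothing about new examples here, which is why the held-back score is the one worth reporting.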
Separate the data into a training set and a test set.
Test set size approx. 10-40%.
Minimum size depends on number of features and complexity of model.
When we obtain a dataset, we would separate the data into a training set and a test set.
The size of the test set is usually somewhere between 10% and 40% of the size of the whole dataset.
Generally, the more data you have, the smaller the test set can become.
This way, we can estimate how the algorithm would perform if it were to see new data (assuming that the random elements of the data are stationary, i.e. future data has the same statistical properties as the data we collected!)
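A minimal sketch of a holdout split might look like the following. The function name `holdout_split`, the 20% default and the fixed seed are assumptions for illustration; the sketch also shuffles a copy of the rows first so that the split does not depend on the order the data arrived in.

```python
import random

def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    # Always hold back at least one row for testing.
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(50))  # stand-in for 50 observations
train, test = holdout_split(rows, test_fraction=0.2)
print(len(train), len(test))  # 40 10
```

Every observation ends up in exactly one of the two sets, so the test rows are never seen during training.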
However, there are issues with a simple holdout technique like this.
Put simply, think really hard about the test data
- Is it independent from the training?
- Does it represent realistic data?
Common structures found within data:
- Key (e.g. when obtaining data from a database)
- Value or label (e.g. all elements of class 0 first, then class 1, …)
- Only sampled from certain geographies. Doesn’t scale to other geographies
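Thankfully the fix for ordered data is simple. Always randomise the data before splitting and training. (Randomising cannot fix gaps in how the data was sampled, such as missing geographies; that needs more representative data.)
One issue is that the data in a set often has structure. E.g. it could be ordered or collected in such a way that when we pick an observation to train or test against, it doesn’t represent the population.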
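The effect of ordered labels can be seen in a small sketch. The dataset below is hypothetical (80 rows of class 0 followed by 20 rows of class 1), and the 80/20 split and seed are assumptions for illustration.

```python
import random

# Hypothetical dataset sorted by label: all class 0 rows first, then class 1.
rows = [(i, 0) for i in range(80)] + [(i, 1) for i in range(80, 100)]

def label_counts(subset):
    """Count how many rows of each class a subset contains."""
    counts = {0: 0, 1: 0}
    for _, label in subset:
        counts[label] += 1
    return counts

# Naive split: the last 20% of the ordered data becomes the test set.
naive_train, naive_test = rows[:80], rows[80:]
print("naive train:", label_counts(naive_train))  # {0: 80, 1: 0}
print("naive test: ", label_counts(naive_test))   # {0: 0, 1: 20}

# Shuffled split: randomise a copy of the rows before splitting.
shuffled = rows[:]
random.Random(0).shuffle(shuffled)
shuffled_train, shuffled_test = shuffled[:80], shuffled[80:]
print("shuffled train:", label_counts(shuffled_train))
print("shuffled test: ", label_counts(shuffled_test))
```

With the naive split the model never sees class 1 during training and is tested on nothing else, so neither set represents the population.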
Imagine trying to tune a hyperparameter…
Can you see the issue?
We’re using the test set to train hyperparameters!
We saw above that a common task is to alter some parameter of the model to improve performance.
If we repeatedly alter the hyperparameter to maximise the test set score, we’re not really finding the best model. We are tuning the hyperparameters to best represent the test set.
Can you see the issue here?
We are using the test set to train our hyperparameters! Over time we would overfit our model to the test set!
The simplest fix for this problem is to introduce another holdout set called the validation set.
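The validation dataset is a second holdout set that is used only when computing final accuracies. (Be aware that many texts swap the two names, tuning on a validation set and reporting final accuracy on a test set.)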
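A sketch of the three-way split, using the naming from these notes (tune on the test set, report once on the validation set). The noisy 1-D data, the 60/20/20 proportions, the small k-nearest-neighbour classifier and the candidate values of `k` are all assumptions chosen to keep the example short.

```python
import random

rng = random.Random(0)

def make_row():
    # Hypothetical data: label is 1 when x > 0.5, with 15% of labels flipped.
    x = rng.random()
    y = int(x > 0.5)
    if rng.random() < 0.15:
        y = 1 - y
    return x, y

rows = [make_row() for _ in range(300)]
rng.shuffle(rows)

# 60% train, 20% test (for tuning), 20% validation (final score only).
train, test, validation = rows[:180], rows[180:240], rows[240:]

def knn_predict(train_rows, x, k):
    # Majority vote among the k training rows closest to x.
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return int(votes * 2 > k)

def accuracy(train_rows, eval_rows, k):
    hits = sum(knn_predict(train_rows, x, k) == y for x, y in eval_rows)
    return hits / len(eval_rows)

# Tune the hyperparameter k against the test set only.
candidates = [1, 3, 5, 9, 15]
test_scores = {k: accuracy(train, test, k) for k in candidates}
best_k = max(test_scores, key=test_scores.get)

# Touch the validation set exactly once, after tuning has finished.
validation_score = accuracy(train, validation, best_k)
print(f"best k: {best_k}, test score: {test_scores[best_k]:.2f}")
print(f"validation score: {validation_score:.2f}")
```

The test score for the chosen `k` is optimistic because `k` was picked to maximise it; the validation score is the number to report.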
- Significantly reduces the amount of data available for training
The main issue with using a validation set is obvious from the image. It significantly reduces the amount of data that can be used to train the model.
This will ultimately hurt model performance, since more training data usually means better performance.
The simplest fix is to retrain the best model using all of the training and test data put together…
But while we are choosing between models, we’re still not using the data in the test set to train them. The data in the test dataset might be important to train the model.
- For each new training run, pick a new subset of the data to train/test against.
Cross-validation is a process where we repeatedly perform a fitting procedure but each time pick a new subset of the data to hold back as the test set.
This way every observation is used for both training and testing at some point, but we are still able to pick the best model before final validation on an independent dataset.
The number of iterations, and therefore the number of subsets the data is divided into, is called the number of folds. In the example above there
- Run through all the folds per training run
Then we have statistics about how consistent our model is over the various folds.
I.e. we can calculate the mean and standard deviation of our score.
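As a rough sketch, k-fold cross-validation and the resulting statistics could be written as below. The helper `k_fold_indices`, the synthetic two-class data, the midpoint-threshold classifier and the choice of 5 folds are assumptions made for illustration only.

```python
import random
import statistics

def k_fold_indices(n_rows, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs; each row is tested exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

# Hypothetical data: class 0 centred on 0.0, class 1 centred on 2.0.
rng = random.Random(1)
rows = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
rows += [(rng.gauss(2.0, 1.0), 1) for _ in range(100)]

def fit_midpoint(train_rows):
    # "Training" here is just finding the midpoint between the class means.
    mean0 = statistics.mean(x for x, y in train_rows if y == 0)
    mean1 = statistics.mean(x for x, y in train_rows if y == 1)
    return (mean0 + mean1) / 2

def score(threshold, test_rows):
    # Accuracy of predicting class 1 whenever x is above the threshold.
    hits = sum(int(x > threshold) == y for x, y in test_rows)
    return hits / len(test_rows)

scores = []
for train_idx, test_idx in k_fold_indices(len(rows), n_folds=5):
    threshold = fit_midpoint([rows[i] for i in train_idx])
    scores.append(score(threshold, [rows[i] for i in test_idx]))

mean_score = statistics.mean(scores)
std_score = statistics.stdev(scores)
print(f"accuracy: {mean_score:.3f} +/- {std_score:.3f} over {len(scores)} folds")
```

A small standard deviation suggests the model behaves consistently across folds; a large one suggests the score depends heavily on which rows were held back.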
The main issue is the additional time required to repeat the training process for each fold.
This becomes increasingly problematic for models that are expensive to train, such as deep neural networks.
Let’s talk about visualising overfitting