Approaching the two reasons why your model is not able to generalize well
If you’ve ever encountered the following problem:
I’m training a neural network and the training loss decreases, but the validation loss increases from the first epoch on. How can I fix this?
then you should definitely read this post to better understand the two main reasons why your model doesn’t generalize well.
Let us start with the kind of overfitting most machine learning practitioners are familiar with: overfitting caused by aleatory uncertainty or, simply put, overfitting caused by noisy data.
Here we have to deal with the fact that the process generating real data often exhibits intrinsic randomness. Let us illustrate this phenomenon with a small regression example. Imagine the underlying relation between the input 𝑥 (independent variable) and the output 𝑦 (dependent variable) is defined by a third degree polynomial:
depicted in the figure below:
We can add some randomness to the above data generating process by adding a small random number to the original value:
The noise value may come from an arbitrary probability distribution. Needless to say, in real data the mechanism causing randomness is usually much more complex.
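To make the data generating process concrete, here is a minimal sketch in plain Python. The coefficients, the uniform noise distribution and the names `ground_truth` and `noisy_observation` are assumptions made for illustration; they are not the values used for the figures in this post.

```python
import random

# Hypothetical coefficients of the third degree polynomial (assumed for illustration).
A, B, C, D = 1.0, -0.5, -0.8, 0.3

def ground_truth(x):
    """Noise-free relation between input x and output y."""
    return A * x**3 + B * x**2 + C * x + D

def noisy_observation(x, rng, noise_scale=0.1):
    """Ground truth plus a small random number (uniform noise is an assumption)."""
    return ground_truth(x) + rng.uniform(-noise_scale, noise_scale)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
train_x = [-0.9, -0.3, 0.3, 0.9]
train_set = [(x, noisy_observation(x, rng)) for x in train_x]

for x, y in train_set:
    print(f"x={x:+.1f}  clean={ground_truth(x):+.3f}  noisy={y:+.3f}")
```

Any other noise distribution (for example Gaussian) could be swapped in without changing the argument that follows.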
Next, let us generate the training set, taking the aleatory uncertainty just described into account:
A possible validation set could be:
To demonstrate the effect of overfitting, we next fit a polynomial of a higher degree than the one used to generate the data. Say, we fit a polynomial of the 4th degree:
In our toy example we define the loss as the average distance between the predicted 𝑦 values and the noisy ground truth values. The goal of training/fitting the model consists of minimizing the loss and hence of minimizing the distances.
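As a small sketch of this loss, we can read "average distance" as the mean absolute difference between prediction and target. That reading, the helper names and the numbers below are assumptions for illustration only.

```python
def poly(coeffs, x):
    """Evaluate a polynomial; coeffs[k] multiplies x**k."""
    return sum(c * x**k for k, c in enumerate(coeffs))

def mean_abs_distance(y_pred, y_true):
    """Average vertical distance between predictions and targets."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)

# Hypothetical model (y = x, written as a 4th degree polynomial) and data.
coeffs = [0.0, 1.0, 0.0, 0.0, 0.0]
xs = [-1.0, 0.0, 1.0]
ys = [-1.5, 0.5, 1.0]

loss = mean_abs_distance([poly(coeffs, x) for x in xs], ys)
print(f"loss = {loss:.4f}")  # distances 0.5, 0.5 and 0.0 average to 1/3
```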
We observe that from a certain epoch on the model starts to adapt to the noise. This shouldn’t surprise us, since it’s the only way to decrease the distance to the ground truth points even further. However, this causes the fitted model (red) to deviate more and more from the actual model (blue).
At the end of the training, we obtain a training loss of zero, since the red line passes exactly through the four training points. However, don’t be fooled by a low training loss, since it says nothing about the generalization capability of your fitted model. How bad the situation really is becomes obvious when we take a look at the loss evaluated on the validation set.
We see that at the end of the training the validation loss is much larger than the training loss. This is usually a strong hint that we’re overfitting.
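The gap between the two losses can be reproduced with a small sketch. Note the swapped-in technique: instead of iterative training, it directly computes the minimum-norm 4th degree polynomial passing through all four noisy training points, which is one of infinitely many such polynomials. The coefficients, sample positions and noise level are assumptions for illustration.

```python
import random

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def features(x, degree=4):
    return [x**k for k in range(degree + 1)]

def min_norm_interpolant(xs, ys, degree=4):
    """Smallest-coefficient polynomial of the given degree through all points."""
    X = [features(x, degree) for x in xs]
    gram = [[sum(a * b for a, b in zip(ri, rj)) for rj in X] for ri in X]  # X X^T
    alpha = solve(gram, ys)
    return [sum(alpha[i] * X[i][k] for i in range(len(xs))) for k in range(degree + 1)]

def loss(coeffs, xs, ys):
    """Mean absolute distance between the fitted polynomial and the targets."""
    preds = [sum(c * x**k for k, c in enumerate(coeffs)) for x in xs]
    return sum(abs(p - y) for p, y in zip(preds, ys)) / len(ys)

def truth(x):  # hypothetical third degree ground truth
    return x**3 - 0.5 * x**2 - 0.8 * x + 0.3

rng = random.Random(1)
train_x = [-0.9, -0.3, 0.3, 0.9]
val_x = [-0.6, 0.0, 0.6, 0.75]
train_y = [truth(x) + rng.uniform(-0.1, 0.1) for x in train_x]
val_y = [truth(x) + rng.uniform(-0.1, 0.1) for x in val_x]

coeffs = min_norm_interpolant(train_x, train_y)
train_loss = loss(coeffs, train_x, train_y)
val_loss = loss(coeffs, val_x, val_y)
print(f"train loss = {train_loss:.2e}, validation loss = {val_loss:.2e}")
```

The training loss is zero up to rounding error, while the validation loss is not, because the fitted curve has absorbed the noise of the training points.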
Our first insight is: when you train a neural network on a small dataset, the network generally memorizes the training dataset instead of learning the general structure of the data. For this reason, the model will perform well on the training set and poorly on new data (e.g. the validation dataset).
Once we’ve realized we’re facing overfitting, what can we do about it?
Solutions: More data
The simplest way to reduce aleatory overfitting is to increase the size of the training data:
Now, the capacity of the 4th degree polynomial is not sufficient to pass exactly through all eight training points. Hence, the model cannot adapt to the noise as much as before, resulting in a much better fit (compare the red and the blue curves). Please note that, unlike before, the training loss is not exactly zero at the end of the training.
When we evaluate the model on the validation set, we see that it generalizes quite well.
When the model is trained on a sufficient amount of data, the final training loss and the validation loss should both be very low. Furthermore, both values should be very similar.
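Here is a rough sketch of the same idea with eight training points. It fits the 4th degree polynomial by ordinary least squares (squared distance rather than the average distance used above, which is an assumption made to get a closed-form solution). The ground truth, noise level and sample positions are hypothetical.

```python
import random

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit_least_squares(xs, ys, degree=4):
    """Least-squares polynomial fit via the normal equations X^T X w = X^T y."""
    X = [[x**k for k in range(degree + 1)] for x in xs]
    n = degree + 1
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(n)]
    return solve(xtx, xty)

def predict(coeffs, x):
    return sum(c * x**k for k, c in enumerate(coeffs))

def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)

def truth(x):  # hypothetical third degree ground truth
    return x**3 - 0.5 * x**2 - 0.8 * x + 0.3

NOISE = 0.1
rng = random.Random(2)
train_x = [-1.0 + 2.0 * i / 7 for i in range(8)]  # eight evenly spaced inputs
train_y = [truth(x) + rng.uniform(-NOISE, NOISE) for x in train_x]

coeffs = fit_least_squares(train_x, train_y)
train_loss = mean_abs([predict(coeffs, x) - y for x, y in zip(train_x, train_y)])
truth_gap = mean_abs([predict(coeffs, x) - truth(x) for x in train_x])
print(f"train loss = {train_loss:.4f}, distance to ground truth = {truth_gap:.4f}")
```

With more points than free coefficients the fit can no longer pass through every noisy sample, so the training loss stays above zero while the fitted curve stays close to the ground truth.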
If you want your model to approximate the underlying relation between input 𝑥 and output 𝑦 as well as possible (that is, to closely match the blue ground truth curve in our toy example), then collecting a sufficient amount of data is probably your only option. Only then will your model be able to learn the actual underlying structure of the data. This is, by the way, the secret of deep learning. Many real world applications have a very complex relation between input 𝑥 and output 𝑦. However, a humongous amount of data combined with very complex (huge) neural network models allows us to learn the underlying structure quite well.
However, acquiring more data can be a very expensive and arduous task, so it is not always a quick option. Do we have other options to reduce overfitting?
If achieving the best possible performance isn’t your primary goal and you’re willing to accept some compromises, you can try the following options.
Solutions: Reducing model’s capacity
The standard method of controlling overfitting is to apply regularization techniques. All regularization techniques have in common that they impose some sort of smoothness constraint on the learned model. In our toy example, a smoother polynomial is less flexible, meaning it can bend less. As a consequence, a smoother model is able to approximate only simpler functions. In the literature, the ability of a model to approximate complex functions is often called the capacity of the model.
In our toy example, we will illustrate the reduction of the model’s capacity by lowering the degree of the polynomial. Say, we fit a polynomial of the 1st degree:
whose degree is much lower than that of the underlying ground truth model.
We see that the reduced model doesn’t have enough capacity to adapt to the noise contained in the training data. Therefore, it generalizes much better to unseen data than the large model we initially trained on the small training set (that is, before we extended the training set).
However, as you can see, the fitted model (red) does not reflect the ground truth (blue) very well. This is the compromise we were talking about earlier. When you only have a small training set, you can use regularization to make your model generalize much better to unseen data. However, you won’t be able to perfectly model the underlying relation between input 𝑥 and output 𝑦. Hence the performance of your model (w.r.t. accuracy or other metrics) will be far from optimal.
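The compromise can be sketched with a straight-line fit. The closed-form least-squares line below minimizes squared distances, which is an assumption made for brevity, and the ground truth coefficients and noise level are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Closed-form least-squares fit of a 1st degree polynomial y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def truth(x):  # hypothetical third degree ground truth
    return x**3 - 0.5 * x**2 - 0.8 * x + 0.3

rng = random.Random(3)
train_x = [-0.9, -0.3, 0.3, 0.9]
train_y = [truth(x) + rng.uniform(-0.1, 0.1) for x in train_x]

slope, intercept = fit_line(train_x, train_y)
train_loss = sum(abs(slope * x + intercept - y) for x, y in zip(train_x, train_y)) / 4
print(f"y = {slope:.3f} * x + {intercept:.3f}, train loss = {train_loss:.3f}")
```

A line has only two free coefficients, so it cannot chase the noise in four points, but it also cannot follow the curvature of the ground truth.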
Finally, let’s quickly talk about different regularization techniques. L1 and L2 are the most common types of regularization. Another interesting type of regularization is dropout, which often produces very good results and is consequently one of the most frequently used regularization techniques in the field of deep learning.
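To give a feeling for what these techniques do, here is a minimal sketch in plain Python. L1 and L2 add a penalty on the weights to the loss, while dropout randomly zeroes activations during training. The function names, the strength parameter `lam` and the "inverted dropout" rescaling are assumptions for illustration, not the API of any particular framework.

```python
import random

def l1_penalty(weights, lam):
    """L1 regularization term: lam times the sum of absolute weights."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """L2 regularization term: lam times the sum of squared weights."""
    return lam * sum(w * w for w in weights)

def dropout(activations, drop_prob, rng):
    """Inverted dropout: zero each activation with probability drop_prob and
    rescale the survivors so the expected value stays unchanged."""
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

weights = [1.0, -2.0, 3.0]
data_loss = 0.25  # hypothetical loss on the training data
print("L1-regularized loss:", data_loss + l1_penalty(weights, lam=0.1))
print("L2-regularized loss:", data_loss + l2_penalty(weights, lam=0.1))

dropped = dropout([1.0] * 10, drop_prob=0.5, rng=random.Random(0))
print("after dropout:", dropped)
```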
If the mentioned regularization techniques fail to improve the generalization capability of your model, please try reducing your model’s complexity by simply using a smaller architecture.
Solutions: Early stopping
When training neural networks, we can also apply a regularization technique called early stopping. This technique exploits the fact that neural networks are trained iteratively, using gradient descent.
At the beginning of the training, the weights of the neural network are initialized with very small numbers. Hence all intermediate outputs (in hidden layers) operate in the linear regions of the activation functions. The overall network behaves like a linear system and hence has limited capacity.
During later epochs the weights will usually grow larger in magnitude. Now, the intermediate outputs start to operate in the non-linear regions of the activation functions and the capacity of the network gradually increases.
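The effect of the weight magnitude can be checked numerically. The sketch below assumes a tanh activation (the post does not name a specific one) and measures how far the activation deviates from a purely linear response.

```python
import math

def linearity_gap(weight, inputs):
    """Largest difference between tanh(weight * x) and the linear response weight * x."""
    return max(abs(math.tanh(weight * x) - weight * x) for x in inputs)

inputs = [i / 10 for i in range(-10, 11)]  # inputs between -1 and 1

small_weight_gap = linearity_gap(0.01, inputs)
large_weight_gap = linearity_gap(3.0, inputs)
print(f"weight 0.01: gap = {small_weight_gap:.2e}")
print(f"weight 3.00: gap = {large_weight_gap:.2e}")
```

With a tiny weight the unit is practically linear, whereas with a large weight the saturation of tanh makes it strongly non-linear.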
When we see that the performance on the validation set is getting worse, we immediately stop training the model. In our initial example, in which we trained a 4th degree polynomial on the small training set, early stopping would yield:
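The stopping rule itself can be sketched in a few lines. The callables `train_step` and `validation_loss`, the `patience` parameter and the made-up validation curve are all hypothetical and only serve to show the control flow.

```python
def train_with_early_stopping(train_step, validation_loss, max_epochs=100, patience=0):
    """Run training epochs and stop once the validation loss stops improving.

    train_step(epoch) returns the model state after that epoch;
    validation_loss(state) returns the loss of that state on the validation set.
    The best state seen so far is returned, not the last one.
    """
    best_loss = float("inf")
    best_epoch = -1
    best_state = None
    bad_epochs = 0
    for epoch in range(max_epochs):
        state = train_step(epoch)
        current = validation_loss(state)
        if current < best_loss:
            best_loss, best_epoch, best_state = current, epoch, state
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs > patience:
                break  # validation loss is getting worse: stop immediately
    return best_state, best_epoch, best_loss

# Made-up validation curve: improves for a while, then rises as the model overfits.
curve = [1.0, 0.6, 0.4, 0.35, 0.5, 0.9, 1.5]
best_state, best_epoch, best_loss = train_with_early_stopping(
    train_step=lambda epoch: epoch,          # the "state" is just the epoch index here
    validation_loss=lambda state: curve[state],
    max_epochs=len(curve),
)
print(f"stopped with best epoch {best_epoch} and validation loss {best_loss}")
```

In practice a `patience` larger than zero is often used so that a single noisy epoch does not end the training prematurely.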
Solutions: Data augmentation
Data augmentation comprises various techniques used to artificially increase the size of the training set by adding slightly modified copies of the available data or by adding newly generated synthetic instances.
In our toy example, we synthesize new training samples by adding noise to the input 𝑥 and reusing the corresponding ground truth value 𝑦.
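A minimal sketch of this augmentation step could look as follows. The function name `augment`, the jitter size and the sample values are assumptions for illustration.

```python
import random

def augment(samples, copies=1, jitter=0.05, rng=None):
    """Return the original (x, y) samples plus copies with a perturbed input x.

    Each copy keeps the ground truth value y of the sample it was derived from.
    """
    rng = rng or random.Random()
    augmented = list(samples)
    for x, y in samples:
        for _ in range(copies):
            augmented.append((x + rng.uniform(-jitter, jitter), y))
    return augmented

# Hypothetical training set with four samples.
train_set = [(-0.9, 0.12), (-0.3, 0.45), (0.3, 0.07), (0.9, -0.11)]
augmented_set = augment(train_set, copies=1, jitter=0.05, rng=random.Random(0))
print(len(train_set), "->", len(augmented_set), "samples")
```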
Just like in the example demonstrating the effect of overfitting at the beginning of this post, we fit a polynomial of the 4th degree to the augmented data:
A fourth degree polynomial doesn’t have enough capacity to pass exactly through all eight training points. As a consequence, it cannot adapt to the noise, resulting in a much better fit.
Examples of data augmentation for image classification include color modification, flipping, cropping, rotation and other geometric transformations.
Solutions: Transfer Learning
Transfer learning is a machine learning method where a model trained on data from task A is partially reused on a different but related task B. Transfer learning only pays off when you have a lot more data for task A than for task B. We assume task B is the task you really want to do well on.
Transfer learning exploits the fact that neural networks exhibit a hierarchical structure (they are built up in a layer-wise fashion) and that features from early layers tend to be very similar for related tasks and hence can be reused.
In our toy example, task B, the task we want to do well on and for which we don’t have enough data, is defined by our original third degree polynomial. Task A, the task we have enough data for and from which we want to transfer knowledge, is defined by a very similar third degree polynomial:
however with a much smaller offset 𝑑.
As in the initial example above, we fit a polynomial of the 4th degree to task A, which is of a higher degree than the one used to generate the data.
Now, we transfer the knowledge acquired on task A and reuse it on task B. To this end, we take the obtained values for coefficients 𝑎, 𝑏, 𝑐 and fix them. During training on task B we adapt only the offset 𝑑. This is what is meant by partially reusing a pretrained model.
The closer the related task A, the task we’re transferring knowledge from, is to task B, the better the results we will obtain by applying transfer learning. However, don’t be reluctant to try out weights pretrained on seemingly unrelated tasks. Oftentimes the features in early layers are very general in nature, such as low-pass filters, high-pass filters, band-pass filters, Gabor filters etc. Obtaining these early filters, however, requires a lot of data, which you don’t have.
Finally, at the end of the training phase, you can also unfreeze all coefficients and train the entire system using a much, much smaller learning rate. This process is called fine-tuning and oftentimes yields an additional gain in performance.
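Both steps, adapting only the offset 𝑑 and the subsequent fine-tuning, can be sketched as follows. For simplicity the sketch uses a third degree polynomial with the four coefficients 𝑎, 𝑏, 𝑐, 𝑑 and a squared-error loss; the "pretrained" values, the task B data and the learning rate are hypothetical.

```python
def cubic(coeffs, x):
    a, b, c, d = coeffs
    return a * x**3 + b * x**2 + c * x + d

def mse(coeffs, xs, ys):
    return sum((cubic(coeffs, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def fit_offset_only(pretrained, xs, ys):
    """Keep a, b, c fixed and choose the offset d that minimizes the squared error."""
    a, b, c, _ = pretrained
    residuals = [y - cubic((a, b, c, 0.0), x) for x, y in zip(xs, ys)]
    return (a, b, c, sum(residuals) / len(residuals))

def fine_tune(coeffs, xs, ys, learning_rate=0.01, steps=200):
    """Unfreeze all coefficients and run gradient descent with a small learning rate."""
    coeffs = list(coeffs)
    n = len(xs)
    for _ in range(steps):
        errors = [cubic(coeffs, x) - y for x, y in zip(xs, ys)]
        grads = [2.0 / n * sum(e * x**p for e, x in zip(errors, xs)) for p in (3, 2, 1, 0)]
        coeffs = [w - learning_rate * g for w, g in zip(coeffs, grads)]
    return tuple(coeffs)

task_b_truth = (1.0, -0.5, -0.8, 0.3)   # hypothetical task B
pretrained = (1.02, -0.48, -0.83, 0.05)  # hypothetical result of training on task A
xs_b = [-0.5, 0.2, 0.8]                  # only three samples for task B
ys_b = [cubic(task_b_truth, x) for x in xs_b]

frozen = fit_offset_only(pretrained, xs_b, ys_b)
tuned = fine_tune(frozen, xs_b, ys_b)
print("pretrained  :", mse(pretrained, xs_b, ys_b))
print("offset only :", mse(frozen, xs_b, ys_b))
print("fine-tuned  :", mse(tuned, xs_b, ys_b))
```

Adapting a single coefficient needs very little task B data, which is exactly why the pretrained shape of the curve is so valuable here.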
Now we come to the second kind of overfitting, which is more subtle but causes as many problems as the first one: overfitting caused by epistemic uncertainty or, simply put, overfitting caused by a lack of training data.
Here we have to deal with the fact that the given task often exhibits a very high variability in its data. Think for example of the image recognition task of distinguishing between umbrellas and cigarette lighters. Both classes are very broad, since there are many different types of these objects, each with their own size, color, shape etc. This is also referred to as intra-class variation. High variability of the data often leads to a training set that underrepresents the data, which prevents the model from generalizing to unseen data, e.g. the validation or test set.
Let us again illustrate this phenomenon using a small toy example. To represent data with high variability, we will model the underlying relation between input 𝑥 and output 𝑦 by some complex looking fourth degree polynomial:
Let us assume our training set has a moderate size, which makes us feel on the safe side. What we, however, don’t know, is that due to the high complexity of the data, our training set only covers a very small subset of the entire input space.
In real world data, both epistemic and aleatory uncertainty will usually be present. For illustration purposes, however, we don’t consider aleatory uncertainty (noise) in this example.
A possible validation set could be:
Please note how the samples in the validation set come from a different subset of the input domain than the training set.
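For one-dimensional inputs, such a coverage gap is easy to detect. The following sketch (the function name and the sample values are hypothetical) reports the fraction of validation inputs that lie outside the range covered by the training set.

```python
def fraction_outside_training_range(train_x, val_x):
    """Share of validation inputs that are smaller or larger than every training input."""
    lowest, highest = min(train_x), max(train_x)
    outside = [x for x in val_x if x < lowest or x > highest]
    return len(outside) / len(val_x)

# Hypothetical inputs: training covers only the left part of the input domain.
train_x = [-1.0, -0.8, -0.6, -0.5, -0.3]
val_x = [-0.4, 0.2, 0.5, 0.9]

uncovered = fraction_outside_training_range(train_x, val_x)
print(f"{uncovered:.0%} of the validation inputs are outside the training range")
```

For high-dimensional data such as images, a simple range check like this is of course no longer sufficient.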
To demonstrate the effect of epistemic overfitting, we fit a polynomial of the 4th degree to the training set:
During the training, the model has seen only the left part of the input domain and therefore has no information about the right part of the input domain. Consequently, it can’t extrapolate to unseen data. While neural networks perform very well at interpolation tasks, it is fair to say that they cannot extrapolate. In other words, neural networks can only deal with things they have seen before.
Let us have a look at what happens on the validation set:
We see that the validation loss increases very quickly and in the end is much larger than the training loss. A rapid rise of the validation loss (even from the first epoch on) is usually a strong hint that we’re dealing with epistemic overfitting.
What instruments do we have to deal with epistemic overfitting?
Solutions: More data
Again, the simplest way (but also the most expensive one) to reduce epistemic overfitting is to increase the size of the training data. The training data must be extended in such a way that it covers the whole domain of the input data.
As mentioned before, acquiring more data can be a very expensive and arduous task, so it is not always a quick option.
Not a solution: Reducing model’s capacity
The unique characteristic of epistemic overfitting is that regularization techniques will usually either have no effect at all or even worsen the generalization. The reason is simple: making the model smoother won’t help us extrapolate to unseen data, as the following figure clearly shows:
Not a solution: Early stopping
Since early stopping is a regularization technique that also reduces the capacity of a model, it will usually not have any positive effect on the generalization. Furthermore, when faced with epistemic overfitting, the validation loss will oftentimes increase from the first epoch on. Hence, there won’t be a suitable point at which to stop the training early.
Solutions: Transfer Learning
In order to perform well on unseen data, the model would require the magical ability to extrapolate only from data seen during the training. However, this is not possible without any prior knowledge.
Transfer learning is one way to inject prior knowledge into our model. To this end, we require a related task A with a sufficient amount of data.
We fit a polynomial of the 4th degree to task A and partially reuse the pretrained model on task B. During training on task B, we fix the pretrained coefficients 𝑎, 𝑏, 𝑐 and fine-tune only the offset coefficient 𝑑. We obtain a much better fit to unseen data than before:
The closer the related task A, the task we’re transferring knowledge from, is to task B, the better results we will obtain from transfer learning.
Solutions: Feature engineering
If collecting more data is not an option and you don’t have a related task to pretrain your model on, then you may resort to hand-engineering your features. Features are (usually nonlinear) transformations of your raw data into more complex representations of the data. They are carefully chosen according to the given task/problem and are meant to improve the performance of your model on unseen data.
As in the case of transfer learning, by engineering features you are injecting some prior knowledge into your model. This may allow the model to extrapolate to unseen data. However, this approach will usually require solid domain knowledge and a lot of trial and error.
Solutions: Change of Network Architecture
To illustrate why a change of network architecture might alleviate epistemic overfitting, we will use an example from image recognition. So far we have assumed that data variability comes from intra-class variation, e.g. having chairs of multiple types in our dataset (office chair, wooden chair, armchair etc.). Oftentimes, however, a high variability in the data originates from geometric transformations, e.g. an office chair seen from different perspectives. In such cases, a change of the network architecture might help to reduce epistemic uncertainty.
Let us again use a toy example to illustrate the basic idea: imagine we’re building a system to classify different uppercase letters. The letters in our training set happen to be positioned in the upper left corner, while the letters in the validation set happen to be positioned in the lower right corner.
Now, assume we are newbies and naively choose a network made entirely of fully connected layers (dense layers). As we know, fully connected networks are not able to generalize knowledge learnt at one spatial position to other positions. As a consequence, our trained network will produce arbitrary output values for the vast majority of the validation set. If we had chosen a network architecture containing convolutional and pooling layers instead, we could have transferred the knowledge learned in the training set to the validation set.
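The difference between the two architectures can be sketched in one dimension. The "dense unit" below is a single dot product with weights tuned to the training position, while the "convolutional" variant slides a small kernel over the signal and applies global max pooling. The signals, the kernel and the function names are hypothetical simplifications.

```python
def dense_unit(weights, signal):
    """Fully connected unit: one weight per input position."""
    return sum(w * s for w, s in zip(weights, signal))

def conv_max_pool(kernel, signal):
    """Slide the kernel over the signal and keep the strongest response."""
    k = len(kernel)
    responses = [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]
    return max(responses)

pattern = [1, 2, 1]
train_signal = [1, 2, 1, 0, 0, 0, 0, 0]  # pattern at the left (seen during training)
val_signal = [0, 0, 0, 0, 0, 1, 2, 1]    # same pattern at the right (validation)

dense_weights = train_signal  # weights that respond to the pattern at the left position

dense_train = dense_unit(dense_weights, train_signal)
dense_val = dense_unit(dense_weights, val_signal)
conv_train = conv_max_pool(pattern, train_signal)
conv_val = conv_max_pool(pattern, val_signal)

print("dense:", dense_train, dense_val)  # strong response only at the trained position
print("conv :", conv_train, conv_val)    # same response at both positions
```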
The above example is contrived and the mechanism leading to epistemic uncertainty is obvious. In real world scenarios however, especially those from areas other than image processing, the transformations leading to epistemic uncertainty are usually deeply hidden and difficult to spot.
As a side note, traditional CNNs lack the property of scale and rotation invariance. They are not able to extrapolate knowledge obtained for one scale or rotation to different scales and rotations of the same object. A biologically inspired solution to achieve scale and rotation invariance is to apply the log-polar transform, as described in this paper: RetinotopicNet.
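The basic idea behind the log-polar transform can be illustrated for a single point. This is only a sketch of the coordinate transform itself, not of the method described in the paper: scaling turns into a shift along the log-radius axis and rotation turns into a shift along the angle axis.

```python
import math

def to_log_polar(x, y):
    """Map Cartesian coordinates to (log radius, angle in radians)."""
    return math.log(math.hypot(x, y)), math.atan2(y, x)

log_r, theta = to_log_polar(1.0, 1.0)
log_r_scaled, theta_scaled = to_log_polar(2.0, 2.0)      # same point scaled by a factor of 2
log_r_rotated, theta_rotated = to_log_polar(-1.0, 1.0)   # same point rotated by 90 degrees

print("scaling shifts log r by :", log_r_scaled - log_r)
print("rotation shifts theta by:", theta_rotated - theta)
```

Since convolutional layers can handle shifts, scale and rotation changes become easier to deal with in this representation.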
For the sake of completeness, we should cover one more concept, which is closely related to epistemic overfitting: data mismatch. Data mismatch simply means that your training data and your validation or test data come from different ground truth distributions.
Nowadays, this phenomenon is often encountered when models are trained mostly on data coming from the internet and only evaluated on self-acquired data representing the actual production task. It’s tempting to get a large amount of training data from the internet, but usually this data won’t be perfectly representative.
The difference between data mismatch and epistemic uncertainty is that in the latter, both training and test data are assumed to come from the same distribution. However, due to the high complexity of the underlying distribution combined with a lack of data, training data and test data only appear to come from different distributions. Collecting more data is therefore a guaranteed solution for epistemic uncertainty.
When dealing with data mismatch, the training set and the test set really do come from different distributions. Extending the training set usually won’t help to improve the generalization, unless we start to mix the data from both distributions.
As you can see, there is a smooth transition between data mismatch and epistemic uncertainty.