One of the more difficult parts of building predictive models is making sure the model is robust to outliers. A model that fits outliers too closely tends to predict poorly. Two solutions work well in most cases, and a few other methods are worth considering when the first two fail or are impractical.
Mean squared error (MSE), and therefore root mean squared error (RMSE), is very sensitive to outliers ^{1}. Sometimes this is desirable: we may want to punish an error of 20 much more than an error of 5 (400 vs. 25). It is not desirable when we want to keep outliers from having an exaggerated effect on our model.
Mean absolute error (MAE) is somewhat less sensitive to outliers because it does not square the errors. It is not fully robust to outliers, but it represents a middle ground between traditional loss functions and the methods discussed below.
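A quick numeric sketch (with made-up residuals) shows how a single outlier dominates MSE but barely moves MAE:

```python
import numpy as np

# Hypothetical residuals: several small errors, then one large outlier.
errors = np.array([2.0, -3.0, 1.0, 4.0, -2.0])
errors_with_outlier = np.append(errors, 50.0)

def mse(e):
    return np.mean(e ** 2)

def mae(e):
    return np.mean(np.abs(e))

# The single outlier inflates MSE roughly 60x, but MAE only about 4x.
print(mse(errors), mse(errors_with_outlier))  # 6.8 vs. ~422.3
print(mae(errors), mae(errors_with_outlier))  # 2.4 vs. ~10.3
```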
One of the better loss functions to choose is the Huber loss. For a residual $a$ and threshold $\delta$, it is defined as:

$$L_\delta(a) = \begin{cases} \frac{1}{2}a^2 & \text{for } |a| \le \delta \\ \delta\left(|a| - \frac{1}{2}\delta\right) & \text{for } |a| > \delta \end{cases}$$

This has the effect of combining squared errors and absolute errors. For $|a| \le \delta$, the loss is quadratic, like MSE; for $|a| > \delta$, it grows only linearly, like MAE.
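The piecewise definition above translates directly into a few lines of NumPy. This is a minimal sketch; the function name and the default $\delta = 1$ are my own choices, not a standard API:

```python
import numpy as np

def huber_loss(residuals, delta=1.0):
    """Quadratic for |a| <= delta, linear beyond it."""
    a = np.abs(residuals)
    quadratic = 0.5 * residuals ** 2
    linear = delta * (a - 0.5 * delta)
    return np.where(a <= delta, quadratic, linear)

# A small residual is squared; a large one is penalized only linearly.
print(huber_loss(np.array([0.5, 3.0])))  # [0.125, 2.5]
```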
Winsorizing the data gives results similar to the Huber loss when $\delta$ in the Huber loss equals the point at which the values are winsorized. It has a large disadvantage compared to the Huber loss, though: we are modifying the dataset itself. That leaves a choice between keeping two copies of the data (one winsorized, one not) or discarding information that may prove useful in future analysis. The former may not be practical for very large datasets.
An advantage of winsorizing data is that it makes the data easy to visualize in a meaningful way. Pandas provides `Series.clip()` and `DataFrame.clip()`, which make it easy to do.
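For example, capping a series at its 5th and 95th percentiles takes one line. The percentile cut-offs here are an illustrative choice, not a rule:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])  # 100.0 is an outlier

# Winsorize by clipping values to the 5th and 95th percentiles.
clipped = s.clip(lower=s.quantile(0.05), upper=s.quantile(0.95))
```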
Most raw data contains outliers that are almost certainly the product of measurement errors or very rare anomalies.
Some examples of this:
Even if some of those observations were accurately measured (quite unlikely), they almost surely lack external validity, so they will only hurt the performance of our model unless we remove them.
Support vector machines in particular tend to be insensitive to outliers during training. Gradient boosting and random forests also have properties that make it difficult (but not impossible) to overfit a training dataset.
Log transforms are often mooted as a good way to make outliers stand out less. But when we use a log transform, we are letting the outliers dictate how we describe all of our observations, which is just the opposite of using a robust measure.
This is not to say that log transforms should never be used. There are many situations in which a log transform is appropriate because of the behavior of all the observations, not just the outliers.
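For instance, on heavily right-skewed data a log transform compresses the whole scale rather than just trimming the extremes. A minimal sketch with made-up values:

```python
import numpy as np

values = np.array([1.0, 10.0, 100.0, 10000.0])  # heavy right skew

# log1p (log(1 + x)) compresses large values while preserving order,
# shrinking the relative gap between typical points and the outlier.
logged = np.log1p(values)
```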
This is easy to see from the mathematical definition of MSE: $\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$. Since the distance between the predicted value and the observed datapoint is squared, outliers have an exaggerated influence. ↩