Robust Regression for Data with Outliers
Linear regression works well until the data contains outliers.
What are outliers?
In the above graph, some points lie far away from the regression line; these points are called outliers. For example, if a variable follows the normal distribution, an observation that lies 3 or more standard deviations from the mean is typically considered an outlier. A dataset containing outliers can cause problems for a linear regression model, because the outliers pull the fitted line toward themselves and bias the estimated coefficients.
To overcome these problems with outliers in linear regression, robust regression is used.
The RANSAC regressor is the most commonly used robust regression algorithm. It first separates the data into inliers and outliers, and then fits the regression using only the inliers.
Scikit-learn provides the RANSAC regressor through the RANSACRegressor class:

from sklearn.linear_model import RANSACRegressor
regressor = RANSACRegressor()
This model is fitted with the x and y values to train the regressor.
We can observe the difference between the regressor line for linear regression and the RANSAC regressor in the below graph for the same set of data.
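As a minimal sketch of that comparison, we can fit both a plain linear regression and a RANSAC regressor on the same synthetic data (the data here is made up for illustration: a clean linear trend with a few injected outliers) and compare the estimated slopes:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor

rng = np.random.RandomState(0)
# Inliers follow y ≈ 2x plus a little noise
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=100)
# Inject a handful of outliers far above the trend
y[:5] += 30

linear = LinearRegression().fit(X, y)
ransac = RANSACRegressor(random_state=0).fit(X, y)

print("Linear slope:", linear.coef_[0])
# The final model fitted on the inliers is exposed as estimator_
print("RANSAC slope:", ransac.estimator_.coef_[0])
# inlier_mask_ marks which samples RANSAC treated as inliers
print("Inliers found:", ransac.inlier_mask_.sum())
```

The RANSAC slope should stay close to the true value of 2, while the ordinary least-squares slope is pulled by the outliers; `inlier_mask_` shows which samples the final model was fitted on.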
Huber regression is another type of robust regression that is aware of the possibility of outliers in a dataset and assigns them less weight than other examples in the dataset.
We can use Huber regression via the HuberRegressor class in scikit-learn. The epsilon argument controls which samples are treated as outliers: smaller values classify more of the data as outliers and, in turn, make the model more robust to them. The default is 1.35.