Regression is an entry-level topic that you absolutely MUST know.
And you’ve come to the right place.
In this article, we look at the use of regression techniques in machine learning and in particular, predictive modelling.
We will now take a brief look at eight regression techniques, some of which were developed over two centuries ago, and discuss their utility.
What is Regression Analysis?
Regression analysis can be described as a type of predictive modelling as it is used to determine the relationship between two or more variables.
One can use it to understand how changes in the value of one or more input (independent or predictor) variables, which can be controlled, affect the value of the output (dependent or target) variable.
Such algorithms find utility in time-series modelling, forecasting and for making causal inferences.
For example:
Regression analysis can be used to study the relationship between drink driving and the number of fatal road accidents.
It can also be used to study the relationship between variables measured on different scales, such as the effect of share price changes on the buying or selling of stocks.
In short, regression analysis aids market researchers, data analysts and computer scientists in identifying the best set of variables for their predictive models.
Why Regression Analysis is Important in Machine Learning
Machine learning, in particular the aspect associated with predictive modelling, is arguably nothing but applied statistics.
Both focus on decreasing the error of a model in order for it to make the most accurate predictions.
It is no surprise then that computer scientists borrowed from the well-understood field of statistics in order to develop predictive models that can be applied to machine learning.
Regression analysis is among the most widely used statistical modelling techniques in machine learning.
Advantages & Disadvantages of Regression Analysis
The major advantages of regression analysis are its ability to indicate significant relationships between the dependent variable and the independent variable(s), and to indicate the extent to which each independent variable affects the dependent variable.
Older regression models do have a few problems.
For example:
They rest on a few assumptions that, when violated, can lead to inaccuracy.
For example, they assume that the data is normally distributed (and that the sample represents the population), that the independent variables (predictors) are not correlated and are measured without error, and homoscedasticity (that the variance of the errors is constant across all observations).
Over-fitting is another disadvantage.
This occurs when we have a large number of parameters relative to the number of observations, resulting in a model that fits the training data well but fails to make accurate predictions on data outside the training dataset.
Underfitting, on the other hand, occurs when the model makes too many assumptions and, therefore, isn’t properly fit to the training data nor able to make suitable predictions on the test data.
The problems of over- and under-fitting can, however, be overcome.
A remedy for the former is to reduce the complexity of the model by removing some of the parameters or using regularised parameters, while a remedy for the latter is to add parameters to increase complexity and reduce bias.
What Each Regression Technique is Useful for
In case you’re in a hurry and just need a solution, here’s a quick look at the strengths of each of the regression techniques discussed here:
1. Linear regression: To be applied when there exists a linear relationship between independent and dependent variables.
2. Logistic regression: To be applied when the dependent variable is binary; unlike linear regression, it does not assume a linear relationship between the dependent and independent variables.
3. Polynomial regression: When it is necessary to fit curves to the data.
4. Stepwise regression: A good method for automatic variable selection.
5. Ridge regression: Excellent for overcoming the problem of overfitting.
6. Lasso regression: An excellent method for feature selection, and also a good shrinkage method.
7. Jackknife regression: To estimate the predictive ability of a model.
8. ElasticNet regression: Overcomes some limitations of ridge and lasso, particularly on large datasets with correlated variables.
1. Linear Regression
Linear Regression is the oldest and most practiced form of regression analysis.
In linear regression, the independent variable(s) (X) can be continuous or discrete, the dependent variable (Y) is continuous and the regression line is linear.
Linear regression can be simple, with just one independent variable, or have more than one independent variable.
The model describes the relationship between the variables using the best-fit straight line, represented by the equation Y = a + b*X + e, where 'a' is the Y intercept, 'b' is the slope of the regression line and 'e' is the error term.
In the figure above, a regression line shows the relationship between the independent variable ‘height’ and the dependent variable ‘weight’.
In simple linear regression, the coefficients can be computed directly from summary statistics of the data: the means, standard deviations, correlation and covariance of X and Y.
When there is more than one independent variable, however, we generally use the least squares method.
This method finds the regression line by minimising the sum of squared vertical distances (residuals) between each data point and the line.
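To make the idea concrete, here is a minimal sketch of fitting a least-squares line, assuming Python with scikit-learn and a small made-up height/weight dataset used purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical height (cm) -> weight (kg) data, invented for illustration only
X = np.array([[150], [160], [165], [170], [175], [180], [185]])
y = np.array([52, 58, 63, 67, 72, 77, 83])

model = LinearRegression()  # ordinary least squares fit
model.fit(X, y)

print("intercept (a):", model.intercept_)
print("slope (b):", model.coef_[0])
print("predicted weight at 172 cm:", model.predict([[172]])[0])
```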
When using linear regression it is worth noting its assumptions and limitations, including sensitivity to both outliers and cross-correlations (both in the variable and observation domains), and potential for over-fitting.
Outliers can drastically affect the regression line and thus the predicted values.
Multiple linear regression is limited by multicollinearity, which increases the variance of the coefficient estimates eventually resulting in unstable coefficient estimates that are sensitive to minor changes in the model.
2. Logistic Regression
The Logistic Regression method is used to find the probability of success or failure of an event when the dependent variable (Y) is binary (for example, 0 or 1, Yes or No).
It is used extensively in fraud detection, clinical trials and scoring. Unlike linear regression, this can handle non-linear relationships between variables, too.
It has the freedom to handle various types of variable relationships as it applies a non-linear log transformation to the predicted odds ratio.
In the figure above, ‘p’ is the probability that event Y occurs, ‘p/(1-p)’ is the odds ratio and ‘ln[p/(1-p)]’ is the log odds ratio, or logit.
Since the dependent variable is binary, the method assumes a binomial distribution, and the logit function is best suited to the task.
The model is fitted by maximising the likelihood of observing the sample values, rather than by minimising the sum of squared errors as in linear regression.
The logistic distribution is S-shaped and the logit link constrains the estimated probabilities to lie between 0 and 1.
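A minimal sketch of the idea, again assuming scikit-learn; the hours-studied example and its values are invented purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data: hours studied -> pass (1) / fail (0); values are made up
X = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

clf = LogisticRegression()  # fitted by maximum likelihood
clf.fit(X, y)

# Predicted probabilities are constrained to lie between 0 and 1
print(clf.predict_proba([[4.5]]))  # [[P(fail), P(pass)]]

# The model is linear in the log odds: ln[p/(1-p)] = a + b*X
a, b = clf.intercept_[0], clf.coef_[0][0]
print("log odds at X = 4.5:", a + b * 4.5)
```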
This method can avoid over-fitting and under-fitting by including all the significant variables, and only the significant variables.
This can be done by using the stepwise method (described later) to estimate logistic regression.
The model does have a few drawbacks, though.
For example,
It requires a large sample size, and it only works well when the independent variables are not correlated with one another.
3. Polynomial Regression
A regression equation is polynomial when the power of the independent variable is greater than 1.
In this method, the relationship between the variables is modelled as an nth degree polynomial of X.
For example, Y = a + b*X^2 is a polynomial equation of degree 2.
In polynomial regression, the best fit line (in the figure below) is a curve and not a straight line.
A polynomial line
Polynomial regression fits a nonlinear model to the data due to the nonlinear relationship between X and Y.
However, since it is linear from the perspective of its coefficients, polynomial regression is considered a special case of multiple linear regression.
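A brief sketch of that view, assuming scikit-learn and synthetic data: the predictor is expanded into polynomial features and an ordinary linear regression is fitted on the expanded features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Made-up data with a curved relationship between X and Y
X = np.linspace(0, 5, 20).reshape(-1, 1)
y = 2 + 0.5 * X.ravel() ** 2 + np.random.normal(0, 0.3, 20)

# Expand X into [X, X^2], then fit an ordinary linear regression on the
# expanded features -- hence "a special case of multiple linear regression"
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(X)

model = LinearRegression().fit(X_poly, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
```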
It is worth noting that it can be difficult to interpret the individual coefficients in a polynomial regression fit as the underlying monomials are sometimes highly correlated.
Although fitting a higher degree polynomial provides lower error, it can also result in overfitting.
Instead, greater focus should be put on ensuring that the curve fits the nature of the problem.
The fitted model will be more accurate with a larger sample size.
4. Stepwise Regression
This form of regression is rather interesting as the selection of independent variables is automated.
How so?
Well, what we mean is that the selection of ‘X’ or predictor variables is done without the user’s intervention.
The regression model is built from a set of candidate predictor variables such that, at each step, a variable is either added to or removed from the model.
This stepwise process continues until there is no longer a justifiable reason to add or remove any more variables based on the predefined criterion.
More often than not, this is achieved by observing statistical values such as F-tests, R-squared, t-statistics or a number of other criteria in order to discern significant variables.
This model aims to maximise predictive capability while using the least amount of predictor variables.
Listed below are three of the preferred stepwise regression methods:
- Standard stepwise regression: This method simply adds and removes X variables as needed for each step.
- Forward selection: This method starts with the most significant predictor variable in the model and then continues to add variables at each step (a sketch of this approach follows the list).
- Backward elimination: This method starts with all predictors in the model and then continues to remove the least significant variable found in each step.
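As one possible illustration of forward selection, here is a minimal sketch assuming scikit-learn's SequentialFeatureSelector and its bundled diabetes dataset; classical stepwise procedures based on F-tests or p-values differ in the selection criterion, but the add-one-variable-at-a-time idea is the same.

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True, as_frame=True)

# Greedily add the predictor that most improves the cross-validated fit,
# stopping once the requested number of features is reached
selector = SequentialFeatureSelector(
    LinearRegression(), n_features_to_select=4, direction="forward"
)
selector.fit(X, y)

print("selected predictors:", list(X.columns[selector.get_support()]))
```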
Stepwise regression is used in data mining, but it is not without drawbacks.
For instance, the models are generally oversimplified and the tests are biased as they are based on seen data.
Also it is important that all the variables that actually predict outcomes are included in the list of candidate predictor variables.
If, however, any of these predictors are dropped from the list, we will very likely end up with a regression model that is underspecified and misleading.
5. Ridge Regression
When the independent variables are highly correlated with one another, or the number of independent variables exceeds the number of observations, the data is said to exhibit multicollinearity.
In such cases, ridge regression (also referred to as weight decay in machine learning) is a well-suited technique for predictive modelling.
Multicollinearity inflates the variances of the coefficient estimates, so although the ordinary least squares estimates remain unbiased, the estimated values may sit far from the true values.
This method adds a degree of bias to the regression estimates in order to reduce standard errors.
Ridge regression is a regularised version of linear regression that puts constraints on the regression coefficients by means of a penalty term.
This shrinks the coefficients, making them less prone to overfitting and, thus, easier to interpret.
Now in linear regression, prediction errors are either due to bias, variance or both.
In ridge regression, the focus is on remedying the errors caused by variance.
Here, multicollinearity is remedied through a shrinkage parameter or penalty term λ (lambda).
It is worth noting that ridge regression is a regularisation method that shrinks the parameters; its primary use is to counteract the effects of multicollinearity.
That said, coefficient shrinkage also reduces the overall complexity of the model.
Also, its assumptions are the same as those of ordinary least squares (linear) regression, with the exception that normality is not assumed.
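A minimal sketch of the effect, assuming scikit-learn and two deliberately near-duplicate predictors generated purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)

# Two nearly identical (highly correlated) predictors -> multicollinearity
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.5, size=100)

print("OLS coefficients:  ", LinearRegression().fit(X, y).coef_)

# The penalty term (alpha here plays the role of lambda) shrinks the
# coefficients and stabilises them in the presence of multicollinearity
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)
```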
6. LASSO Regression
Least Absolute Shrinkage and Selection Operator regression, or LASSO for short, is quite similar to ridge regression in that it too constrains the regression coefficients through penalties.
However, unlike ridge regression, which penalises the squares of the coefficients, this method penalises the absolute values of the regression coefficients.
Penalising absolute values results in some of the parameter estimates being exactly zero.
Thus, variable selection happens automatically: of the n given variables, only those with non-zero coefficients remain in the model.
Due to LASSO regression’s automated feature selection, it is generally the model of choice when there are a large number of variables.
Correlated variables can pose a problem for LASSO regression, as only one of them is retained while the others are set to zero.
This unfortunately can lead to loss of information, thereby, reducing the accuracy of the model.
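A minimal sketch of the shrinkage-to-zero behaviour, assuming scikit-learn and synthetic data in which only two of ten candidate predictors actually matter:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Ten candidate predictors, but only the first two influence y
X = rng.normal(size=(200, 10))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# The absolute-value (L1) penalty drives the coefficients of
# irrelevant predictors to exactly zero
lasso = Lasso(alpha=0.1).fit(X, y)
print("coefficients:", np.round(lasso.coef_, 2))
print("selected predictors:", np.nonzero(lasso.coef_)[0])
```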
7. Jackknife Regression
Jackknife regression can be used to evaluate the quality of the predictions of regression models.
When algorithms (say a machine learning model) use a large number of parameters relative to observations they become prone to overfitting (that is, they are good at predicting data that is within the training dataset but poor at predicting test data).
Such models tend to be complex and difficult to evaluate.
The jackknife method is particularly useful in such cases as it can be used to estimate the actual predictive power of such models.
It does so by predicting the dependent variable value of each observation as if that observation were new and not in the dataset.
Put simply, it trains the model by excluding one observation each time and then testing the model on that observation.
Thus by systematically excluding each observation from the dataset one at a time, then calculating the estimate, and finally finding the average of these calculations, the procedure is capable of obtaining an unbiased prediction with relatively low overfitting.
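A minimal sketch of that leave-one-out procedure, assuming scikit-learn; a plain linear regression on the bundled diabetes dataset stands in here for whatever model is being evaluated:

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict

X, y = load_diabetes(return_X_y=True)

# For each observation: fit the model on all the other observations,
# then predict the held-out one as if it were new data
preds = cross_val_predict(LinearRegression(), X, y, cv=LeaveOneOut())

print("leave-one-out RMSE:", np.sqrt(np.mean((y - preds) ** 2)))
```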
This method of regression is effective at clustering and data reduction.
It is well suited to black-box predictive algorithms – these are usually proprietary algorithms where the end users are only aware of the inputs and outputs, not the internal working of the algorithms.
Moreover, jackknife regression is robust when the assumptions of traditional regression, including non-correlated variables, normally distributed data and homoscedasticity (constant error variance across all values of the independent variables), are violated.
8. ElasticNet Regression
ElasticNet Regression is something of an update on ridge and lasso regression that overcomes the limitations of its predecessors.
Some machine learning experts would recommend picking it over ridge and lasso in any situation, but it is better known for its results when working with large datasets.
Let’s go over it very briefly.
Let’s assume you have a few independent variables that are correlated. Instead of keeping them separate, ElasticNet seeks to combine them by forming a group.
This is useful because, even if one variable in the group has a significant relationship with the dependent variable, the entire group will find space in the model.
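A minimal sketch of the grouping effect, assuming scikit-learn and three deliberately correlated predictors generated purely for illustration:

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)

# Three strongly correlated predictors that all carry the same signal
x = rng.normal(size=200)
X = np.column_stack([x + rng.normal(scale=0.05, size=200) for _ in range(3)])
y = 3 * x + rng.normal(scale=0.5, size=200)

# LASSO tends to concentrate the weight on one of the correlated predictors;
# ElasticNet (l1_ratio mixes the L1 and L2 penalties) spreads it across the group
print("LASSO:     ", np.round(Lasso(alpha=0.1).fit(X, y).coef_, 2))
print("ElasticNet:", np.round(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_, 2))
```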
Conclusion
While you are now aware of eight regression techniques, you're probably wondering if there is a way to know when you should pick one over the other.
Unfortunately, there isn’t.
And there are several other techniques (Bayesian, ecological, logic regression and so on). For now, though, as a beginner, try your hand at the ones we've discussed, and you'll teach yourself to decide which one is useful for a given dataset.