DA3. Regression Limits
Statement
Regression is a statistical technique that uses past data to make future predictions. Conduct research on the internet, in the University of the People library, or in other academically reliable resources (the Directory of Open Access Journals and Google Scholar are good options) and develop a post that explains the key limitations of regression. Part of your response should address the limitation of the 'range' that regression analysis can predict.
Solution
Linear regression is a simple, powerful, and widely used tool for predicting a response from one or more predictor variables, and many other methods are, in fact, generalizations or extensions of linear regression. However, regression rests on strong assumptions, and according to James et al. (2013, p. 92), it has several potential problems:
- Non-linearity of the response-predictor relationships.
- Correlation of error terms.
- Non-constant variance of error terms.
- Outliers.
- High-leverage points.
- Collinearity.
One of the assumptions of linear regression is that the relationship between the response and the predictors is linear; if the true relationship is not linear, the predictions can be misleading. Residual plots help identify non-linear relationships, and an appropriate transformation of the predictors (such as log X, √X, or X²) can then address the issue, but it is the data scientist's responsibility to detect the non-linearity and choose the transformation.
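As a small sketch of this diagnostic (the data below is synthetic, and the quadratic relationship is an assumption chosen purely for illustration), fitting a straight line to curved data leaves a systematic pattern in the residuals rather than random scatter:

```python
import numpy as np

# Hypothetical data: the true relationship is quadratic, not linear.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = x**2 + rng.normal(0, 1, size=x.size)

# Fit a straight line (degree-1 polynomial) by least squares.
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# With a truly linear relationship, residuals scatter randomly around zero.
# Here they form a clear U-shape: negative in the middle, positive at the ends.
mid = residuals[(x > 3) & (x < 7)].mean()
ends = residuals[(x < 2) | (x > 8)].mean()
print(mid < 0 < ends)  # True: the curved pattern flags non-linearity
```

A residual plot of this fit would show the same U-shape visually; refitting with an x² term makes the pattern disappear.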
Another assumption of linear regression is that the error terms are uncorrelated, that is, the error term for one observation is independent of the error term for any other observation. If the error terms are correlated, we must take this relationship between errors into account when estimating the model coefficients. If we do not, the standard errors will be underestimated, leading to confidence and prediction intervals that are narrower than they should be (James et al., 2013, p. 94).
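A quick way to check this assumption is to look at the lag-1 autocorrelation of the residuals (the idea behind the Durbin-Watson test). Here is a minimal sketch, assuming synthetic time-series data whose errors follow an AR(1) process, a classic violation:

```python
import numpy as np

# Hypothetical data: errors follow an AR(1) process, so consecutive
# error terms are strongly correlated (violating the assumption).
rng = np.random.default_rng(1)
n = 200
t = np.arange(n, dtype=float)
errors = np.zeros(n)
for i in range(1, n):
    errors[i] = 0.8 * errors[i - 1] + rng.normal(0, 1)
y = 2.0 * t + errors

slope, intercept = np.polyfit(t, y, 1)
residuals = y - (slope * t + intercept)

# Lag-1 autocorrelation of the residuals: near 0 for independent errors,
# well above 0 here because consecutive errors move together.
r = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
print(r > 0.5)  # True for this correlated-error data
```

With independent errors the same statistic would hover near zero, so a large value is a warning that the standard errors from ordinary least squares cannot be trusted.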
Another assumption of linear regression is that the error terms have a constant variance. If they do not, the standard errors, confidence intervals, and hypothesis tests for the regression coefficients will be unreliable. The problem can be dealt with by transforming the response to stabilize the variance, or by using weighted least squares (James et al., 2013, p. 96).
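The weighted least squares remedy can be sketched as follows, assuming synthetic data where the error spread grows with the predictor and where the weights 1/x² (the inverse of the assumed error variance) are known for illustration:

```python
import numpy as np

# Hypothetical data with heteroscedastic noise: the error standard
# deviation is proportional to x.
rng = np.random.default_rng(2)
x = np.linspace(1, 10, 200)
y = 3.0 * x + rng.normal(0, x)

# Ordinary least squares treats every observation equally.
X = np.column_stack([np.ones_like(x), x])
ols_coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Weighted least squares: weight each point by 1/variance (here 1/x**2),
# so the noisy high-x observations count for less.
W = np.diag(1.0 / x**2)
wls_coef = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print(ols_coef[1], wls_coef[1])  # both estimates are close to the true slope 3
```

Both estimators recover the slope on average, but the weighted fit has the smaller standard error, which is exactly what restores valid confidence intervals.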
Outliers are observations whose response value is unusual given their predictor values, while high-leverage points are observations whose predictor values are unusual relative to the rest of the data. Both can change the regression statistics dramatically; if there is strong evidence that such values are due to errors, they can be removed, but with care.
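High-leverage points can be found numerically from the diagonal of the hat matrix. A minimal sketch, using a made-up design with one deliberately extreme x value (the 2p/n threshold is a common rule of thumb, not a hard rule):

```python
import numpy as np

# Hypothetical design: 20 ordinary points plus one point far out in x.
x = np.append(np.linspace(0, 1, 20), 10.0)  # index 20 has an extreme x
X = np.column_stack([np.ones_like(x), x])

# Leverage values are the diagonal of the hat matrix H = X (X'X)^-1 X'.
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

# Rule of thumb: flag points whose leverage exceeds 2p/n.
n, p = X.shape
flagged = np.where(leverage > 2 * p / n)[0]
print(flagged)  # [20] -- only the extreme point is flagged
```

The flagged point has leverage near 1, meaning the fitted line is forced to pass almost exactly through it, which is why a single such observation can swing the whole fit.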
The range problem is an interesting one: all of the model's statistics are based on observations within the range of the input data, but the relationship between the response and the predictors may change outside that range (Frost, n.d.). Although we can compute a predicted response for values outside the input data range, we cannot be confident in the accuracy of such an extrapolation; at best the model produces a prediction interval, a range of values within which we expect the actual value to fall, say, 95% of the time, and these intervals grow wider as we move away from the observed data.
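The extrapolation danger can be demonstrated with a toy example (all values here are invented, and the saturating "true process" is an assumption chosen to make the point): a line fitted on a limited range predicts well inside that range and badly outside it.

```python
import numpy as np

# Hypothetical data: the relationship is linear only on the observed
# range [0, 5]; beyond it the true process levels off (saturates).
rng = np.random.default_rng(4)
x_train = np.linspace(0, 5, 50)
y_train = 2.0 * x_train + rng.normal(0, 0.2, size=x_train.size)

slope, intercept = np.polyfit(x_train, y_train, 1)

def true_process(x):
    # Outside the training range the process saturates at 10.
    return np.minimum(2.0 * x, 10.0)

# Interpolation (inside the observed range) is accurate;
# extrapolation (far outside it) is badly wrong.
inside_err = abs((slope * 2.5 + intercept) - true_process(2.5))
outside_err = abs((slope * 10.0 + intercept) - true_process(10.0))
print(inside_err, outside_err)  # small error inside, large error outside
```

Nothing in the training data warns the model about the saturation, which is precisely why predictions should be trusted only within the range of the observed predictors.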
References
- James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning with applications in R. New York, NY: Springer. Chapter 3: Linear regression.
- Frost, J. (n.d.). Making predictions with regression analysis. Statistics By Jim. https://statisticsbyjim.com/regression/predictions-regression/