3. Regression
Introduction
Linear Regression:
A simple approach for supervised learning.
Useful for predicting a quantitative response.
Many tools are generalizations or extensions of linear regression.
Simple Linear Regression:
Assumes an approximately linear relationship between the response Y and a single predictor X.
Denoted by: Y ≈ β0 + β1 X.
Y = response or dependent variable (the variable that we want to predict).
X = predictor or independent variable (we have data on this variable).
β0 and β1 = intercept and slope (regression model coefficients or parameters).
ŷ = β̂0 + β̂1 x is the prediction for Y based on the value of X = x.
The hat symbol (ˆ) denotes an estimated value or prediction.
Residual: the difference between an observed value and its predicted value (e = y - ŷ).
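A minimal Python sketch of this definition (the coefficient values and the (x, y) observation below are made up for illustration):

```python
# Hypothetical estimated coefficients and one observed (x, y) pair.
b0_hat, b1_hat = 2.0, 0.5
x, y = 10.0, 7.5

y_hat = b0_hat + b1_hat * x   # prediction: y_hat = b0_hat + b1_hat * x
residual = y - y_hat          # residual: e = y - y_hat

print(y_hat, residual)        # 7.0 0.5
```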
Residual Sum of Squares (RSS):
The sum of the squared residuals (RSS = e1^2 + e2^2 + … + en^2).
RSS = (y1 - β0 - β1 x1)^2 + (y2 - β0 - β1 x2)^2 + … + (yn - β0 - β1 xn)^2.
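A short sketch of RSS as a function of candidate coefficients, on toy data invented for illustration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

def rss(b0, b1):
    residuals = y - (b0 + b1 * x)   # e_i = y_i - b0 - b1 * x_i
    return np.sum(residuals ** 2)   # sum of squared residuals

# RSS for one candidate (b0, b1); least squares minimizes this quantity.
print(rss(0.0, 2.0))
```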
Least Squares:
A method for estimating the unknown parameters (β0, β1) in a linear regression model.
It chooses β̂0 and β̂1 to minimize the RSS.
β̂0 = ȳ - β̂1 x̄ (where x̄ and ȳ are the sample means).
β̂1 = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)^2.
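A minimal sketch of these two closed-form estimates, reusing the toy data above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

x_bar, y_bar = x.mean(), y.mean()

# Slope: sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
b1_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
# Intercept: y_bar - b1_hat * x_bar
b0_hat = y_bar - b1_hat * x_bar

print(b0_hat, b1_hat)  # the least squares line: y = b0_hat + b1_hat * x
```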
Population regression line:
Represents the true relationship between X and Y, assumed to be linear: Y = β0 + β1 X + ε (where ε is a random error term).
β0 is the intercept: the expected value of Y when X = 0.
β1 is the slope: the average increase in Y associated with a one-unit increase in X.
ε is a catch-all for what we miss with this simple model, such as the effect of other variables on Y, but we assume this error is independent of X.
The least squares line does not include the error term ε, but the population regression line does.
In practice, we do not know the population regression line, so we use the least squares line as an estimate.
Bias: the difference between the expected value of an estimator (the sample mean, for example) and the true population parameter.
Unbiased estimator: an estimator that does not systematically over- or under-estimate the true parameter.
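A quick simulation sketch of unbiasedness, using the sample mean as the estimator (the population parameters below are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 5.0  # assumed true population mean

# Draw many samples and compute the sample mean of each one.
estimates = [rng.normal(mu, 2.0, size=30).mean() for _ in range(10_000)]

# Averaged over many samples, the sample mean lands close to mu:
# it does not systematically over- or under-estimate the true value.
print(np.mean(estimates))
```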
Multiple Linear Regression:
A linear regression that has more than one predictor.
Each predictor Xj gets its own slope coefficient βj, giving the single model Y = β0 + β1 X1 + β2 X2 + … + βp Xp + ε.
Each βj represents the average effect on Y of a one-unit increase in Xj, holding all other predictors fixed.
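A minimal sketch of fitting a multiple linear regression by least squares with NumPy, on invented data with two predictors:

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])
y = np.array([6.1, 5.0, 12.2, 11.1, 16.0])

# Prepend a column of ones so the first coefficient is the intercept b0.
X_design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(coef)  # [b0_hat, b1_hat, b2_hat]
```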
Potential problems with regression:
Non-linearity of the response-predictor relationships: residual plots can help identify non-linear relationships (see the sketch after this list).
Correlation of error terms: if the error terms are correlated, we may have an unwarranted sense of confidence in our model.
Non-constant variance of error terms.
Outliers.
High-leverage points.
Collinearity.
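A sketch of the residual-plot diagnostic from the first item: fitting a straight line to synthetic data generated from a quadratic truth leaves an obvious U-shaped pattern in the residuals:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = 1.0 + 0.5 * x**2 + rng.normal(0, 2.0, size=x.size)  # non-linear truth

b1, b0 = np.polyfit(x, y, deg=1)   # fit a straight line anyway
fitted = b0 + b1 * x
residuals = y - fitted

plt.scatter(fitted, residuals)     # residuals vs. fitted values
plt.axhline(0, color="red")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()                         # a U-shape signals non-linearity
```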
KNN (K-Nearest Neighbors) Regression:
A non-parametric alternative to linear regression: it predicts the response at a target point x0 by averaging the responses of the K training observations closest to x0.
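A minimal sketch of KNN regression in one dimension, using toy data and plain absolute distance:

```python
import numpy as np

def knn_predict(x_train, y_train, x0, k=3):
    distances = np.abs(x_train - x0)     # distance from x0 to each training point
    nearest = np.argsort(distances)[:k]  # indices of the K nearest neighbors
    return y_train[nearest].mean()       # average their responses

x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

print(knn_predict(x_train, y_train, x0=2.5, k=3))
```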
References
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R. New York, NY: Springer. Chapter 3: Linear Regression.