Linear Regression

Charan H U
5 min read · Dec 8, 2022


Linear regression is a statistical method for finding the straight line or hyperplane that best fits a set of data points.

It is a widely used predictive modeling technique that assumes a linear relationship between the input variables (x) and the single output variable (y).

In linear regression, the model specification is that the dependent variable (y) is a linear combination of the independent variables (x). This is expressed mathematically as:

y = β0 + β1 * x1 + β2 * x2 + ... + βn * xn

where y is the dependent variable, β0 is the intercept term, β1, β2, ..., βn are the coefficients or weights associated with each independent variable, and x1, x2, ..., xn are the independent variables.

To find the best fit line or hyperplane, the linear regression model estimates the coefficients (β1, β2, ..., βn) that minimize the sum of the squared differences between the observed dependent variable (y) and the predicted dependent variable (ŷ). This process is known as "ordinary least squares" (OLS) estimation.
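
As a concrete sketch, the OLS coefficients can be computed with NumPy's least squares solver (equivalent to solving the normal equation). The data below is synthetic and purely illustrative:

```python
import numpy as np

# Synthetic data following y = 2 + 3x plus noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, size=100)

# Design matrix with a leading column of ones for the intercept β0
X = np.column_stack([np.ones_like(x), x])

# OLS: minimize the sum of squared residuals ||y - Xβ||²
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # approximately [2.0, 3.0]: estimated β0 and β1
```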

Once the coefficients have been estimated, the linear regression model can be used to make predictions on new data. This is done by plugging the estimated coefficients and the independent variables of the new data into the model equation:

ŷ = β0 + β1 * x1 + β2 * x2 + ... + βn * xn

The predicted value of the dependent variable (ŷ) represents the best estimate of the true value of the dependent variable (y) based on the given independent variables (x).
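
In practice, a library such as scikit-learn performs the estimation and prediction in two calls. A minimal sketch, using made-up training data for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative training data with two independent variables,
# generated from y = 1 + 2*x1 + 3*x2
X_train = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y_train = np.array([9.0, 8.0, 19.0, 18.0])

model = LinearRegression().fit(X_train, y_train)
print(model.intercept_, model.coef_)  # estimated β0 and [β1, β2]

# Plug the independent variables of new data into the fitted equation to get ŷ
print(model.predict(np.array([[5.0, 5.0]])))
```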

Linear regression is a powerful tool for predictive modeling, but it has some limitations. One of the main limitations is that it only works well when the relationship between the independent and dependent variables is linear. If the relationship is non-linear, a more sophisticated model may be needed.

Optimisation equations

The equation for linear regression is typically expressed as:

y = mx + b

where y is the dependent variable (the thing we are trying to predict), x is the independent variable (the thing we are using to make predictions), m is the slope of the line, and b is the y-intercept (the point where the line crosses the y-axis).

To find the optimal values for m and b, we can use a technique called gradient descent. This involves starting with initial guesses for m and b, and then iteratively updating these values using the following equations:

m = m - learning_rate * dE/dm
b = b - learning_rate * dE/db

where learning_rate is a parameter that determines how big a step we take against the gradient, E is the error function, and dE/dm and dE/db are the partial derivatives of the error function with respect to m and b, respectively.

The error function we use in linear regression is typically the mean squared error, which is the average squared difference between the predicted values and the true values. This is given by the following equation:

error = 1/n * sum((y_true - y_pred)^2)

where y_true is the true value of the dependent variable, y_pred is the predicted value, and n is the number of observations.
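
Translated directly into NumPy, as a minimal sketch:

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(mean_squared_error([3.0, 5.0], [2.5, 5.5]))  # 0.25
```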

To find the partial derivatives of the error function with respect to m and b, we can use the following equations:

dE/dm = -2/n * sum(x * (y_true - y_pred))
dE/db = -2/n * sum(y_true - y_pred)

We can then plug these equations into the equations for updating m and b, and iterate until the error function reaches a minimum. This will give us the optimal values for m and b, which will produce the best possible predictions using a linear model.
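
Putting the update rule, the error function, and its derivatives together gives a complete gradient descent loop. A minimal sketch; the learning rate, iteration count, and data are arbitrary illustrative choices:

```python
import numpy as np

def fit_line(x, y, learning_rate=0.05, n_iters=2000):
    """Fit y ≈ m*x + b by gradient descent on the mean squared error."""
    m, b = 0.0, 0.0  # initial guesses
    n = len(x)
    for _ in range(n_iters):
        y_pred = m * x + b
        # Partial derivatives of the MSE with respect to m and b
        dE_dm = (-2.0 / n) * np.sum(x * (y - y_pred))
        dE_db = (-2.0 / n) * np.sum(y - y_pred)
        # Step against the gradient to reduce the error
        m -= learning_rate * dE_dm
        b -= learning_rate * dE_db
    return m, b

# Illustrative data drawn from y = 3x + 2 plus noise
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=200)
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.5, size=200)
print(fit_line(x, y))  # approximately (3.0, 2.0)
```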

Best and Worst cases

Linear regression works best when the relationship between the dependent and independent variables is strong and linear. This means that as the independent variable increases or decreases, the dependent variable should also increase or decrease at a consistent rate.

Linear regression will perform poorly if the relationship between the variables is weak or non-linear. For example, if the dependent variable only increases when the independent variable is above a certain threshold, linear regression will not be able to capture this relationship.

Additionally, linear regression can be sensitive to outlier data points. If there are a few points in the dataset that are very different from the rest, these points can have a disproportionate impact on the regression line and lead to poor performance.

Overall, the best case for linear regression is a dataset with a strong, linear relationship and no outliers. The worst case is a dataset with a weak or non-linear relationship and significant outliers.

Bias vs variance trade-off

In linear regression, bias and variance are two sources of error that affect the performance of the model. Bias refers to the systematic difference between the model's predictions and the true values. A high-bias model consistently under- or over-predicts, leading to poor performance.

Variance, on the other hand, refers to how much the model's predictions would change if it were trained on different samples of the data. High variance causes unstable, unpredictable predictions, leading to poor performance on new data.

There is a trade-off between bias and variance in linear regression. A model with low bias and high variance will over-fit the data, meaning it will match the training data very well but may not generalize well to new data. A model with high bias and low variance will under-fit the data, meaning it will not match the training data well and will also not generalize well to new data.

The goal in linear regression is to find a balance between bias and variance that produces a model that fits the data well and generalizes well to new data. This can be achieved by using regularization techniques, which penalize certain model parameters to prevent over-fitting.
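
For example, ridge regression adds an L2 penalty on the coefficients, trading a little extra bias for lower variance. A short sketch with scikit-learn; the alpha value and data are illustrative:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Illustrative one-feature dataset
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 7.8])

# alpha sets the strength of the L2 penalty on the coefficients:
# larger alpha shrinks the weights, adding bias but reducing variance
model = Ridge(alpha=1.0).fit(X, y)
print(model.coef_, model.intercept_)
```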

Applications

Linear regression is a widely used statistical technique for modeling the relationship between a continuous dependent variable and one or more independent variables. It is used in many real-world applications, including:

Economics: Linear regression can be used to model the relationship between economic indicators, such as gross domestic product (GDP) and unemployment rate.

Finance: Linear regression can be used to model the relationship between stock prices and various factors, such as earnings, dividends, and interest rates.

Medicine: Linear regression can be used to model the relationship between medical test results and various factors, such as age, gender, and lifestyle.

Sports: Linear regression can be used to model the relationship between athlete performance and various factors, such as training, nutrition, and sleep.

Overall, linear regression is a powerful tool for understanding the relationship between different variables and making predictions based on that relationship. It is widely used in many different fields and can provide valuable insights into complex real-world systems.
