Linear Regression

Introduction

Linear Regression is a supervised machine learning algorithm that finds the best-fit line between a dependent variable and one or more independent variables. In other words, linear regression models a linear relationship between the dependent and independent variables.

It is generally used to predict the trend hidden in the features, and its output is continuous.

The best-fit line is the one for which the total prediction error over all data points is as small as possible, where the error is the distance from a data point to the regression line.

Linear Regression is of two types:

  1. Simple Linear Regression: In simple linear regression, the model finds the best-fit line between one target feature and one predictor feature.

  2. Multiple Linear Regression: In multiple linear regression, there is one target feature and more than one predictor feature.

How Linear Regression Works

Mathematically, a simple linear regression model can be represented as: $$Y = a_0 + a_1X$$ Here,

  • \(Y\) = Target/Dependent Variable
  • \(X\) = Predictor/Independent Variable
  • \(a_1\) = Coefficient or Slope
  • \(a_0\) = Y - Intercept

This equation might look familiar: it is the equation of a straight line in slope-intercept form.
A linear regression model trains on the \(X\) and \(Y\) data to estimate the values of \(a_0\) and \(a_1\), which are then stored for future predictions.
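
In practice, you rarely solve for \(a_0\) and \(a_1\) by hand. As a quick illustration, the sketch below fits a simple linear regression with scikit-learn's `LinearRegression` on a small synthetic dataset; the data and the true coefficients (intercept 2, slope 3) are made up purely for demonstration.

```python
# Minimal sketch: simple linear regression with scikit-learn on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

# One predictor X and one target Y following roughly Y = 2 + 3*X + noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(50, 1))
Y = 2 + 3 * X[:, 0] + rng.normal(0, 1, size=50)

model = LinearRegression()
model.fit(X, Y)                              # estimates a_0 (intercept) and a_1 (slope)

print("a_0 (intercept):", model.intercept_)  # close to 2
print("a_1 (slope):", model.coef_[0])        # close to 3
print("prediction at X = 4:", model.predict([[4]])[0])
```

Note that `LinearRegression` finds the coefficients with an ordinary least squares solver; the gradient descent procedure described later is another way to arrive at essentially the same values.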

Cost Function

Errors are calculated between the actual values of the target variable (\(Y\)) and the predicted values of the target variable (\(\hat{Y}\)).

Error = Actual value (\(Y\)) - Predicted value (\(\hat{Y}\))
Sum of Errors = \(\sum(Y - \hat{Y})\)
Sum of Squared Errors = \(\sum(Y - \hat{Y})^2\)

Mathematically, \[ Cost(J) = \sum_i (Y_i - \hat{Y_i})^2 \] If the error between the actual and predicted values is high, we have to minimize it to get the best model. This can be done using the Gradient Descent technique.
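
As a small sketch in code, the cost for a candidate pair \((a_0, a_1)\) is just the sum of squared errors between the actual and predicted values; the data and parameter values below are arbitrary and only for illustration.

```python
# Sketch: squared-error cost for a candidate line Y_hat = a0 + a1 * X.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

def cost(a0, a1, X, Y):
    Y_hat = a0 + a1 * X              # predicted values for these parameters
    return np.sum((Y - Y_hat) ** 2)  # sum of squared errors

print(cost(1.0, 2.0, X, Y))  # cost of the line Y = 1 + 2X (small for this data)
print(cost(0.5, 1.0, X, Y))  # a worse line gives a much larger cost
```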

Cost Minimization/Gradient Descent

Gradient Descent is used to minimize the error or cost of our model. To understand gradient descent in simple terms, think of a goat at the top of a mountain that wants to climb down. It can't just jump off the mountain; it has to descend in the direction of the slope, one step at a time. This is the analogy used in gradient descent: each step of the goat corresponds to one iteration in which we update the parameters.

[PICTURE]

Learning Rate is very important in gradient descent. Denoted by alpha (α), it determines the size of the step the goat takes at a time. If the step is too large, we may overshoot the minimum; if it is too small, it may take many iterations to arrive at the minimum.

[PICTURE]
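
To see the effect of the learning rate concretely, the tiny sketch below runs the same kind of gradient update on a one-dimensional cost \(J(a) = (a - 3)^2\) with two different step sizes; the numbers are arbitrary and chosen only to illustrate the point.

```python
# Sketch: effect of the learning rate on the 1-D cost J(a) = (a - 3)^2.
def descend(alpha, steps=20, a=0.0):
    for _ in range(steps):
        grad = 2 * (a - 3)   # derivative of (a - 3)^2 with respect to a
        a -= alpha * grad    # gradient descent update
    return a

print(descend(alpha=0.1))  # reasonable step: a approaches the minimum at 3
print(descend(alpha=1.1))  # too large a step: the updates overshoot and diverge
```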

Now you may be wondering how we actually minimize the error using gradient descent. Let's see:

Here, we'll need some knowledge of calculus. Taking the cost as the mean of the squared errors, \[ J = \frac{1}{n} \sum_{i=1}^{n} (pred_i - y_i)^2 \]

\[ J = \frac{1}{n} \sum_{i=1}^{n} ( a_0 + a_1 \cdot x_i - y_i )^2 \]

\[ \frac{\partial J}{\partial a_0} = \frac{2}{n} \sum_{i=1}^{n} (a_0 + a_1 \cdot x_i - y_i) \Rightarrow \frac{\partial J}{\partial a_0} = \frac{2}{n} \sum_{i=1}^{n} ( pred_i - y_i) \]

\[ \frac{\partial J}{\partial a_1} = \frac{2}{n} \sum_{i=1}^{n} (a_0 + a_1 \cdot x_i - y_i) \cdot x_i \Rightarrow \frac{\partial J}{\partial a_1} = \frac{2}{n} \sum_{i=1}^{n} ( pred_i - y_i) \cdot x_i \]

To minimize the error, we iteratively update the values of \(a_0\) and \(a_1\) using the gradients of the cost function, obtained by taking its partial derivatives with respect to \(a_0\) and \(a_1\) as shown above. At each iteration, the previous values of \(a_0\) and \(a_1\) are updated using these gradients.

We also have to specify a learning rate (α) while updating the parameters. As discussed above, the learning rate is the size of the step that gradient descent takes toward the minimum.

\[ a_0 = a_0 - \alpha \cdot \frac{2}{n} \sum_{i=1}^{n} (pred_i - y_i) \]

\[ a_1 = a_1 - \alpha \cdot \frac{2}{n} \sum_{i=1}^{n} (pred_i - y_i) \cdot x_i \]
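
Putting the update rules together, a minimal batch gradient descent loop for simple linear regression might look like the sketch below; the toy data, learning rate, and iteration count are illustrative choices rather than recommended settings.

```python
# Sketch: batch gradient descent for simple linear regression.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

a0, a1 = 0.0, 0.0   # arbitrary initial parameters
alpha = 0.01        # learning rate
n = len(X)

for _ in range(5000):
    pred = a0 + a1 * X                          # current predictions
    grad_a0 = (2 / n) * np.sum(pred - Y)        # partial derivative dJ/da0
    grad_a1 = (2 / n) * np.sum((pred - Y) * X)  # partial derivative dJ/da1
    a0 -= alpha * grad_a0                       # update the intercept
    a1 -= alpha * grad_a1                       # update the slope

print("a_0:", a0, "a_1:", a1)  # should approach the best-fit intercept and slope
```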

Performance Metrics

To understand the performance of a regression model, model evaluation is necessary. Some of the evaluation metrics used for regression analysis are listed below (a short code sketch follows the list):

  1. R squared or Coefficient of Determination: The most commonly used metric for model evaluation in regression analysis is R squared. It measures the proportion of the total variation in the target that is explained by the model. The value of R squared lies between 0 and 1; the closer it is to 1, the better the model. \[ R^2 = 1 - \frac{SS_{RES}}{SS_{TOT}} = 1 - \frac{\sum(y_i - \hat{y_i})^2}{\sum(y_i - \bar{y})^2} \] where \(SS_{RES}\) is the Residual Sum of squares and \(SS_{TOT}\) is the Total Sum of squares.

  2. Adjusted R squared: It is an improvement over R squared. The drawback of \(R^2\) is that its value increases whenever a feature is added, even if that feature does not improve the model, which gives the illusion of a good model. Adjusted \(R^2\) solves this drawback by penalizing predictors that do not improve the model, so it reflects the real improvement of the model. Adjusted \(R^2\) is always less than or equal to \(R^2\). $$ R^2_{adjusted} = 1 - \frac{(1 - R^2)(N - 1)}{N - p - 1} $$ where,

    • \(R^2\) = sample R squared
    • \(p\) = Number of predictors
    • \(N\) = Total sample size.
  3. Mean Squared Error (MSE): Another common evaluation metric is the mean squared error, which is the mean of the squared differences between actual and predicted values. \[ MSE = \frac{1}{n} \sum_{i=1} ^{n} ( y_i - \hat{y_i})^2 \]

  4. Root Mean Squared Error (RMSE): It is the square root of MSE, i.e., the root of the mean squared difference between actual and predicted values. Unlike MSE, RMSE is expressed in the same units as the target variable, which makes it easier to interpret while still penalizing large errors. \[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1} ^{n} (y_i - \hat{y_i})^2} \]
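
The metrics above are straightforward to compute. As a sketch, the code below uses scikit-learn for \(R^2\) and MSE and a couple of NumPy lines for RMSE and adjusted \(R^2\) (which scikit-learn does not provide directly); the data is a small illustrative toy set.

```python
# Sketch: common regression metrics with scikit-learn and NumPy.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
Y = np.array([3.1, 4.9, 7.2, 8.8, 11.1])

model = LinearRegression().fit(X, Y)
Y_hat = model.predict(X)

r2 = r2_score(Y, Y_hat)                        # coefficient of determination
mse = mean_squared_error(Y, Y_hat)             # mean squared error
rmse = np.sqrt(mse)                            # root mean squared error

N, p = X.shape                                 # sample size and number of predictors
adj_r2 = 1 - (1 - r2) * (N - 1) / (N - p - 1)  # adjusted R squared

print("R^2:", r2)
print("Adjusted R^2:", adj_r2)
print("MSE:", mse)
print("RMSE:", rmse)
```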