Logistic Regression

Introduction

Logistic Regression is a supervised machine learning algorithm used to classify a target variable based on one or more continuous or discrete predictor features.

Internally it is the same as linear regression; the only difference is that it uses a sigmoid function, rather than the linear function used in linear regression, to predict the output. It can be used in binary classification problems, where the output is a probability in the range [0, 1].

As we know, the equation for linear regression is: \[ Y = a_0 + a_1X \]

This equation outputs a continuous numerical value. But in logistic regression we want a probability between 0 and 1 (from which a discrete class is derived). This can be achieved by passing the value of Y through a logistic function. \[ f(Y) = \frac{1}{1 + e^{-Y}} \]

This can also be written as: \[ f(Y) = \frac{1}{1+ e^{-(a_0 + a_1X)}} \]
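
A quick way to see this composition is in code. Below is a minimal NumPy sketch; the coefficients a0 and a1 are made-up illustrative values, not learned ones.

```python
import numpy as np

def sigmoid(y):
    """Squash any real value into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-y))

# Illustrative coefficients; in practice they are learned from data.
a0, a1 = -1.0, 0.5
X = np.array([-4.0, 0.0, 2.0, 6.0])

Y = a0 + a1 * X     # linear part: any real number
probs = sigmoid(Y)  # logistic part: values in (0, 1)
print(probs)        # [0.047... 0.268... 0.5 0.880...]
```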

How Logistic Regression Works

We saw above that logistic regression is used to predict the classes of the target variable. To do that, we pass the output of linear regression through a logistic/sigmoid function.

A sigmoid function is an S-shaped function whose output always lies between 0 and 1. Below is the graph of the sigmoid function.

[PICTURE]

We can see here that the data points are either at 0 or 1. In logistic regression, we specify a threshold value: any output greater than the threshold is considered positive (True, 1), and any output less than the threshold is considered negative (False, 0).
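
In code, thresholding is a one-liner. A small sketch with made-up probabilities and the usual 0.5 threshold (here a value exactly equal to the threshold is counted as positive):

```python
import numpy as np

probs = np.array([0.12, 0.48, 0.51, 0.94])  # hypothetical model outputs
threshold = 0.5                             # the usual default

labels = (probs >= threshold).astype(int)
print(labels)  # [0 0 1 1]
```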

Cost Function

The cost function represents the average error between the actual and predicted values. Minimizing the cost function results in an accurate model.

You might remember the cost function we used in linear regression; it won't work here for logistic regression. In linear regression, the Mean Squared Error (MSE) gave us a convex function, so we could apply gradient descent to it to minimize the cost. In logistic regression, \(\hat{Y}\) is a non-linear (sigmoid) function of the parameters, so applying MSE to it gives a non-convex function.

  • Gradient descent can get stuck in a local minimum when applied to a non-convex function, so there is no guarantee of reaching the global minimum (see the sketch after this list).

  • Another reason is that in classification problems the target values are 0/1, so \((\hat{Y} - Y)^2\) will always be between 0 and 1. Squaring errors that are already smaller than 1 makes them even smaller, which makes it difficult to keep track of the errors and requires storing high-precision floating-point numbers.
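
To make the convexity argument concrete, here is a small sketch that sweeps a single weight over a tiny made-up dataset and plots both costs. The point at x = 3 is deliberately labeled 0 so the data is not perfectly separable; the MSE curve then flattens into plateaus (non-convex), while the log-loss curve stays bowl-shaped (convex).

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny made-up dataset: one feature, binary labels, one "noisy" point.
x = np.array([1.0, 2.0, 3.0, -1.0])
y = np.array([1.0, 1.0, 0.0, 0.0])

w = np.linspace(-10, 10, 400)  # sweep one weight, no intercept
p = sigmoid(np.outer(w, x))    # predicted probabilities for every w

mse = ((p - y) ** 2).mean(axis=1)
eps = 1e-12                    # guard against log(0)
bce = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)).mean(axis=1)

plt.plot(w, mse, label="MSE (non-convex in w)")
plt.plot(w, bce, label="log loss (convex in w)")
plt.xlabel("weight w")
plt.ylabel("cost")
plt.legend()
plt.show()
```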

Thus we use a different cost function for logistic regression. This cost function is called binary cross-entropy, aka log loss.

Log loss is the most important classification metric based on probabilities. It's hard to interpret raw log-loss values, but log-loss is still a good metric for comparing models. For any given problem, a lower log-loss value means better predictions.

Mathematical Interpretation

Log loss is the negative average of the logs of the corrected predicted probabilities for each instance.

There are three steps to find Log Loss:

  1. Find the corrected probabilities.
  2. Take the log of the corrected probabilities.
  3. Take the negative average of the values from step 2.

For example,

[PICTURE]

The model gives the predicted probabilities shown above:

By default, the output of logistic regression is the probability of the sample being positive (indicated by 1), i.e. if a logistic regression model is trained to classify on a 'company dataset', then the predicted probability column says what the probability is that the person has bought a jacket. In the dataset above, the probability that the person with ID6 will buy a jacket is 0.94.

In the same way, the probability that the person with ID5 will buy a jacket (i.e. belong to class 1) is 0.1, but the actual class for ID5 is 0, so the probability of the actual class is (1 - 0.1) = 0.9. The corrected probability for ID5 is therefore 0.9.
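
Step 1 is easy to express with NumPy. In the sketch below only the values for ID5 (0.1, actual class 0) and ID6 (0.94, actual class 1) come from the example above; the other rows are made up for illustration.

```python
import numpy as np

predicted = np.array([0.60, 0.30, 0.80, 0.20, 0.10, 0.94])  # P(class 1)
actual    = np.array([1,    0,    1,    0,    0,    1])

# Corrected probability: the probability assigned to the *actual* class.
corrected = np.where(actual == 1, predicted, 1 - predicted)
print(corrected)  # [0.6  0.7  0.8  0.8  0.9  0.94]
```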

[PICTURE]

Next, we take the log of the corrected probabilities for each instance.

[PICTURE]

As you can see, these log values are negative. We take the negative average of these values to maintain the common convention that lower loss scores are better. \[ \textrm{log loss} = - \frac{1}{N}\sum_{i=1}^{N} \log(P_i) \] where \(P_i\) is the corrected probability for the \(i\)-th instance.
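
Steps 2 and 3 complete the calculation. Continuing with the same made-up values (redefined here so the sketch is self-contained), and cross-checking against scikit-learn's log_loss, which works directly from the raw predicted probabilities:

```python
import numpy as np
from sklearn.metrics import log_loss

predicted = np.array([0.60, 0.30, 0.80, 0.20, 0.10, 0.94])
actual    = np.array([1,    0,    1,    0,    0,    1])
corrected = np.where(actual == 1, predicted, 1 - predicted)

loss = -np.log(corrected).mean()  # steps 2 and 3 in one line
print(loss)                       # ~0.25

print(log_loss(actual, predicted))  # same value, ~0.25
```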

Cost Function Minimization

The benefits of taking logarithm in the cost function reveal themselves when you look at the cost function graphs for actual class 0 and 1.

[Graph here]

  • The red line represents class 1. When the predicted probability (x-axis) is close to 1, the loss is small, and when the predicted probability is close to 0, the loss approaches infinity.
  • The black line represents class 0. When the predicted probability (x-axis) is close to 0, the loss is small, and when the predicted probability is close to 1, the loss approaches infinity.
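
The two curves are straightforward to reproduce; a minimal matplotlib sketch:

```python
import numpy as np
import matplotlib.pyplot as plt

p = np.linspace(0.001, 0.999, 500)  # predicted probability of class 1

plt.plot(p, -np.log(p), "r", label="actual class 1: -log(p)")
plt.plot(p, -np.log(1 - p), "k", label="actual class 0: -log(1 - p)")
plt.xlabel("predicted probability")
plt.ylabel("loss")
plt.legend()
plt.show()
```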

Performance Metrics

  • Confusion Matrix: A table that summarizes the performance of a classification algorithm by showing, for each class, how many predictions were correct and how many were not.
  • Classification Report: There are 3 parameters in this:
    1. Precision: The fraction of patterns predicted as positive that actually belong to the positive class, i.e. TP / (TP + FP).
    2. Recall: The fraction of patterns in the positive class that are correctly classified, i.e. TP / (TP + FN).
    3. F1 Score: The harmonic mean of the Precision and Recall values.
  • ROC Curve: The "Receiver Operating Characteristic" curve plots the true positive rate against the false positive rate at different threshold values. The area under it (ROC AUC) is a score between 0 and 1, where 1 stands for a perfect model and 0.5 is no better than random guessing. An ROC AUC of 0.78 means there is a 78% chance that the model ranks a randomly chosen positive sample above a randomly chosen negative one.
  • Accuracy Score: The fraction of all predictions that are correct; the usual metric for the overall performance of the model (see the sketch after this list).
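
Here is a minimal scikit-learn sketch that computes all four metrics; the synthetic dataset from make_classification stands in for real data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data as a stand-in for a real dataset.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)              # thresholded class labels
y_prob = model.predict_proba(X_test)[:, 1]  # probability of class 1

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))  # precision, recall, F1
print("ROC AUC:", roc_auc_score(y_test, y_prob))
print("Accuracy:", accuracy_score(y_test, y_pred))
```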