*The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N the number of features) that distinctly classifies the data points.*

Hyperplanes are decision boundaries that help classify the data points. The dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.

Support vectors are the data points that lie closest to the hyperplane and influence its position and orientation. Using these support vectors, we maximize the margin of the classifier. Deleting a support vector will change the position of the hyperplane. These are the points that help us build our SVM.

In SVM, we take the output of the linear function: if that output is greater than 1, we identify the point with one class, and if the output is less than -1, we identify it with the other class. Since the threshold values are set to 1 and -1 in SVM, we obtain this reinforcement range of values ([-1, 1]) which acts as the margin.

To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find a plane that has the maximum margin i.e the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.

**Identify the right hyper-plane (Scenario-1):** Here, we have three hyper-planes (A, B, and C). Now, identify the right hyper-plane to classify stars and circles.

**Identify the right hyper-plane (Scenario-2):** Here, we have three hyper-planes (A, B, and C) and all segregate the classes well. Now, how can we identify the right hyper-plane?

**Identify the right hyper-plane (Scenario-3):** Hint: use the rules discussed in the previous scenarios to identify the right hyper-plane.

**Can we classify two classes (Scenario-4)?:** Here, we are unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.

**Find the hyper-plane to segregate two classes (Scenario-5):** In this scenario, we can't have a linear hyper-plane between the two classes, so how does SVM classify them? Till now, we have only looked at the linear hyper-plane.

In the SVM algorithm, we are looking to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is **hinge loss**.
\[
c(x, y, f(x)) = \left \{
\begin{array}{ c l }
0, & \quad \textrm{if } y * f(x) \geq 1 \\
1 - y * f(x), & \quad \textrm{otherwise}
\end{array}
\right.
\]

\[ c(x, y, f(x)) = (1 - y * f(x))_+ \]
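In code, the hinge loss for a linear scorer can be sketched in a few lines. This is a minimal NumPy version, assuming labels are encoded as -1 and +1:

```python
import numpy as np

def hinge_loss(x, y, w):
    """c(x, y, f(x)) = max(0, 1 - y * f(x)) for a linear scorer f(x) = <x, w>.
    The label y is assumed to be encoded as -1 or +1."""
    fx = np.dot(x, w)
    return float(max(0.0, 1.0 - y * fx))
```

A point classified correctly and beyond the margin (y * f(x) >= 1) costs nothing; anything else is penalized linearly.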

The cost is 0 if the predicted value and the actual value are of the same sign. If they are not, we then calculate the loss value. We also add a regularization parameter to the cost function. The objective of the regularization parameter is to balance margin maximization and loss. After adding the regularization parameter, the cost function looks as below. \[ min_w\lambda || w ||^2 + \sum_{i=1}^{n}(1-y_i\langle x_i, w \rangle )_+ \]

Now that we have the loss function, we take partial derivatives with respect to the weights to find the gradients. Using the gradients, we can update our weights. \[ \frac{\delta}{\delta w_k} \lambda ||w||^2 = 2 \lambda w_k \]

\[ \frac{\delta}{\delta w_k} (1 - y_i \langle x_i, w \rangle )_+ = \left \{ \begin{array}{ c l } 0, & \quad \textrm{if } y_i\langle x_i, w \rangle \geq 1 \\ -y_ix_{ik}, & \quad \textrm{otherwise} \end{array} \right. \]

When there is no misclassification, i.e. our model correctly predicts the class of our data point, we only have to update the gradient from the regularization parameter. \[ w = w - \alpha \cdot (2\lambda w) \]

When there is a misclassification, i.e. our model makes a mistake on the prediction of the class of our data point, we include the loss along with the regularization parameter to perform the gradient update. \[ w = w + \alpha \cdot (y_i \cdot x_i - 2\lambda w) \]
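The two update rules above can be combined into a small training loop. Below is a minimal sketch: no bias term (matching the \(\langle x_i, w \rangle\) formulation in the derivation), labels encoded as -1/+1, and a made-up toy dataset for illustration:

```python
import numpy as np

def train_linear_svm(X, y, lr=0.01, lam=0.01, epochs=1000):
    """Gradient-descent sketch of the two update rules above.
    Labels y must be encoded as -1 or +1; no bias term."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * np.dot(xi, w) >= 1:
                # Correct side of the margin: only the regularization gradient
                w -= lr * (2 * lam * w)
            else:
                # Misclassified or inside the margin: include the hinge-loss gradient
                w += lr * (yi * xi - 2 * lam * w)
    return w

# Made-up, linearly separable toy data
X = np.array([[2.0, 1.0], [3.0, 2.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = train_linear_svm(X, y)
```

After training, the sign of \(\langle x, w \rangle\) gives the predicted class for each point.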

Linear kernels are commonly recommended for text classification because most of these classification problems are linearly separable.

The linear kernel works really well when there are a lot of features, and text classification problems have a lot of features. Linear kernel functions are faster than most of the others and you have fewer parameters to optimize.

Here's the function that defines the linear kernel: \[ f(X) = w^T*X + b \]

The polynomial kernel isn't used in practice very often because it isn't as computationally efficient as other kernels and its predictions aren't as accurate.

Here's the function for a polynomial kernel: \[ f(X_1, X_2) = (a + X_1^T * X_2)^b \]

This is one of the more simple polynomial kernel equations you can use. \(f(X_1, X_2)\) represents the polynomial decision boundary that will separate your data. \(X_1\) and \(X_2\) represent your data.

The RBF kernel is one of the most powerful and commonly used kernels in SVMs, and is usually the choice for non-linear data.

Here's the equation for an RBF kernel: \[ f(X_1, X_2) = exp(-\gamma * ||X_1 - X_2||^2) \]

In this equation, \(\gamma\) specifies how much influence a single training point has on the data points around it, and \(||X_1 - X_2||\) is the Euclidean distance between your feature vectors.

The sigmoid kernel is more useful in neural networks than in support vector machines, but there are occasional specific use cases.

Here's the function for a sigmoid kernel: \[ f(X, y) = \tanh(\alpha * X^T * y + C) \]

In this function, \(\alpha\) is a slope parameter and \(C\) is an offset value to account for some misclassification of data that can happen.

There are plenty of other kernels you can use for your project. This might be a decision to make when you need to meet certain error constraints, you want to try and speed up the training time, or you want to super tune parameters.

Some other kernels include: ANOVA radial basis, hyperbolic tangent, and Laplace RBF.
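The kernels discussed above can be compared side by side on a toy dataset that is not linearly separable. This sketch assumes scikit-learn is installed; the dataset and parameters are illustrative, not a recommendation:

```python
# Comparing the four kernels above on two concentric rings of points,
# a dataset no straight line can separate
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale")  # gamma only affects the non-linear kernels
    clf.fit(X, y)
    scores[kernel] = clf.score(X, y)  # training accuracy
print(scores)
```

On this data the RBF kernel should separate the two rings almost perfectly, while the linear kernel cannot do much better than chance.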

Classification is a two-step process in machine learning: a learning step and a prediction step. In the learning step, the model is developed based on given training data. In the prediction step, the model is used to predict the response for given data. Decision Tree is one of the easiest and most popular classification algorithms to understand and interpret.

Decision Tree algorithm belongs to the family of supervised learning algorithms. Unlike other supervised learning algorithms, the decision tree algorithm can be used for solving **regression and classification problems** too.

In a Decision tree, there are two types of nodes: the **Decision Node** and the **Leaf Node**. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches. A decision tree simply asks a question, and based on the answer (Yes/No), it further splits the tree into subtrees.

*It is a graphical representation for getting all the possible solutions to a problem/decision based on given conditions.*

In Decision Trees, to predict a class label for a record we start from the **root** of the tree. We compare the value of the root attribute with the record's attribute. On the basis of the comparison, we follow the branch corresponding to that value and jump to the next node.

In order to build a tree, we use the **CART algorithm**, which stands for **Classification and Regression Tree algorithm**.

- **Root Node:** It represents the entire population or sample, and this further gets divided into two or more homogeneous sets.
- **Splitting:** The process of dividing a node into two or more sub-nodes.
- **Decision Node:** When a sub-node splits into further sub-nodes, it is called a decision node.
- **Leaf / Terminal Node:** Nodes that do not split are called leaf or terminal nodes.
- **Pruning:** Removing sub-nodes of a decision node is called pruning. You can say it is the opposite of splitting.
- **Branch / Sub-Tree:** A subsection of the entire tree is called a branch or sub-tree.
- **Parent and Child Node:** A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.

- In the beginning, the whole training set is considered as the **root**.
- Feature values are preferred to be categorical. If the values are continuous, they are discretized prior to building the model.
- Records are **distributed recursively** on the basis of attribute values.
- The order of placing attributes as the root or internal nodes of the tree is decided using some statistical approach.

The decision of making strategic splits heavily affects a tree's accuracy. The decision criteria are different for classification and regression trees.

Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of resultant sub-nodes. In other words, we can say that the purity of the node increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split which results in most homogeneous sub-nodes.

The algorithm selection is also based on the type of target variables.

Below are some algorithms used in Decision Trees:

- **ID3** (Iterative Dichotomiser 3)
- **C4.5** (successor of ID3)
- **CART** (Classification And Regression Tree)
- **CHAID** (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
- **MARS** (Multivariate Adaptive Regression Splines)

The most widely used among them are ID3 and CART. Thus, we will discuss these two algorithms here.

Information gain is the measurement of the change in entropy after the segmentation of a dataset based on an attribute. It calculates how much information a feature provides us about a class. According to the value of information gain, we split the node and build the decision tree. A decision tree algorithm always tries to maximize the value of information gain, and a node/attribute having the highest information gain is split first. It can be calculated using the below formula: \[ Information \hspace{1.25mm} Gain = Entropy(S) - [(Weighted \hspace{1.25mm} Avg) \times Entropy(each \hspace{1.25mm} feature)] \]

**Entropy:** Entropy is a metric to measure the impurity in a given attribute. It specifies randomness in data. Entropy can be calculated as:
\[
Entropy(s) = -P(yes)\log P(yes) - P(no) \log P(no)
\]
Where,

- **S** = the set of samples
- **P(yes)** = probability of yes
- **P(no)** = probability of no

Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm. An attribute with a low Gini index should be preferred over one with a high Gini index. The CART algorithm uses the Gini index to create binary splits, and it only creates binary splits. Gini index can be calculated using the below formula: \[ Gini \hspace{1.25mm} Index = 1 - \sum_{j}P_j^2 \]
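Both impurity measures, and the information gain built on top of entropy, are easy to compute directly. Here is a minimal sketch, assuming base-2 logarithms for entropy (the usual convention):

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum p * log2(p) over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def gini_index(labels):
    """Gini Index = 1 - sum p_j^2 over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - (p ** 2).sum())

def information_gain(parent, children):
    """Entropy of the parent node minus the weighted average entropy of its children."""
    n = sum(len(c) for c in children)
    weighted_avg = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted_avg
```

A pure node has entropy 0 and Gini 0; a 50/50 node has entropy 1 and Gini 0.5, so a split that produces two pure children from a 50/50 parent has an information gain of 1.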

*Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal decision tree.*

A too-large tree increases the risk of overfitting, and a small tree may not capture all the important features of the dataset. A technique that decreases the size of the learned tree without reducing accuracy is therefore known as pruning. There are mainly **two types of tree pruning techniques** used:

- Cost Complexity Pruning
- Reduced Error Pruning

- Simple to understand, interpret and visualize.
- Little effort is required for data preparation.
- It can handle both numerical and categorical data.
- Non-linear parameters don't affect its performance, i.e. even if the data doesn't fit an easy curved graph, we can still use it to build an effective decision tree.

- **Overfitting:** Overfitting occurs when an algorithm captures noise in the data, so it starts solving for one specific case rather than finding a general solution.
- **High Variance:** The model can become unstable due to small variations in the data.
- **Low-bias Tree:** A highly complicated decision tree tends to have low bias, which makes it difficult for the model to generalize to new data.

A Naive Bayes classifier is a supervised machine learning model that's used for classification tasks. As a classifier, it is used in face recognition, weather prediction, medical diagnosis, news classification, spam filtering, etc. The Naive Bayes classifier works well on large datasets, and the model can be updated with new training data without having to be rebuilt from scratch. The crux of the classifier is based on the **Bayes' theorem**.

It is called Naive because it assumes that the occurrence of a certain feature is **independent** of the occurrence of other features. For example, if a fruit is identified on the basis of colour, shape, and taste, then a red, spherical, and sweet fruit is recognised as an apple. Hence each feature individually contributes to identifying it as an apple without depending on the others.

Naive Bayes Classifier works on the principles of conditional probability as given by the **Bayes' Theorem**.

Bayes' Theorem gives the conditional probability of an event A given another event B has occurred.
\[
P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}
\]
Using Bayes' theorem, we can find the probability of **A** happening given that **B** has occurred. Here, **B** is the evidence and **A** is the hypothesis. The assumption made here is that the predictors/features are independent, that is, the presence of one particular feature does not affect the others. Hence it is called naive.

Multinomial naive Bayes is mostly used for document classification problems, i.e. whether a document belongs to the category of sports, politics, technology, etc. The features/predictors used by the classifier are the frequencies of the words present in the document.

Bernoulli naive Bayes is similar to multinomial naive Bayes, but the predictors are boolean variables. The parameters we use to predict the class variable take only the values yes or no, for example whether a word occurs in the text or not.

In Gaussian naive Bayes, the predictors take a continuous value rather than discrete values, and we assume that these values are sampled from a Gaussian distribution.

Since the way the values are present in the dataset changes, the formula for conditional probability changes to, \[ P(x_i | y) = \frac{1}{\sqrt{2\pi\sigma_y^2}}exp(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}) \]
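The Gaussian likelihood formula above can be evaluated directly. A minimal sketch (in practice, the mean and standard deviation for each class are estimated from the training data):

```python
import numpy as np

def gaussian_likelihood(x, mu, sigma):
    """P(x_i | y) from the formula above, given the class mean mu and
    standard deviation sigma."""
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
```

The likelihood peaks at the class mean and falls off symmetrically on either side, which is exactly what the exponent in the formula encodes.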

Logistic Regression is a supervised machine learning algorithm used to classify a target variable based on one or more continuous or discrete predictor features.

It is the same as linear regression on the inside; the only difference is that it uses a sigmoid function to predict the output rather than the linear function used in linear regression. It can be used in binary classification problems, where the value of the output ranges between [0, 1].

As we know, the equation for linear regression is: \[ Y = a_0 + a_1X \]

This equation outputs a continuous numerical value. But, while using the logistic regression, we expect discrete outputs. This can be achieved by passing the value of Y through a logistic function. \[ f(Y) = \frac{1}{1 + e^{-Y}} \]

This can also be written as: \[ f(Y) = \frac{1}{1+ e^{-(a_0 + a_1X)}} \]
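The logistic function itself is a one-liner. A minimal sketch:

```python
import numpy as np

def sigmoid(y):
    """f(Y) = 1 / (1 + e^{-Y}): squashes any real-valued input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-y))
```

Note that sigmoid(0) is exactly 0.5, which is why 0.5 is the natural default threshold for deciding between the two classes.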

We saw above that logistic regression is used to predict the classes of target variable. For that, we have to pass the output of linear regression through a logistic/sigmoid function.

A sigmoid function is the one whose output ranges from 0 to 1. Below is the graph for sigmoid function.

[PICTURE]

We can see here that the data points are either at 0 or 1. In logistic regression, we specify a threshold value. Any value greater than the threshold will be considered positive or True or 1, and any value less than the threshold will be considered negative or False or 0.

Cost function is the function that represents the average error between actual and predicted values. Minimizing the cost function results in building an accurate model.

You might remember the cost function we used in linear regression; let me tell you, it won't work here in the case of logistic regression. In linear regression, Mean Squared Error (MSE) gave us a convex function, so we could apply gradient descent on it to minimize the cost. In logistic regression, \(\hat{Y}\) is a non-linear function of the parameters, so if we apply MSE to it, it gives us a non-convex function.

There will be complications if we apply gradient descent technique to minimize a non-convex function.

Another reason is that in classification problems the target values are 0/1, so \((\hat{Y} - Y)^2\) will always be between 0 and 1, which can make it difficult to keep track of the errors and to store such high-precision floating point numbers.

Thus we use another cost function in case of logistic regression. This cost function is called **binary cross entropy** aka **log loss**.

Log loss is the most important classification metric based on probabilities. It's hard to interpret raw log-loss values, but log-loss is still a good metric for comparing models. For any given problem, a lower log-loss value means better predictions.

Mathematical Interpretation,

Log loss is the negative average of the log of the corrected predicted probabilities for each instance.

There are three steps to find Log Loss:

- Find the corrected probabilities.
- Take the log of the corrected probabilities.
- Take the negative average of the values from step 2.

For example,

[PICTURE]

The model gives predicted probabilities as shown above:

By default, the output of logistic regression is the probability of the sample being positive (indicated by 1), i.e. if a logistic regression model is trained to classify on a 'company dataset', then the predicted-probability column says what the probability is that the person has bought a jacket. Here, in the above dataset, the probability that the person with ID6 will buy a jacket is 0.94.

In the same way, the probability that a person with ID5 will buy a jacket (i.e. belong to class 1) is 0.1 but the actual class for ID5 is 0, so the probability for the class is (1 - 0.1) = 0.9. So, the corrected probability for ID5 is 0.9.

[PICTURE]

We will find a log of corrected probabilities for each instance.

[PICTURE]

As you can see, these log values are negative. We take the negative average of these values to maintain the common convention that lower loss scores are better. \[ \textrm{log loss} = - \frac{1}{N}\sum_{i=1} ^{N} log(P_i) \] where \(P_i\) is the corrected probability for instance \(i\).
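The three steps above can be sketched directly in NumPy. Labels are assumed to be 0/1, and any probabilities you feed in are for illustration:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred_prob):
    """Log loss: the negative average of the log of the corrected probabilities."""
    y_true = np.asarray(y_true)
    p = np.asarray(y_pred_prob, dtype=float)
    # Step 1: corrected probability = predicted probability of the ACTUAL class
    corrected = np.where(y_true == 1, p, 1.0 - p)
    # Steps 2 and 3: take the log, then the negative average
    return float(-np.mean(np.log(corrected)))
```

A confident correct prediction (e.g. 0.9 for a true positive) yields a small loss, while a hesitant one (0.5) yields a larger loss, matching the "lower is better" convention.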

The benefits of taking logarithm in the cost function reveal themselves when you look at the cost function graphs for actual class 0 and 1.

[Graph here]

- The Red line represents 1 class. As we can see, when the predicted probability (x-axis) is close to 1, the loss is less and when the predicted probability is close to 0, loss approaches infinity.
- The Black line represents 0 class. As we can see, when the predicted probability (x-axis) is close to 0, the loss is less and when the predicted probability is close to 1, loss approaches infinity.

- **Confusion Matrix:** A confusion matrix is a table used to define the performance of a classification algorithm. It visualizes and summarizes the predictions against the actual classes.
- **Classification Report:** There are three parameters in this:
  - **Precision:** The number of positive patterns predicted correctly divided by the total number of patterns predicted as positive.
  - **Recall:** The number of positive patterns predicted correctly divided by the total number of patterns actually in the positive class.
  - **F1 Score:** The harmonic mean of the precision and recall values.

- **ROC Curve:** The "Receiver Operating Characteristic" curve plots the true positive rate against the false positive rate. The area under it (AUC) lies between 0 and 1, where 1 stands for a perfect model and 0.5 means no better than random guessing. If the ROC AUC is 0.78, it means there is a 78% chance that the model ranks a randomly chosen positive sample higher than a randomly chosen negative one.
- **Accuracy Score:** The usual metric for the overall accuracy of the model.

Linear Regression is a supervised machine learning algorithm which finds a best fit line between a dependent variable and one or more independent variables. This means that linear regression finds a linear relationship between dependent and independent variables.

It is generally used to predict the trend hidden in the features. Its output is continuous.

The best fit line is the one for which total prediction error (all data points) is as small as possible. Error is the distance between the data point to the regression line.

Linear Regression is of two types:

- **Simple Linear Regression:** In simple linear regression, the model has to find the best fit line between one target and one predictor feature.
- **Multiple Linear Regression:** In multiple linear regression, there is one target feature and more than one predictor feature.

Mathematically, a simple linear regression model can be represented as: $$Y = a_0 + a_1X$$ Here,

- \(Y\) = Target/Dependent Variable
- \(X\) = Predictor/Independent Variable
- \(a_1\) = Coefficient or Slope
- \(a_0\) = Y - Intercept

This equation might sound similar to you. Yes, it is the equation of a straight line in slope-intercept form.

Linear Regression model trains on \(X\) and \(Y\) features to estimate the values of \(a_0\) and \(a_1\), which then are stored for future predictions.

Errors are calculated between the actual values of target variable(\(Y\)) and predicted values of target variable(\(\hat{Y}\)).

Error = Actual values (\(Y\)) - Predicted values (\(\hat{Y}\))

Sum of Errors = \(\sum(Y - \hat{Y})\)

Square of Sum of Errors = \(\sum(Y - \hat{Y})^2\)

Mathematically, \[ Cost(J) = \sum (Y_i - \hat{Y_i} )^2 \] If the error between actual and predicted values is high, we have to minimize the error to get the best model. This can be done using Gradient Descent technique.

Gradient Descent is used to minimize the error or cost of our model. To understand gradient descent in simple terms, let's think of a goat at the top of a mountain wanting to climb down. It can't just jump off the mountain but will descend in the direction of the slope one step at a time. This analogy is used in gradient descent. Each step of the goat in our case is one iteration where we update the parameters.

[PICTURE]

Learning Rate is very important in gradient descent. It determines the size of the step the goat takes at a time. If the step is too large, we may miss the local minimum, and if the step is too small, it may take many iterations to arrive at the minimum. It is denoted by alpha (\(\alpha\)).

[PICTURE]

Now, you may be wondering, how do we actually minimize the error using gradient descent. Let's see that:

Here, we'll need some knowledge of calculus. \[ J = \frac{1}{n} \sum_{i=1} ^{n} (pred_i - y_i)^2 \]

\[ J = \frac{1}{n} \sum_{i=1} ^{n} ( a_0 + a_1.x_i - y_i )^2 \]

\[ \frac{\partial J}{\partial a_0} = \frac{2}{n} \sum_{i=1} ^{n} (a_0 + a_1 . x_i - y_i) \Rightarrow \frac{\partial J}{\partial a_0} = \frac{2}{n} \sum_{i=1} ^{n} ( pred_i - y_i) \]

\[ \frac{\partial J}{\partial a_1} = \frac{2}{n} \sum_{i=1} ^{n} (a_0 + a_1 . x_i - y_i) . x_i \Rightarrow \frac{\partial J}{\partial a_1} = \frac{2}{n} \sum_{i=1} ^{n} ( pred_i - y_i) . x_i \]

To minimize the error, we have to iteratively update the values of \(a_0\) and \(a_1\). To update \(a_0\) and \(a_1\), we take gradients from our cost function. This can be done by taking partial derivatives of cost function with respect to \(a_0\) and \(a_1\). After getting the gradients of \(a_0\) and \(a_1\), we update the previous values of \(a_0\) and \(a_1\) iteratively.

We also have to specify a learning rate (\(\alpha\)) while updating the parameters. As we discussed above, the learning rate is the size of the step that gradient descent takes to converge to the minimum.

\[ a_0 = a_0 - \alpha . \frac{2}{n} \sum_{i=1} ^{n} (pred_i - y_i) \]

\[ a_1 = a_1 - \alpha . \frac{2}{n} \sum_{i=1} ^{n} (pred_i - y_i) . x_i \]
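The two update rules above translate directly into code. A minimal sketch, with made-up data lying exactly on the line \(y = 2x + 1\):

```python
import numpy as np

def fit_simple_linear_regression(x, y, alpha=0.01, epochs=5000):
    """Gradient descent on a0 (intercept) and a1 (slope), following the
    update rules above."""
    a0, a1 = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        pred = a0 + a1 * x
        # Partial derivatives of the cost with respect to a0 and a1
        grad_a0 = (2.0 / n) * np.sum(pred - y)
        grad_a1 = (2.0 / n) * np.sum((pred - y) * x)
        a0 -= alpha * grad_a0
        a1 -= alpha * grad_a1
    return a0, a1

# Made-up data on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1
a0, a1 = fit_simple_linear_regression(x, y)
```

With this learning rate and enough iterations, \(a_0\) and \(a_1\) should converge to roughly 1 and 2, recovering the line the data was generated from.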

To understand how well a regression model performs, model evaluation is necessary. Some of the evaluation metrics used for regression analysis are:

- **R squared or Coefficient of Determination:** The most commonly used metric for model evaluation in regression analysis is R squared. It can be defined as the ratio of explained variation to total variation. The value of R squared lies between 0 and 1; the closer to 1, the better the model. \[ R^2 = 1 - \frac{SS_{RES}}{SS_{TOT}} = 1 - \frac{\sum(y_i - \hat{y_i})^2}{\sum(y_i - \bar{y})^2} \] where \(SS_{RES}\) is the Residual Sum of squares and \(SS_{TOT}\) is the Total Sum of squares.
- **Adjusted R squared:** It is an improvement on R squared. The drawback of \(R^2\) is that as features are added, the value of \(R^2\) also increases, which gives the illusion of a good model. *Adjusted* \(R^2\) solves this drawback of \(R^2\): it only rewards the features which are important for the model and shows the real improvement of the model. *Adjusted* \(R^2\) is always lower than \(R^2\). $$ R^2_{adjusted} = 1 - \frac{(1 - R^2)(N - 1)}{N - p - 1} $$ where,
  - \(R^2\) = sample R squared
  - \(p\) = number of predictors
  - \(N\) = total sample size

- **Mean Squared Error (MSE):** Another common metric for evaluation is mean squared error, which is the mean of the squared differences between actual and predicted values. \[ MSE = \frac{1}{n} \sum_{i=1} ^{n} ( y_i - \hat{y_i})^2 \]
- **Root Mean Squared Error (RMSE):** It is the square root of MSE. Unlike MSE, it is in the same units as the target variable, which makes it easier to interpret; like MSE, it penalizes large errors heavily. \[ RMSE = \sqrt{\frac{1}{n} \sum_{i=1} ^{n} (y_i - \hat{y_i})^2} \]
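The metrics above can be computed in a few lines of NumPy. A sketch following the formulas directly:

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """R squared, MSE and RMSE, following the formulas above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    mse = np.mean((y_true - y_pred) ** 2)
    rmse = np.sqrt(mse)
    return r2, mse, rmse
```

Perfect predictions give an R squared of 1 and zero MSE/RMSE; predicting the mean of the targets for every point gives an R squared of 0.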

Machine learning is a subset of artificial intelligence that is mainly concerned with the development of algorithms which allow a computer to learn from data and past experience on its own. The term machine learning was first introduced by Arthur Samuel in 1959. We can define it in a summarized way as:

Machine learning enables a machine to automatically learn from data, improve performance from experiences, and predict things without being explicitly programmed.

With the help of sample historical data, which is known as training data, machine learning algorithms build a mathematical model that helps in making predictions or decisions without being explicitly programmed. Machine learning brings computer science and statistics together for creating predictive models. Machine learning constructs or uses algorithms that learn from historical data. The more information we provide, the better the performance.

- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning

Supervised learning is a type of machine learning method in which we provide sample labeled data to the machine learning system in order to train it, and on that basis, it predicts the output.

The goal of supervised learning is to map input data to output data. Supervised learning is based on supervision, the same as when a student learns things under the supervision of a teacher. An example of supervised learning is spam filtering.

Supervised learning can be grouped further in two categories of algorithms:

- **Classification**
- **Regression**

Unsupervised learning is a learning method in which a machine learns without any supervision.

The training is provided to the machine with the set of data that has not been labeled, classified, or categorized, and the algorithm needs to act on that data without any supervision. The goal of unsupervised learning is to restructure the input data into new features or a group of objects with similar patterns.

In unsupervised learning, we don't have a predetermined result. The machine tries to find useful insights from the huge amount of data. It can be further classified into two categories of algorithms:

- **Clustering**
- **Association**

Reinforcement learning is a feedback-based learning method, in which a learning agent gets a reward for each right action and a penalty for each wrong action. The agent learns automatically from this feedback and improves its performance. In reinforcement learning, the agent interacts with the environment and explores it. The goal of an agent is to get the most reward points, and hence, it improves its performance.

A robotic dog, which automatically learns the movement of its limbs, is an example of reinforcement learning.
