Introduction
The objective of the support vector machine algorithm is to find a hyperplane in an N-dimensional space (N — the number of features) that distinctly classifies the data points.
Hyperplanes are decision boundaries that help classify the data points. The dimension of the hyperplane depends upon the number of features. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.
Support vectors are data points that are closer to the hyperplane and influence the position and orientation of the hyperplane. Using these support vectors, we maximize the margin of the classifier. Deleting the support vectors will change the position of the hyperplane. These are the points that help us build our SVM.
In SVM, we take the output of the linear function and if that output is greater than 1, we identify it with one class and if the output is -1, we identify is with another class. Since the threshold values are changed to 1 and -1 in SVM, we obtain this reinforcement range of values([-1,1]) which acts as margin.
How SVMs work
To separate the two classes of data points, there are many possible hyperplanes that could be chosen. Our objective is to find a plane that has the maximum margin i.e the maximum distance between data points of both classes. Maximizing the margin distance provides some reinforcement so that future data points can be classified with more confidence.
- Identify the right hyper-plane (Scenario-1): Here, we have three hyper-planes (A, B, and C). Now, identify the right hyper-plane to classify stars and circles.
- Identify the right hyper-plane (Scenario-2): Here, we have three hyper-planes (A, B, and C) and all are segregating the classes well. Now, How can we identify the right hyper-plane?
- Identify the right hyper-plane (Scenario-3): Hint: Use the rules as discussed in previous section to identify the right hyper-plane
- Can we classify two classes (Scenario-4)?: Below, I am unable to segregate the two classes using a straight line, as one of the stars lies in the territory of other(circle) class as an outlier.
- Find the hyper-plane to segregate to classes (Scenario-5): In the scenario below, we can’t have linear hyper-plane between the two classes, so how does SVM classify these two classes? Till now, we have only looked at the linear hyper-plane.
Cost Function
In the SVM algorithm, we are looking to maximize the margin between the data points and the hyperplane. The loss function that helps maximize the margin is hinge loss. \[ c(x, y, f(x)) = \left \{ \begin{array}{ c l } 0, & \quad \textrm{if } y * f(x) \geq 1 \\ 1 - y * f(x), & \quad \textrm{otherwise} \end{array} \right. \]
\[ c(x, y, f(x)) = (1 - y * f(x))_+ \]
The cost is 0 if the predicted value and the actual value are of the same sign. If they are not, we then calculate the loss value. We also add a regularization parameter the cost function. The objective of the regularization parameter is to balance the margin maximization and loss. After adding the regularization parameter, the cost functions looks as below. \[ min_w\lambda || w ||^2 + \sum_{i=1}^{n}(1-y_i\langle x_i, w \rangle )_+ \]
Cost Minimization
Now that we have the loss function, we take partial derivatives with respect to the weights to find the gradients. Using the gradients, we can update our weights. \[ \frac{\delta}{\delta w_k} \lambda ||w||^2 = 2 \lambda w_k \]
\[ \frac{\delta}{\delta w_k} (1 - y_i \langle x_i, w \rangle )_+ = \left \{ \begin{array}{ c l } 0, & \quad \textrm{if } y_i\langle x_i, w \rangle \geq 1 \\ -y_ix_{ik}, & \quad \textrm{otherwise} \end{array} \right. \]
When there is no misclassification, i.e our model correctly predicts the class of our data point, we only have to update the gradient from the regularization parameter. \[ w = w - \alpha \cdot (2\lambda w) \]
When there is a misclassification, i.e our model make a mistake on the prediction of the class of our data point, we include the loss along with the regularization parameter to perform gradient update. \[ w = w + \alpha \cdot (y_i \cdot x_i - 2\lambda w) \]
Kernel Functions
Linear
These are commonly recommended for text classification because most of these types of classification problems are linearly separable.
The linear kernel works really well when there are a lot of features, and text classification problems have a lot of features. Linear kernel functions are faster than most of the others and you have fewer parameters to optimize.
Here's the function that defines the linear kernel: \[ f(X) = w^T*X + b \]
Polynomial
The polynomial kernel isn't used in practice very often because it isn't as computationally efficient as other kernels and its predictions aren't as accurate.
Here's the function for a polynomial kernel: \[ f(X_1, X_2) = (a + X_1^T * X_2)^b \]
This is one of the more simple polynomial kernel equations you can use. \(f(X_1, X_2)\) represents the polynomial decision boundary that will separate your data. \(X_1\) and \(X_2\) represent your data.
Gaussian Radial Basis Function (RBF)
One of the most powerful and commonly used kernels in SVMs. Usually the choice for non-linear data.
Here's the equation for an RBF kernel: \[ f(X_1, X_2) = exp(-\gamma * ||X_1 - X_2||^2) \]
In this equation, \(\gamma\) specifies how much a single training point has on the other data points around it. \(||X_1 - X_2||\) is the dot product between your features.
Sigmoid
More useful in neural networks than in support vector machines, but there are occasional specific use cases.
Here's the function for a sigmoid kernel: \[ f(X, y) = \tanh(\alpha * X^T * y + C) \]
In this function, \(\alpha\) is a weight vector and \(C\) is an offset value to account for some mis-classification of data that can happen.
Others
There are plenty of other kernels you can use for your project. This might be a decision to make when you need to meet certain error constraints, you want to try and speed up the training time, or you want to super tune parameters.
Some other kernels include: ANOVA radial basis, hyperbolic tangent, and Laplace RBF.