A42: Support Vector Machines (SVMs) [Behind The Scene!]
Maximum margin hyperplane, cross-entropy vs hinge loss, kernel trick, slack variables, norm, rbf kernel, origin of perceptron, regularizing hyper-parameter
This article is a part of “Data Science from Scratch — Can I to I Can”, A Lecture Notes Book Series. (click here to get your copy today!)
⚠️ We will be creating a dataset for learning purpose in this lecture.
✅ A Suggestion: Open a new jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful…..!”
🧘🏻♂️ 👉🎯 >> Stay calm and focused! >> 🧘🏻♂️ 👉🎯
Introduction to SVMs
1. A recall on regression for classification
2. Support Vector Machine — The SVM
3. How does the SVM classify?
4. The maximum margin hyperplane
5. Why maximize the margin?
6. SVM origins: the perceptron algorithm
7. Finding the maximum margin
8. The hinge loss and non-linearly separable cases
9. Hinge loss and “slack”
10. “C” — The regularizing hyper-parameter
11. SVM in action — visualize the working
- 11.1: Sample data
- 11.2: Linear kernel
12. The kernel trick for non-linearly separable data
- 12.1: Polynomial kernel
- 12.2: RBF kernel
13. Advantages and disadvantages
14. SVM vs Logistic Regression — when to use
15. New terms
- 15.1: Slack Variables
- 15.2: Norm
16. Additional resources
1. A recall on regression for classification
Recall the regression-based algorithms: in linear regression, we use a best-fitted line to predict a continuous target, right?
What if we try to use the linear regression algorithm to predict some class(es) (e.g. 0/1)? It does not sound like a good idea!
We typically convert the categorical targets/labels to integer classes (0/1).
Well, to accomplish a classification task, we would rather think about the line as a boundary that splits the space instead of a line that fits the points.
Here comes logistic regression, where we learned how transfer functions can be used to tackle the classification problem using a linear-regression-based classification algorithm. The logistic transfer function converts real numbers to probabilities, and the algorithm then makes use of these class probabilities as a proxy for the class predictions. This created a quandary: we are tackling a problem where the answer is yes/no or 0/1, and our typical old loss functions (MSE, MAE etc.) are not helpful. We had to use the log-loss, or binary cross-entropy, loss function to get this done.
👉 see eq. 5.10 in section 5.3 — The cross-entropy loss function
The cross-entropy loss function returns a score that summarizes the average difference between the actual and predicted probability distributions for predicting class 1. The score is minimized, and a perfect cross-entropy value of 0 corresponds to an accuracy of 1.
We said the best possible line acts as a boundary, and ideally we expect this boundary to make as few classification mistakes as possible. For evaluation, we look at the true label of each observation; if it matches the predicted label, the loss is 0, and if it does not match, the loss is 1.
If we think about this misclassification loss, we are actually treating every misclassification as equally bad. However, some misclassifications could be worse than others, based on how far they sit from the decision boundary. We can imagine how bad such a linear separation of the two classes could be.
Can we think about some alternative approach that deals with “how bad” instead of “equally bad”?
Let’s see how SVMs can help!
2. Support Vector Machine — The SVM
What if we could penalize misclassifications based on how badly they have been misclassified, imposing stronger penalties on observations found deeper in the territory of the other class? The farther the misclassifications are from the decision boundary, the more wrong they are, and thus they deserve a higher penalty. In fact, we would not mind a margin for error as well: even correctly classified observations that are really close to the boundary (almost misclassified) could contribute to the penalty. Makes sense, right?
Well, this is where the Support Vector Machine (SVM) algorithm comes in, which follows a different approach to classification. The algorithm still fits a decision boundary like logistic regression, but uses a different loss function called “the hinge loss”, an alternative to cross-entropy for binary classification problems, used for maximum-margin classification. It was primarily developed for use with Support Vector Machine (SVM) models for binary classification where the targets are -1/1. The function encourages the observations to have the correct sign, assigning more error where there is a sign difference between the true and predicted class labels.
3. How does the SVM classify?
In Figure 1 below, we have two types of example datasets for classification: linearly separable and not linearly separable.
Let’s start with the linearly separable example (Figure 2) to see how the SVM algorithm works for such classification problems. The algorithm fits a decision boundary defined by the largest margin between the closest points of each class. This is commonly called the maximum margin hyperplane (MMH).
The intuition behind finding the decision boundary is very simple: the algorithm looks for the surface between classes that is maximally far away from any data point. The points that the SVM uses to fit that boundary are the support vectors.
In Figure 2 above, two decision boundaries are shown (red and green) along with their margins in dotted lines. The green one has a wider margin than the red one. The double-arrowed lines are the respective distances between the support vectors and the decision boundaries. The green one is the maximum margin hyperplane.
4. The maximum margin hyperplane
For linearly separable data, we can select two parallel hyperplanes (the dotted lines in Figure 3 below) that separate the two classes of data such that the distance between them is as large as possible. The maximum-margin hyperplane is the hyperplane that lies halfway between them, and the bounded region between them is the margin. With a normalized or standardized dataset, these hyperplanes can be described by the following equations (we choose a normalizing constant such that the distance from the plane to the closest points of either class, the support vectors, is 1):
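With the targets encoded as 𝑦 ∈ {−1, 1}, the two margin hyperplanes and the boundary halfway between them are conventionally written as:

```latex
\begin{aligned}
w^{T}x + b &= +1 \quad &&\text{(margin hyperplane touching the class } y=+1\text{)}\\
w^{T}x + b &= -1 \quad &&\text{(margin hyperplane touching the class } y=-1\text{)}\\
w^{T}x + b &= 0  \quad &&\text{(the decision boundary, halfway between them)}
\end{aligned}
```

The perpendicular distance between the two dotted hyperplanes works out to 2/||𝑤||, which is exactly the margin the SVM maximizes.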
5. Why maximize the margin?
The SVM solves for a decision boundary that should, in theory, minimize the generalization error. The observations near the decision boundary separating the classes are those with the most “ambiguous” labels: they are the observations approaching equal probability of belonging to one class or the other (given the predictors).
The SVM is concerned with generalization to new data. Instead of considering all the observations “equally”, the loss function (hinge loss) that the SVM minimizes defines its fit using the most ambiguous points (see the figure above with the green and red planes; the points defining the maximum margin plane, in green, are the most ambiguous). Its decision boundary is safe in the sense that errors and new observations are unlikely to cause the SVM to misclassify.
6. SVM origins: the perceptron algorithm
While discussing the SVM, it is very relevant to look at the perceptron algorithm, a linear function of weights 𝑤 and predictors/features 𝑋 that assigns points to binary classes. If the function returns a value greater than 0, the observation is classified as 1; otherwise it is classified as 0.
Given below, 𝑓(𝑋) is the perceptron function for classification, and 𝑏 (also sometimes represented by 𝑤_0) is known as the “bias”, which is analogous to the intercept term.
If the points are linearly separable, then solving the perceptron is guaranteed to converge on a solution, but that solution is not necessarily optimal for future observations. This led to the invention of the SVM, which finds the optimal discriminator; the maximum margin hyperplane.
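To make the perceptron concrete, here is a minimal sketch of the classic mistake-driven update rule. The toy data, learning rate and function name below are illustrative assumptions of mine, not from the lecture:

```python
import numpy as np

def perceptron_fit(X, y, lr=1.0, epochs=20):
    """Classic perceptron: nudge w and b toward each misclassified point.
    Targets y are 0/1; converges only if the data is linearly separable."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if (w @ xi + b) > 0 else 0  # f(X) = w.X + b, thresholded at 0
            if pred != yi:                        # update weights only on mistakes
                w += lr * (yi - pred) * xi
                b += lr * (yi - pred)
    return w, b

# tiny linearly separable toy data (illustrative)
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -1.5]])
y = np.array([1, 1, 0, 0])
w, b = perceptron_fit(X, y)
preds = [(1 if (w @ xi + b) > 0 else 0) for xi in X]
```

Note that any of the infinitely many separating lines the perceptron can converge to will satisfy it; the SVM, in contrast, picks the unique maximum-margin one.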
7. Finding the maximum margin
So far, we have learned about all the relevant concepts for the SVM, and a valid question is how we can actually find the widest possible plane. We maximize the value of the margin: 2/||𝑤||.
Equivalently, we minimize the norm of the weights, with constraints on what the function can output:
- When the target is 1, the function needs to output 1 or greater.
- When the target is -1 (our encoding of class 0), the function needs to output -1 or lower.
Please note, our 𝑦 is specified as 1/-1, as opposed to the 0/1 that we are used to; this is convenient for converting the problem into a minimization. To do so, we can combine the two constraints into a single equation:
This works out because when 𝑦 = −1, both sides flip sign and the values become positive. (See the equations above.)
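Putting the pieces together, the hard-margin SVM is usually written as the following constrained minimization (standard textbook form, consistent with the 2/||𝑤|| margin above):

```latex
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,(w^{T}x_i + b) \ge 1,\qquad i = 1,\dots,n
```

The single constraint 𝑦_𝑖(𝑤^𝑇𝑥_𝑖 + 𝑏) ≥ 1 covers both targets at once: for 𝑦_𝑖 = 1 it demands an output of 1 or greater, and for 𝑦_𝑖 = −1 it demands an output of −1 or lower.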
8. The hinge loss and non-linearly separable cases
Consider the case where no line/plane can separate all the observations perfectly; here, we need to introduce the capacity for model error. Using the constraint 𝑦⋅(𝑤^𝑇⋅𝑋+𝑏) ≥ 1 (also given above), we can introduce the hinge loss function:
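In standard notation, the hinge loss for a single observation is:

```latex
\ell_{\text{hinge}}\big(y_i, f(x_i)\big) = \max\big(0,\; 1 - y_i\,f(x_i)\big),
\qquad f(x_i) = w^{T}x_i + b
```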
where, for a correct prediction:
- 𝑓(𝑥_𝑖)>1: the point lies outside the margin and does not contribute to the loss.
- 𝑓(𝑥_𝑖)=1: the point is on the margin and does not contribute to the loss.
- 𝑓(𝑥_𝑖)<1: the point lies inside the margin and does contribute to the loss accordingly.
So, in general, whenever 𝑓(𝑥_𝑖) is NOT ≥1, the function (penalty) will be greater than zero and contribute to the loss accordingly for the respective datapoint/observation.
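A quick numeric check of the three cases above (the sample scores here are made up purely for illustration):

```python
import numpy as np

def hinge_loss(y_true, scores):
    """Per-observation hinge loss max(0, 1 - y*f(x)); y_true in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y_true * scores)

y = np.array([  1,   1,   1,  -1])   # true labels
f = np.array([2.0, 1.0, 0.5, 0.5])   # raw outputs w.x + b

losses = hinge_loss(y, f)
# f = 2.0, y = 1  -> outside the margin, no loss
# f = 1.0, y = 1  -> on the margin, no loss
# f = 0.5, y = 1  -> inside the margin, partial loss
# f = 0.5, y = -1 -> wrong sign, larger loss
print(losses)
```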
9. Hinge loss and “slack”
In situations where it is not possible to perfectly separate the classes, the hinge loss with a regularization parameter 𝐶 is helpful:
𝜖_𝑖 are the errors from the classifier, and 𝐶 is a regularization term that determines how much the classification errors matter (relative to the maximization of the margin).
Now, the function that the SVM minimizes to find the boundary will be:
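In standard form, this soft-margin objective (consistent with the notation of the hard-margin problem above) is:

```latex
\min_{w,\,b,\,\epsilon}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\epsilon_i
\quad\text{subject to}\quad
y_i\,(w^{T}x_i + b) \ge 1 - \epsilon_i,\qquad \epsilon_i \ge 0
```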
👉 A small value of 𝐶 creates a wider margin because errors will matter less.
👉 A large 𝐶 creates a tighter margin because errors matter more.
👉 An infinite 𝐶 parameter is a “hard” margin, which always minimizes error over the size of the boundary.
We are trying to minimize the norm of the weights, ||𝑤||, and thus maximize the margin, but now we are also trying to minimize our errors at the same time. During this minimization, we are looking for a balance between how wide the margin should be and how much error we can tolerate.
10. 𝐶 — The regularizing hyper-parameter
- By setting the hyper-parameter 𝐶 we can control the extent to which our SVM is tolerant to misclassification errors. It is sometimes called the soft-margin constant.
- 𝐶 affects the strength of the penalty, similar to the lambda terms in the Ridge and Lasso.
- By multiplying the sum of the errors (the distances from the margins to the points inside the margin), 𝐶 allows the SVM to classify non-linearly separable problems by permitting errors to occur.
- The lower the value of 𝐶, the more misclassification errors are allowed. The quantities measuring these violations are known as “slack variables”. Reducing the effect of errors on the loss function puts the emphasis on widening the margin.
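To see the soft-margin constant in action, here is a small sketch (assuming scikit-learn and an illustrative overlapping-blobs dataset, not the lecture's data) comparing how many support vectors a small vs. a large 𝐶 produces:

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# two overlapping clusters; illustrative data
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=42)

soft = SVC(kernel='linear', C=0.01).fit(X, y)   # small C -> wide, tolerant margin
hard = SVC(kernel='linear', C=100.0).fit(X, y)  # large C -> narrow, strict margin

# a wider margin swallows more points, so more of them become support vectors
print("support vectors (C=0.01):", soft.n_support_.sum())
print("support vectors (C=100): ", hard.n_support_.sum())
```

The small-𝐶 model ends up with many more support vectors, because almost every point near the boundary falls inside its generous margin.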
If you are interested in exploring the maths, section 7.1.1, Overlapping class distributions, in Pattern Recognition And Machine Learning by Christopher Bishop would be greatly helpful; Fig. 7.3 is worth looking at to build understanding.
NOTE:
In real life, most datasets are not linearly separable, and there will be no way to satisfy all the constraints in the equation to find the maximum margin hyperplane. We can still handle such datasets, while learning a useful classifier, by introducing slack variables. These loosen, or allow violation of, certain constraints, i.e., we allow certain training points to fall within the margin (a soft margin); however, we want this number to be as small as possible, and we also want their penetration of the margin to be as small as possible. (We will talk about optimization along the way.)
11. SVM in action — visualize the working
Now that we have enough understanding of the support vector machine algorithm, let’s move on and create interactive plots to see the effect of the C parameter.
First, we need to import the required libraries and create a dataset.
# required imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from ipywidgets import interact

%matplotlib inline
# Retina display to see better quality images.
%config InlineBackend.figure_format = 'retina'

# Lines below are just to ignore warnings
import warnings
warnings.filterwarnings('ignore')
11.1: Sample data
Let’s create a dataset for binary class classification.
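The original notebook cell is not reproduced in this export; a minimal sketch that builds a comparable two-class dataset (using scikit-learn's make_blobs, an assumption on my part) would be:

```python
import pandas as pd
from sklearn.datasets import make_blobs

# two clusters of points -> a binary classification dataset
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.2, random_state=7)

# pack into a DataFrame for convenient inspection
df = pd.DataFrame(X, columns=['feature_1', 'feature_2'])
df['target'] = y
print(df.shape)
print(df['target'].value_counts())
```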
So, now we have the data; let’s move on and write some functions to create the interactive plots and understand the effect of the C parameter. Writing reusable functions is good programming practice!
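The plotting code itself is not shown in this export; the sketch below is my own minimal version (the function name, and the make_blobs data it regenerates internally, are illustrative assumptions). It draws the boundary, the margins and the support vectors, and can be hooked to an ipywidgets slider:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

def plot_svm_boundary(C=1.0, kernel='linear'):
    """Fit an SVC and draw its decision boundary, margins and support vectors."""
    # illustrative two-class data; regenerated on each call for simplicity
    X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.2, random_state=7)
    model = SVC(kernel=kernel, C=C).fit(X, y)

    # evaluate the decision function on a grid covering the data
    xx, yy = np.meshgrid(
        np.linspace(X[:, 0].min() - 1, X[:, 0].max() + 1, 200),
        np.linspace(X[:, 1].min() - 1, X[:, 1].max() + 1, 200),
    )
    Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='coolwarm', s=30)
    # solid line: boundary (f=0); dashed lines: the margins (f=-1 and f=+1)
    plt.contour(xx, yy, Z, levels=[-1, 0, 1],
                linestyles=['--', '-', '--'], colors='k')
    # circle the support vectors
    plt.scatter(*model.support_vectors_.T, s=120,
                facecolors='none', edgecolors='k')
    plt.title(f"kernel={kernel}, C={C}, "
              f"support vectors={len(model.support_vectors_)}")
    return model

# in a notebook, hook the function to a slider:
# interact(plot_svm_boundary, C=(0.01, 10.0, 0.1))
```

Dragging the C slider then shows the margin widening as C shrinks and tightening as C grows.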
11.2: Linear kernel
12. The “kernel trick” for non-linearly separable data
The “kernel trick” is one of the big reasons for the popularity of SVMs. It works by “wrapping” our features (X) in a kernel function that transforms them into a higher dimensional space, which allows the algorithm to classify non-linearly separable data. (See the example figure below.)
The idea behind the kernel trick is to transform the non-linearly separable observations into a different (higher) dimensional space where they are linearly separable. We fit the algorithm in this new higher dimensional space and then invert the transformation, mapping the data and the model itself back into the original space.
In summary, data may be linearly separable in a higher dimensional space, and using the kernel trick, classifiers can be learned for high dimensional feature spaces without actually having to map the points into the high dimensional space.
The figure below should give the general intuition for what is happening in the kernel-trick.
12.1: Polynomial kernel
12.2: RBF kernel
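Since the kernel-specific plots cannot be reproduced here, a small sketch (with scikit-learn's make_circles as an illustrative non-linearly separable dataset) compares the three kernels directly:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# concentric circles: impossible to split with a straight line
X, y = make_circles(n_samples=200, noise=0.1, factor=0.3, random_state=0)

for kernel in ['linear', 'poly', 'rbf']:
    model = SVC(kernel=kernel, C=1.0).fit(X, y)
    print(f"{kernel:>6} kernel, training accuracy: {model.score(X, y):.2f}")
```

On this data the linear kernel hovers near chance, while the RBF kernel separates the rings almost perfectly.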
Takeaway: Bias/variance trade-off is handled via the hyper-parameter 𝐶
- Large 𝐶 → narrow margin, less tolerant of misclassification, tends toward high variance
- Small 𝐶 → wider margin, more tolerant of misclassification, tends toward high bias
13. Advantages and disadvantages
The SVM is a widely used algorithm in business settings and has some advantages and disadvantages.
Advantages:
- Robust to outliers
- Effective in high dimensional data
- Can work with non-linearities
- Fast to compute even on non-linear data (via the kernel trick)
- Low risk of over-fitting
- Exceptional performance
Disadvantages:
- Black-box
- Can be slow on large datasets
14. SVM vs Logistic Regression — when to use
Recall the “no free lunch theorem”: when to use which algorithm is a tricky question. However, according to expert advice:
- If there are more features than training samples: Use logistic regression or SVM without a kernel (linear kernel)
- If there are about 10 times as many samples as features: Use SVM with a Gaussian kernel
- If there are many more training samples than features: Spend time feature engineering, then use logistic regression or SVM without a kernel (linear kernel)
15. New terms
15.1: Slack Variables:
In optimization, particularly in linear programming problems, slack variables are introduced for inequality constraints so that they can be transformed into equalities. Each inequality introduces one slack variable, e.g.:
x + 5y ≤ 50
can be written as
x + 5y + sv - 50 = 0
after introducing sv (with sv ≥ 0) as a slack variable to replace the inequality. This is helpful and allows us to use the simplex method in mathematical optimization.
Convex Optimization by Stephen Boyd and Lieven Vandenberghe is overall a great read; sections 4.1.3 (Equivalent problems) and 4.2.4 (Equivalent convex problems) could be helpful to understand more about slack variables.
15.2: Norm:
A norm is a function that takes a vector from our vector space (real or complex) as input and returns a non-negative real number telling how big the vector is, e.g. distance traveled, etc. A good read
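For instance, a quick illustration with NumPy:

```python
import numpy as np

v = np.array([3.0, 4.0])

# the Euclidean (L2) norm, the ||w|| used in the SVM objective
l2 = np.linalg.norm(v)         # sqrt(3^2 + 4^2)

# the L1 norm, the sum of absolute values
l1 = np.linalg.norm(v, ord=1)  # |3| + |4|

print(l2, l1)
```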
16. Additional resources
🧘🏻♂️ Nice lectures slides by Prof. Andrew Zisserman, Oxford University, UK.
🧘🏻♂️ A great resource from Stanford University on Support vector machines and machine learning on documents
- A User’s Guide to Support Vector Machines — very useful!
🧘🏻♂️ Nice blog posts series on SVMs:
- SVMs — An overview of Support Vector Machines
- SVM — Understanding the math — Part 1 — The margin
- SVM — Understanding the math — Part 2
- SVM — Understanding the math — the optimal hyperplane
🧘🏻♂️ SVMs Tutorial slides by Jason Weston, NEC Labs America
🧘🏻♂️ Lecture slides by R. Berwick, MIT: An Idiot’s guide to Support vector machines (SVMs)
🧘🏻♂️ Support Vector machines lecture notes from University of Toronto, Canada.
🧘🏻♂️ A paper by Prof. Thorsten Joachims on Text Categorization with Support Vector Machines: Learning with Many Features
🧘🏻♂️ An advanced discussion of SVMs as probabilistic models
🧘🏻♂️ A Tutorial on Support Vector Machines for Pattern Recognition
🧘🏻♂️ on scikit-learn:
🧘🏻♂️ Some nice video lectures:
- Support Vector Machines by Prof. Yaser Abu-Mostafa, Caltech.
- Kernel Methods by Prof. Yaser Abu-Mostafa, Caltech.
- Learning: Support Vector Machines by Prof. Patrick H Winston MIT, USA.
- Practical Machine Learning: Support Vector Machines — Stanford Crowd Course Initiative.
🧘🏻♂️ Some links on loss functions to refresh:
- Non-convex optimization — Nice slides from the University of British Columbia, Canada.
- Minimizing The Misclassification Error Rate Using a Surrogate Convex Loss — The paper shows that the hinge loss is essentially optimal among all convex losses.
- Are Loss Functions All the Same? — A good read!
- The Hinge Loss — Video Lecture
So, this was all about Support Vector Machines (SVMs) for the moment!
Good luck!
See you in the next lecture on “A43: Support Vector Machines (SVMs) — Hands-on [complete project with code]”.
Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:
- Books on leanpub
- SkillShare link (two free months for new subscribers)
- Free on YouTube
- ScienceAcademy
**************************************************************************************************************************************
Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.
**************************************************************************************************************************************