A42: Support Vector Machines (SVMs) [Behind The Scene!]

Maximum margin hyperplane, cross-entropy vs hinge loss, kernel trick, slack variables, norm, rbf kernel, origin of perceptron, regularizing hyper-parameter

Junaid Qazi, PhD
14 min read · Mar 3, 2022

This article is a part of “Data Science from Scratch — Can I to I Can”, a Lecture Notes Book Series. (click here to get your copy today!)

Click here for the previous article/lecture on “A41: Bootstrapping and Confidence Interval — Behind the scene!!”

💐Click here to FOLLOW ME for new content💐

⚠️ We will be creating a dataset for learning purposes in this lecture.

✅ A Suggestion: Open a new jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful!”

🧘🏻‍♂️ 👉🎯 >> Stay calm and focused! >> 🧘🏻‍♂️ 👉🎯

Introduction to SVMs

  1. A recall on regression for classification
  2. Support Vector Machine — The SVM
  3. How does the SVM classify?
  4. The maximum margin hyperplane
  5. Why maximize the margin?
  6. SVM origins: the perceptron algorithm
  7. Finding the maximum margin
  8. The hinge loss and non-linearly separable cases
  9. Hinge loss and slack
  10. “C” — The regularizing hyper-parameter
  11. SVM in action — visualize the working
    11.1: Sample data
    11.2: Linear kernel
  12. The kernel trick for non-linearly separable data
    12.1: Polynomial kernel
    12.2: RBF kernel
  13. Advantages and disadvantages
  14. SVM vs Logistic Regression — when to use
  15. New terms
    15.1: Slack Variables
    15.2: Norm
  16. Additional resources

1. A recall on regression for classification

Recall the regression-based algorithms: in linear regression, we use a best-fitted line to predict a continuous target, right?

What if we try to use the linear regression algorithm to predict some class/es (e.g. 0/1)? It does not sound like a good idea!

We typically convert the categorical targets/labels to integer classes (0/1).

Well, to accomplish a classification task, we would rather think about the line as a boundary that splits the space instead of fitting the points.

This is where logistic regression comes in, where we learned how transfer functions can be used to tackle the classification problem with a linear-regression-based algorithm. The logistic transfer function converts real numbers into probabilities, and the algorithm then makes use of these class probabilities as a proxy for the class predictions. Well, this created a quandary: we are tackling a problem where the answer is no/yes or 0/1 by solving a regression problem, and our typical old loss functions (MSE, MAE, etc.) are not helpful. We had to use the log-loss, or binary cross-entropy loss function, to get this done.

👉 see eq. 5.10 in section 5.3 — The cross-entropy loss function

The cross-entropy loss function returns a score that summarizes the average difference between the actual and predicted probability distributions for predicting class 1. The score is minimized, and a perfect cross-entropy value of 0 corresponds to an accuracy of 1.
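As a quick reminder, the binary cross-entropy (log-loss) over 𝑁 observations, with true labels 𝑦_𝑖 ∈ {0, 1} and predicted probabilities 𝑝_𝑖 for class 1, is commonly written as:

log-loss = −(1/𝑁) ⋅ Σ_𝑖 [ 𝑦_𝑖⋅log(𝑝_𝑖) + (1 − 𝑦_𝑖)⋅log(1 − 𝑝_𝑖) ]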

We talked about the best possible line as a boundary, and ideally we expect this boundary to make as few classification mistakes as possible. For evaluation, we look at the true label of each observation; if it matches the predicted label, the loss is 0, and if it does not match, the loss is 1.

If we think about this misclassification loss, we are actually treating every misclassification as equally bad. However, some misclassifications could be worse than others (depending on how far they sit from the decision boundary). We can imagine how poor such a linear separation of the two classes could be.

Can we think of an alternative approach that treats misclassifications by “how bad” they are, rather than “equally bad”?

Let’s see how SVMs can help!

2. Support Vector Machine — The SVM

What if we could treat misclassifications based on how badly they have been misclassified, imposing stronger penalties on observations that are found deeper in the territory of the other class? The farther a misclassified observation is from the decision boundary, the more wrong it is, and thus it deserves a higher penalty. In fact, we would not mind a margin for error as well: even correctly classified observations that are really close to the boundary (almost misclassified) could contribute to the penalty. Makes sense, right?

Well, this is where the Support Vector Machine (SVM) algorithm comes in, following a different approach to classification. The algorithm still fits a decision boundary like logistic regression, but uses a different loss function called the “hinge loss”, an alternative to cross-entropy for binary classification problems and used for maximum-margin classification. It was primarily developed for use with Support Vector Machine (SVM) models, for binary classification where the targets are -1/1. The function encourages the observations to have the correct sign, assigning more error where there is a sign difference between the true and predicted class labels.

3. How does the SVM classify?

In Figure 1 below, we have two types of example datasets for classification: linearly separable and not linearly separable.

Figure 1: The author has created this image and can be used under fair use policy. Reference is requested.

Let’s start with the linearly separable example (Figure 2) to see how the SVM algorithm works for such classification problems. The algorithm fits a decision boundary defined by the largest margin between the closest points of each class. This is commonly called the maximum margin hyperplane (MMH).

The intuition behind finding the decision boundary is very simple: the algorithm looks for the surface that is maximally far away from any data point of either class. The points that are used by the SVM to fit that boundary are the support vectors.

Figure 2: Two decision boundaries are shown along with their margins in dotted lines. Which one to select? The author has created this image and can be used under fair use policy. Reference is requested.

In Figure 2 above, two decision boundaries are shown (red and green) along with their margins in dotted lines. The green one has a wider margin compared to the red one. The double-arrowed lines are the respective distances between the support vectors and the decision boundaries. The green one is the maximum margin hyperplane.

4. The maximum margin hyperplane

For linearly separable data, we can select two parallel hyperplanes (the dotted lines in Figure 3 below) that separate the two classes of data so that the distance between them is as large as possible. The maximum-margin hyperplane is the hyperplane that lies halfway between them, and the bounded region between them is the margin. With a normalized or standardized dataset, these hyperplanes can be described by simple equations (we choose a normalizing constant such that the distance from the plane to the closest points of either class, the support vectors, will be 1):
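In standard notation (written here as a reminder), those equations are:

𝑤^𝑇⋅𝑋 + 𝑏 = +1 (upper margin boundary)
𝑤^𝑇⋅𝑋 + 𝑏 = −1 (lower margin boundary)
𝑤^𝑇⋅𝑋 + 𝑏 = 0 (the maximum margin hyperplane, halfway between them)

With this scaling, the width of the margin (the distance between the two dotted hyperplanes) is 2/||𝑤||.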

Figure 3: The author has created this image and can be used under fair use policy. Reference is requested.

5. Why maximize the margin?

SVM solves for a decision boundary that should theoretically minimize the generalization error. The observations that are near the decision boundary, separating the classes, are those with the most “ambiguous” label/s. They are the observations that are approaching equal probability to be one class or the other (given the predictors).

The SVM is concerned with generalization to new data. Instead of considering all the observations “equally”, the loss function (hinge loss) that the SVM minimizes defines its fit using the most ambiguous points (see the figure above with green and red planes; the points for green are the most ambiguous ones defining the maximum margin plane). Its decision boundary is safe in that noise and new observations are not likely to cause the SVM to misclassify.

6. SVM origins: the perceptron algorithm

While discussing the SVM, it is very relevant to look at the perceptron algorithm, which is a linear function of weight/s ‘𝑤’ and predictors/features 𝑋 that assigns points to binary classes. If the function returns a value greater than 0, the observation is classified as 1, otherwise it is classified as zero.

Given below, ‘𝑓(𝑋)’ is the perceptron function for classification, 𝑏 (also represented by 𝑤_0 sometimes) is known as the “bias”, which is analogous to the intercept term.
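In its standard form (written here as a reminder), the perceptron decision function is:

𝑓(𝑋) = 𝑤^𝑇⋅𝑋 + 𝑏,  and we predict class 1 if 𝑓(𝑋) > 0, otherwise class 0.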

If the points are linearly separable, then solving the perceptron is guaranteed to converge on a solution, but that solution is not necessarily optimal for future observations. This led to the invention of the SVM, which finds the optimal discriminator; the maximum margin hyperplane.

7. Finding the maximum margin

So far, we have covered all the relevant concepts for the SVM, and a valid question is how we can actually find the widest possible plane. In practice, we maximize the value of the margin: 2/||𝑤||.

So, we optimize the weights, subject to constraints on what the function can output:

  • When the target is 1, the function needs to output 1 or greater.
  • When the target is 0 (mapped to -1), the function needs to output -1 or lower.

Please note, our ‘𝑦’ is specified as -1/1, as opposed to the 0/1 that we are used to; this is convenient for converting the problem into a minimization. To make this a minimization, we can change the two constraints into a single equation:
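In the standard notation (written here as a reminder), that combined constraint is:

𝑦_𝑖⋅(𝑤^𝑇⋅𝑥_𝑖 + 𝑏) ≥ 1   for every observation 𝑖

and maximizing the margin 2/||𝑤|| then becomes minimizing ||𝑤|| (often written as minimizing (1/2)⋅||𝑤||²) subject to this constraint.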

This works out because when 𝑦 = −1, multiplying by 𝑦 flips the sign and the left-hand side becomes positive (see the equation above).

8. The hinge loss and non-linearly separable cases

Consider the case when there is no line/plane that can separate all the observations perfectly; here, we need to introduce the capacity for model error. Starting from the constraint 𝑦⋅(𝑤^𝑇⋅𝑋 + 𝑏) ≥ 1 (also given above), we can introduce the hinge loss function:
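For a target 𝑦_𝑖 ∈ {−1, 1} and classifier output 𝑓(𝑥_𝑖) = 𝑤^𝑇⋅𝑥_𝑖 + 𝑏, the hinge loss for a single observation is commonly written as:

hinge loss = max(0, 1 − 𝑦_𝑖⋅𝑓(𝑥_𝑖))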

where, for correct predictions:

  • 𝑓(𝑥_𝑖)>1: the point lies outside the margin and does not contribute to the loss.
  • 𝑓(𝑥_𝑖)=1: the point is on the margin and does not contribute to the loss.
  • 𝑓(𝑥_𝑖)<1: the point lies inside the margin and does contribute to the loss accordingly.
Figure 4: 0/1 vs Hinge loss. The figure shows how SVM impose penalties using hinge loss. The author has created this image and can be used under fair use policy. Reference is requested.

So, in general, whenever 𝑓(𝑥_𝑖) is NOT ≥1, the function (penalty) will be greater than zero and contribute to the loss accordingly for the respective datapoint/observation.

9. Hinge loss and “slack”

Consider a situation where it is not possible to perfectly separate the classes; here, the hinge loss with a regularization parameter ‘𝐶’ is helpful:
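In a standard soft-margin formulation, each observation gets a slack variable 𝜖_𝑖 ≥ 0 that relaxes its margin constraint, and the total slack is penalized:

𝑦_𝑖⋅(𝑤^𝑇⋅𝑥_𝑖 + 𝑏) ≥ 1 − 𝜖_𝑖,  with 𝜖_𝑖 ≥ 0,  and penalty term 𝐶⋅Σ_𝑖 𝜖_𝑖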

The 𝜖_𝑖 are the errors (slack terms) from the algorithm/classifier, and 𝐶 is a regularization term, which determines how much the classification errors matter (relative to the maximization of the margin).

Now, the function that the SVM minimizes to find the boundary will be:
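In its standard soft-margin form (shown here as a reminder), this is:

minimize over 𝑤, 𝑏, 𝜖:  (1/2)⋅||𝑤||² + 𝐶⋅Σ_𝑖 𝜖_𝑖
subject to:  𝑦_𝑖⋅(𝑤^𝑇⋅𝑥_𝑖 + 𝑏) ≥ 1 − 𝜖_𝑖  and  𝜖_𝑖 ≥ 0 for all 𝑖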

👉 A small value of 𝐶 creates a wider margin because errors will matter less.
👉 A large 𝐶 creates a tighter margin because errors matter more.
👉 An infinite 𝐶 gives a “hard” margin, which always prioritizes minimizing the errors over the size of the margin.

==> We are trying to minimize the norm of the weights, ||𝑤||, and thus maximize the margin, but now we are also trying to minimize our errors at the same time. During minimization, we are looking for a balance between how wide the margin should be and how much error we can tolerate.

10. 𝐶 — The regularizing hyper-parameter

  • By setting the hyper-parameter 𝐶 we can control the extent to which our SVM is tolerant to misclassification errors. It is sometimes called the soft-margin constant.
  • 𝐶 affects the strength of the penalty, similar to the lambda terms in the Ridge and Lasso.
  • 𝐶 multiplies the sum of the errors (the distances from the margin to the points lying inside or beyond it), which allows the SVM to handle non-linearly separable problems by permitting errors to occur.
  • The lower the value of 𝐶, the more misclassification errors are allowed. The error terms 𝜖_𝑖 associated with these points are known as “slack variables”. Reducing the effect of the errors on the loss function puts the emphasis on widening the margin.

If you are interested in exploring the maths, section 7.1.1 (Overlapping class distributions) in Pattern Recognition And Machine Learning by Christopher Bishop would be greatly helpful; Fig 7.3 is particularly worth looking at.

NOTE: In real life, most datasets are not linearly separable, and there will be no way to satisfy all the constraints in the equation to find the maximum margin hyperplane. We can still handle such datasets, while learning useful classifiers, by introducing slack variables. This loosens and/or allows violation of certain constraints, i.e., we allow certain training points to be within the margin (a soft margin); however, we want the number of such points to be as small as possible, and we also want their penetration of the margin to be as small as possible. (We will talk about optimization along the way.)

Figure 5: The figure shows a misclassified point, a margin violation, and a point on the plane boundary, along with their respective values of 𝜖_𝑖. The author has created this image and can be used under fair use policy. Reference is requested.

11. SVM in action — visualize the working

Now that we have enough understanding of the support vector machines algorithm, let’s move on and create interactive plots to see the effect of the C parameter.

We need to import the required libraries and create a dataset first.

# required imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC
from ipywidgets import interact
%matplotlib inline
# Retina display to see better quality images.
%config InlineBackend.figure_format = 'retina'
# Lines below are just to ignore warnings
import warnings
warnings.filterwarnings('ignore')

11.1: Sample data

Let’s create a dataset for binary class classification.
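A minimal sketch of such a dataset, assuming two Gaussian clusters (the centres, spread and seed below are arbitrary choices for illustration, not the exact values from the original notebook):

# create a simple two-class dataset -- centres and spread are arbitrary
np.random.seed(42)
n_points = 50
class_0 = np.random.randn(n_points, 2) + [1, 1]   # class 0 scattered around (1, 1)
class_1 = np.random.randn(n_points, 2) + [4, 4]   # class 1 scattered around (4, 4)
X = np.vstack([class_0, class_1])
y = np.array([0] * n_points + [1] * n_points)
# quick look at the data
plt.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k')
plt.xlabel('feature 1'); plt.ylabel('feature 2')
plt.title('Two-class sample data')
plt.show()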

We have two classes in the dataset — a binary class classification problem.

So, now that we have the data, let’s move on and write a function to create the interactive plots and understand the effect of the C parameter. A good programming practice!

We will be calling this function in the code below. Good to write modules/functions for code reusability!
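A sketch of such a helper is given below (the exact implementation in the original notebook may differ); it draws the data, the decision boundary, the margins and the support vectors of a fitted SVC:

def plot_svc_decision_boundary(model, X, y, ax=None, color='k'):
    """Draw data, decision boundary, margins and support vectors
    of an already-fitted SVC on 2-D data."""
    if ax is None:
        ax = plt.gca()
    # scatter the data points, coloured by class
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k')
    # build a grid covering the plotting area
    xlim, ylim = ax.get_xlim(), ax.get_ylim()
    xx, yy = np.meshgrid(np.linspace(*xlim, 100), np.linspace(*ylim, 100))
    # signed distance of every grid point from the decision boundary
    Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    # boundary (level 0) and margins (levels -1 and +1)
    ax.contour(xx, yy, Z, levels=[-1, 0, 1], colors=color,
               linestyles=['--', '-', '--'])
    # circle the support vectors
    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
               s=150, facecolors='none', edgecolors=color)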

11.2: Linear kernel
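The interactive cell for the linear kernel might look like the sketch below (the slider ranges and the green/red colour choice are assumptions); it fits two linear-kernel SVCs with different C values on the sample data from above and overlays their boundaries using the helper function:

def compare_linear_svc(C1=1.0, C2=100.0):
    """Fit two linear-kernel SVCs with different C values and
    overlay their decision boundaries on the sample data."""
    fig, ax = plt.subplots(figsize=(7, 5))
    for C, color in [(C1, 'green'), (C2, 'red')]:
        model = SVC(kernel='linear', C=C).fit(X, y)
        plot_svc_decision_boundary(model, X, y, ax=ax, color=color)
    ax.set_title(f'Linear kernel: C1={C1:.2f} (green) vs C2={C2:.2f} (red)')
    plt.show()

# sliders for C1 and C2 -- the ranges are arbitrary choices
interact(compare_linear_svc, C1=(0.01, 100.0), C2=(0.01, 100.0))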

Copy the above code into your own notebook, run it, and change the values of C1 and C2 to see how the hyperplane changes. — A good interactive visualization is always helpful for understanding.

12. The “kernel trick” for non-linearly separable data

The “kernel trick” is one of the big reasons for the popularity of SVMs. It works by “wrapping” our features (X) in a kernel function that transforms them into a higher dimensional space, which allows the algorithm to classify non-linearly separable data. (example figure below)

The idea behind the kernel trick is to arbitrarily transform the non-linearly separable observations into a different (higher) dimensional space, where they can have linear separability. We fit the algorithm in this new higher dimensional space and then invert the transformation of the data and the model itself back into the original space.

In summary, data may be linearly separable in a higher dimensional space, and using the kernel trick, classifiers can be learnt for high dimensional feature spaces without actually having to map the points into the high dimensional space.
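Formally, a kernel function computes the inner product of two observations in the higher dimensional space without ever computing the mapping 𝜙 explicitly; the two kernels used below are the standard ones (written here in the scikit-learn parameterization):

𝐾(𝑥, 𝑧) = 𝜙(𝑥)^𝑇⋅𝜙(𝑧)
Polynomial kernel: 𝐾(𝑥, 𝑧) = (𝛾⋅𝑥^𝑇⋅𝑧 + 𝑟)^𝑑
RBF (Gaussian) kernel: 𝐾(𝑥, 𝑧) = exp(−𝛾⋅||𝑥 − 𝑧||²)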

The figure below should give the general intuition for what is happening in the kernel-trick.

A kernel Trick! The author has created this image and can be used under fair use policy. Reference is requested.

12.1: Polynomial kernel
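A possible version of this interactive cell (reusing the sample data and the helper function sketched above; the slider ranges are assumptions):

def poly_svc(C=1.0, poly_degree=3):
    """Fit an SVC with a polynomial kernel and draw its boundary."""
    model = SVC(kernel='poly', degree=poly_degree, C=C).fit(X, y)
    plot_svc_decision_boundary(model, X, y)
    plt.title(f'Polynomial kernel: C={C:.2f}, degree={poly_degree}')
    plt.show()

# sliders for C and the polynomial degree -- ranges are arbitrary
interact(poly_svc, C=(0.01, 100.0), poly_degree=(1, 10))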

Copy the above code into your own notebook, run it, and change the values of the C & poly_degree slide bars to see how the hyperplane changes. — A good interactive visualization is always helpful for understanding.

12.2: RBF kernel
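A possible version of the RBF-kernel cell, in the same style (again, the slider ranges are assumptions):

def rbf_svc(C=1.0, gamma=1.0):
    """Fit an SVC with an RBF (Gaussian) kernel and draw its boundary."""
    model = SVC(kernel='rbf', gamma=gamma, C=C).fit(X, y)
    plot_svc_decision_boundary(model, X, y)
    plt.title(f'RBF kernel: C={C:.2f}, gamma={gamma:.2f}')
    plt.show()

# sliders for C and gamma -- ranges are arbitrary choices
interact(rbf_svc, C=(0.01, 100.0), gamma=(0.01, 10.0))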

Copy the above code into your own notebook, run it, and change the values of the C & gamma slide bars to see how the hyperplane changes. — A good interactive visualization is always helpful for understanding.

Takeaway: Bias/variance trade-off is handled via the hyper-parameter 𝐶

  • Large 𝐶 → narrow margin, less tolerant of misclassification, tends toward high variance
  • Small 𝐶 → wider margin, more tolerant of misclassification, tends toward high bias

13. Advantages and disadvantages

SVMs are widely used algorithms in business settings, and they have some advantages and disadvantages.

Advantages:

  • Robust to outliers
  • Effective in high dimensional data
  • Can work with non-linearities
  • Fast to compute even on non-linear problems (kernel trick)
  • Low risk of over-fitting
  • Exceptional performance

Disadvantages:

  • Black-box
  • Can be slow on large datasets

14. SVM vs Logistic Regression — when to use

Recall the “no free lunch theorem”: when to use which algorithm is a tricky question; however, according to expert advice:

  • If there are more features than training samples: Use logistic regression or SVM without a kernel (linear kernel)
  • If there are about 10 times as many samples as features: Use SVM with a Gaussian kernel
  • If there are many more training samples than features: Spend time feature engineering, then use logistic regression or SVM without a kernel (linear kernel)

15. New terms

15.1: Slack Variables:

In optimization, particularly in linear programming problems, slack variables are introduced for inequality constraints so that they can be transformed into equalities. Each inequality introduces one slack variable, e.g.:
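For instance (a standard textbook example), the inequality 𝑥_1 + 2⋅𝑥_2 ≤ 𝑏 becomes an equality by adding a non-negative slack variable 𝑠:

𝑥_1 + 2⋅𝑥_2 + 𝑠 = 𝑏,   𝑠 ≥ 0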

15.2: Norm:

A norm is a function that takes a vector from our vector space (real or complex) as an input and returns a non-negative real number telling how big that vector is, e.g. the distance traveled. A good read
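For instance, the Euclidean (L2) norm used in ||𝑤|| above is:

||𝑤||_2 = √(𝑤_1² + 𝑤_2² + … + 𝑤_𝑛²)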

16. Additional resources

🧘🏻‍♂️ Nice lectures slides by Prof. Andrew Zisserman, Oxford University, UK.

🧘🏻‍♂️ A great resource from Stanford University on Support vector machines and machine learning on documents

🧘🏻‍♂️ Nice blog posts series on SVMs:

🧘🏻‍♂️ SVMs Tutorial slides by Jason Weston, NEC Labs America

🧘🏻‍♂️ Lecture slides by R. Berwick from MIT: An Idiot’s guide to Support vector machines (SVMs)

🧘🏻‍♂️ Support Vector machines lecture notes from University of Toronto, Canada.

🧘🏻‍♂️ A paper by Prof. Thorsten Joachims on Text Categorization with Support Vector Machines: Learning with Many Relevant Features

🧘🏻‍♂️ An advanced discussion of SVMs as probabilistic models

🧘🏻‍♂️ A Tutorial on Support Vector Machines for Pattern Recognition

🧘🏻‍♂️ on scikit-learn:

🧘🏻‍♂️ Some nice video lectures:

🧘🏻‍♂️ Some links on loss functions to refresh:

Image Source (Fair Use): “What are the main reasons not to use MSE as a cost function for Logistic Regression?”

So, this was all about Support Vector Machines (SVMs) for the moment!

*******************************************************************

💐Click here to FOLLOW ME for new content💐

🌹Keep practicing to brush-up & add new skills🌹

✅🌹💐💐💐🌹✅ Please clap and share >> you can help us to reach to someone who is struggling to learn these concepts.✅🌹💐💐💐🌹✅

Good luck!

See you in the next lecture on “A43: Support Vector Machines (SVMs) — Hands-on [complete project with code]”.

Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:

**************************************************************************************************************************************

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.

**************************************************************************************************************************************
