A43: Support Vector Machines (SVMs) — Hands-on [complete project with code]

Feature selection, chi-square, ANOVA, Grid-search, Random Search, feature scaling, ROC curve, Predicting Breast Cancer

Junaid Qazi, PhD
20 min read · Mar 13, 2022

This article is a part of “Data Science from Scratch — Can I to I Can”, A Lecture Notes Book Series. (click here to get your copy today!)

Click here for the previous article/lecture on “A42: Support Vector Machines (SVMs) [Behind The Scene!]”

💐Click here to FOLLOW ME for new contents💐

⚠️ We will be working with the Breast Cancer Wisconsin (Diagnostic) dataset for learning purposes in this lecture.

✅ A Suggestion: Open a new jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful…..!”

🧘🏻‍♂️ 👉🎯 >> Stay calm and focused! >> 🧘🏻‍♂️ 👉🎯

Support Vector Machines (SVMs) — Hands-on

  1. The dataset
  2. Exploratory Data Analysis (EDA)
  3. Feature Selection
    3.1: chi2
    3.2: ANOVA (Analysis of Variance) F-value
    3.3: Simple pairwise correlation
    3.4: Correlation heatmap of selected features
  4. Machine Learning
    4.1: Support Vector Classifier — Importing and training
    4.2: Predictions and Evaluation
    4.3: Grid-Search
    4.4: Predictions and Evaluation — GridSearch
    4.5: Feature Scaling
    4.6: Model re-training and evaluation using scaled features
  5. ROC Curve — Final model
  6. Saving the model
  7. To Do

1. The dataset

After a comprehensive overview of the working principles of SVMs, it's time to learn by doing. So, welcome to the hands-on section. In this lecture, we will be working with a real dataset, Breast Cancer Wisconsin (Diagnostic). This dataset is available on Kaggle and originally belongs to the UCI Machine Learning Repository. If you want to know more about the dataset, please click here for the relevant papers and a detailed description.

At this stage of the course, I am sure that you guys are feeling very comfortable with writing functions, doing data analysis and training machine learning algorithms. For this project, we are given a data file (.csv) and another file (.txt) which contains the feature names. This is a common practice when you have a large number of features in your data. We can write a custom function to read the features from the given .txt file, read the data from the .csv file, and name the data columns accordingly.

First thing first, let’s import the required libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(font_scale=1.3) # setting font size for the whole notebook
sns.set_style("white") # if you want to set the style
# Setting display format to retina in matplotlib to see better quality images.
from IPython.display import set_matplotlib_formats
set_matplotlib_formats('retina')
# Lines below are just to ignore warnings
import warnings
warnings.filterwarnings('ignore')

Having said that we have a .txt file containing the list of features, let's write a function that can read the header (the column names) for our dataset from a separate file. In our case, the header is a list of feature names in a .txt file.
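
The notebook's exact implementation isn't reproduced here; below is a minimal sketch of such a function. It assumes the .txt file holds the names either one per line or as a single comma-separated line.

def read_header(file_link):
    # Read feature names from a .txt file (local path or URL).
    # pd.read_csv handles both one-name-per-line (one column, many rows)
    # and a comma-separated line (one row, many columns); flatten covers both.
    names = pd.read_csv(file_link, header=None).values.flatten().tolist()
    return [str(name).strip() for name in names]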

Let's read the data and the feature names from GitHub!

# Data and feature names links on git
data_link='''https://raw.githubusercontent.com/junaidqazi/DataSets_Practice_ScienceAcademy/master/Breast_Cancer_data_no_feature_names.csv'''
feature_names_link='''https://raw.githubusercontent.com/junaidqazi/DataSets_Practice_ScienceAcademy/master/features_names_Breast_cancer.txt'''
# Loading header file
cols = read_header(feature_names_link)
# passing column names from cols
df = pd.read_csv(data_link, names = cols)
# You can check the head and info (try yourself), I am avoiding it here because we have 32 columns and the output would be large.
# df.head()
# df.info()

From info, you will notice that the dataset has 32 columns and 569 datapoints. There is no missing data!

It’s a good idea to get summary statistics using describe().
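
Something like this (transposing makes the 30+ rows of output easier to scan):

df.describe().T # summary statistics, transposed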

From the above summary statistics, we notice that area_mean has a mean of ~654, a min of ~143 and a max of 2501!

We can look at the std, max, min and mean values of the other features to get some idea of their distributions! (Recall your high school statistics)

==> Note: Instead of using all the features, we could restrict this project to the ten real-valued mean features, which are commonly used in diagnostic studies. We are not going to do this, but if you want to, here is the code to separate them:

df = df[['radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean', 'concavity_mean', 'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean','diagnosis']]

2. Exploratory Data Analysis (EDA)

Well, we know the importance of exploratory data analysis.

Starting with value_counts(), let's see how many instances/datapoints/observations we have in each target class.
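
For example:

df['diagnosis'].value_counts() # counts per class: B and M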

We have two classes (B for benign tumor and M for malignant tumor, which is cancer).

Let’s add a new column target (0/1, i.e. cancer no/yes, instead of B/M). This 0/1 column will also be helpful when plotting the ROC curve later.
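
One way to create it (a sketch; the original notebook may do this differently):

df['target'] = df['diagnosis'].map({'B': 0, 'M': 1}) # 1 = malignant (cancer yes)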

It might be a good idea to see how the mean and median smoothness for the two types of tumor vary!

==> Can we use groupby() for the above task/s? (TRY!)
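
A possible one-liner for it:

df.groupby('diagnosis')['smoothness_mean'].agg(['mean', 'median'])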

<<Smoothness>>

The numbers for each target class look different, and visualizations are always helpful. Let's get plots for smoothness and put the mean and median values on the plots as well!
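
The plotting code isn't reproduced here; the sketch below, with a hypothetical helper plot_feature, produces comparable plots.

def plot_feature(feature):
    # histograms of the feature for each tumor type,
    # with vertical lines for the mean (red) and median (green)
    fig, axes = plt.subplots(1, 2, figsize=(16, 5))
    for ax, kind in zip(axes, ['M', 'B']):
        data = df[df['diagnosis'] == kind][feature]
        ax.hist(data, bins=30)
        ax.axvline(data.mean(), color='red', label='mean = %.4f' % data.mean())
        ax.axvline(data.median(), color='green', label='median = %.4f' % data.median())
        ax.set_title('Type %s' % kind)
        ax.set_xlabel(feature)
        ax.legend()
    plt.show()

plot_feature('smoothness_mean')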

Notice, the scales along the y-axis are different for each plot. The mean (red line) for type M is much higher than for type B. The same is true for the median values (green lines)!

From the above plots, we can see that:

  • type "M" tumors have higher smoothness.
  • red lines show the mean smoothness, whereas the green lines show the medians.
  • for type M, a few points have significantly larger values: potential outliers.

Another thing: it is a good idea to look at the standard deviations (std) as well; we can find the std values for each measure in the summary statistics (see describe() above).

<<Compactness>>

Moving forward, let’s see how the mean and median compactness for the two types of tumor differ.

Let’s copy the above code to create plots for compactness and put the mean and median values in the legends.

(Try yourself >> suggestion, write a function to get the plot and re-use instead of writing complete code again)
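
With the plot_feature helper sketched earlier, this becomes a one-liner:

plot_feature('compactness_mean')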

Notice the y scale: this time it is the same for both plots. Similar to the smoothness, the compactness also differs between the two types, and we can see that type "M" has a higher mean compactness than type "B"!

Well, you can try several other plots to understand your data. Spend some time and learn more.

Before we move forward, it might be a good idea to drop all the error columns from our dataframe. They may not be very helpful (you can explore them later and see their impact). Let's get a list of the columns we want to keep: we can loop over the column names and select those that do not contain the word 'error', right?
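
A sketch of that loop, assuming the error columns contain the word 'error' in their names:

cols_ = [col for col in df.columns if 'error' not in col] # keep only the non-error columns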

So, we have a list of columns, cols_, that does not include the error columns. We can separate them now.

df = df[cols_] # separating columns we want to work with

3. Feature Selection

It’s time to explore a little more than what we have been doing in the previous lectures. Feature selection is one of the important steps, especially when we are working with a large number of features in our dataset. Let me introduce you to some common ways to select features based on statistical measures.

Before we move on, we need to do some imports: SelectKBest(), which returns the requested number of top features based on a chosen statistic such as chi2 or the ANOVA F-value.
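
For example:

from sklearn.feature_selection import SelectKBest, chi2, f_classif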

Please refresh your basic knowledge of statistics from any fundamental book on the subject; Wikipedia is also a great source for this purpose.

3.1: chi2

A quick review!

The Chi-Square statistic is widely used for testing relationships between categorical variables/features. In the chi-square test, the null hypothesis states that there is no relationship between the categorical variables in the population, meaning they are independent.
Using Chi-Square statistics, we can ask a question such as:

  • Is there a significant relationship between voter intent and political party membership?

In scikit-learn, chi2 computes Chi-Squared statistics between each non-negative feature and class.

Want to refresh? Here is a good link on Chi-square statistics.
This link is very helpful to brush up your understanding of hypothesis testing.

Let’s move on and get the top 10 features using SelectKBest() with chi2 statistics.
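
A sketch of that step (the exact columns to drop, like the ID column, depend on your dataframe; the variable names features and target reappear later in this article):

features = df.drop(['id', 'diagnosis', 'target'], axis=1) # inputs only; adjust names to your dataframe
target = df['target'] # the 0/1 labels created earlier
# chi2 requires non-negative feature values, which holds for this dataset
features_selected_chi = SelectKBest(score_func=chi2, k=10).fit(features, target)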

So, features_selected_chi is an object, somewhere in memory at some 0x.......... location. We can call get_support() on this object to get a boolean mask for our k best features, based on chi-square statistics in this case.

We can use k_best_feature_mask to get the names of the best features. In our case, we get 10 features because we passed k = 10. (Recall boolean masking from the Python essentials section.)
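
Something along these lines (the list name chi2_features is my own):

k_best_feature_mask = features_selected_chi.get_support() # boolean mask over the columns
chi2_features = features.columns[k_best_feature_mask].tolist()
print(chi2_features)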

3.2: ANOVA (Analysis of Variance) F-value

A quick review!

ANOVA is a tool based on a collection of statistical models and estimation procedures related to those models. Typically, it is used to split the observed aggregate variability found in a dataset into two parts: systematic factors and random factors.

Systematic factors have a statistical influence on the given dataset, whereas Random factors don’t. One can use ANOVA to explore the influence of independent variables on the dependent variable.

The F-value in one-way ANOVA helps to assess whether the expected values of a quantitative variable differ between several pre-defined groups.

In a typical example, the ANOVA F-value (F-test) can be used to determine whether any of several treatments is, on average, superior or inferior to the others (rejecting the null hypothesis). When comparing medical trials of four treatments, the null hypothesis is that all four treatments yield the same mean response.

Another question that we can answer; “Is the variance between the means of two populations significantly different?”

In scikit-learn f_classif computes the ANOVA F-value for the provided sample.

Need a refresher on ANOVA? This link could be a good read, and this one is good as well.

Now, let’s move on and select the 10 best features based on the ANOVA F-value. To do so, we only need to change the score_func parameter to f_classif in the SelectKBest module.

f_classif is the default selection in the SelectKBest() module and only works for classification tasks. (Press <shift+tab> to confirm from the docstring; things sometimes change in newer versions.)
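
The chi2 sketch from above, with the score function swapped:

features_selected_anova = SelectKBest(score_func=f_classif, k=10).fit(features, target)
anova_features = features.columns[features_selected_anova.get_support()].tolist()
print(anova_features)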

3.3: Simple pairwise correlation

This is one of the most common measures, and it can be computed simply using the pandas .corr() method (compute pairwise correlation of columns, excluding NA/null values). The default method of calculation is the well-known "Pearson correlation".

Here is a good link to refresh Pearson correlation.
This one provides an overview of Pearson, Kendall and Spearman.

Let’s write a single line of code to do the following steps:

  • dropping the ID and diagnosis columns
  • computing the correlation — default is Pearson — <shift+tab> for the docstring to explore
  • grabbing the target column to see how the variables are correlated with the target
  • sorting the values
  • grabbing the top 10 using a slicing operation on the index (recall your Python essentials section)
  • converting into a list (using tolist() here instead of passing to list() — good to know different ways)
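
A sketch of that one-liner (assuming the ID column is named 'id'; the slice starts at 1 because position 0 is target itself, with correlation 1):

corr_features = df.drop(['id', 'diagnosis'], axis=1).corr()['target'].sort_values(ascending=False)[1:11].index.tolist()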

I suggest you break it into the individual steps and understand the process; it's good for your learning.

Well, if we look at the top 10 features from the above three statistical measures, they return, more or less, similar features in different orders. What I am going to do is grab the union of all three in a single list.

Are you thinking about writing a function for the union operation? Well, you can do so for your practice, but I have a cleverer way of doing this at the moment! (Recall your skills from Python Essentials.) We can concatenate all three lists of features using the + operator, pass the result to set() and then to list() for this purpose! Right?
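
Using the (my own) list names from the sketches above:

selected_features = list(set(chi2_features + anova_features + corr_features)) # union, duplicates removed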

3.4: Correlation heatmap of selected features

What do you think, is it a good idea to get a correlation heatmap for the selected_features along with target now? Let's see how it looks!

We don’t have target in selected_features; it's a good idea to check this the Python way (you could have a long list of features, and it is not easy to read them all!). Once again, your skills from the previous lectures are useful!

'target' in selected_features # should return True if target is in the list!

So, we need to append target to selected_features and re-run the above line to re-confirm!
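
For example:

selected_features.append('target')
'target' in selected_features # now returns True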

All done for the heatmap; let's use seaborn here!
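
A minimal sketch (the figure size and colormap are my own choices):

plt.figure(figsize=(16, 10))
sns.heatmap(df[selected_features].corr(), annot=True, cmap='coolwarm')
plt.show()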

The above heatmap looks nice. We have lots of information in it. Along the diagonal, all values are 1, because the correlation of any feature with itself is 1. The heatmap shows how the selected features are correlated to each other and to the target.

(OPTIONAL) We can use pairplot from seaborn to get the overall picture; it gives similar information (as in the heatmap) in different types of plots (scatter and kde). Here is the code for a pairplot of the selected features, in case you want to try: sns.pairplot(df[selected_features], hue = 'target')

EDA is important, and we explore the data in different ways to understand all possible aspects; try to understand the data as much as possible yourself. At the moment, we are moving forward to train our SVM classifier.

4. Machine learning modelling

In the feature selection section, we identified the top 10 features using three different statistical measures. However, I am going to use all the real-valued features to train our machine learning model. You can try a subset based on chi2 or ANOVA and see the difference.

We have already separated the features and the target into variables with the same names (features and target). Instead of the typical X and y, we can use them here.

Train Test Split: I am sure this is at your fingertips now!

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.33, random_state=101)

4.1: Support Vector Classifier — Importing and training

A few steps that you can lead me through now!

  • Importing classifier,
  • creating instance,
  • train (fit) the model on training data (X_train, y_train)
  • do the predictions for test data (X_test, y_test)
  • evaluation, right?

Let’s do this step by step. Note, SVC (the support vector classifier) lives in the svm module!
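
Importing and training, a sketch with default settings:

from sklearn.svm import SVC
svm_model = SVC() # default hyper-parameters for now
svm_model.fit(X_train, y_train)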

If you explore the documentation of SVC, you will see a number of parameters that we have learned about, such as C, degree, gamma, kernel, etc. We will come back and look at them in a while, when we tune the hyper-parameters!

4.2: Predictions and Evaluation

# Guess what, time to do the predictions!
svm_pred = svm_model.predict(X_test)

Required imports. Once you get familiar with the process, a good idea is to do all imports at the beginning!

from sklearn.metrics import classification_report,confusion_matrix

The presentation could always be improved, and that's a good thing. Let's try this for the confusion matrix.
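
For example (the labelled dataframe is just one option for a nicer presentation):

print(confusion_matrix(y_test, svm_pred))
print(classification_report(y_test, svm_pred))
# confusion matrix as a labelled dataframe
pd.DataFrame(confusion_matrix(y_test, svm_pred),
             index=['actual 0', 'actual 1'], columns=['predicted 0', 'predicted 1'])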

So, the above results are not good: our model is predicting everything as one class and is not able to predict the other! SVM should do much better, even better than logistic regression and KNN in most cases.
Let's try Grid-Search and see if we can get an improved model.

Now, another thing we need to explore is finding the best values of the C and gamma parameters. We have seen their effect on a random dataset in the theory lecture. We need to run a grid search.

4.3: Grid-Search

Recall from the previous lectures: we did both Randomized-Search and Grid-Search with Random Forests. I am going to do Grid-Search here; you can try Randomized-Search yourself, it should be very simple now!

It is tricky but very important to find the right parameters so that the model works to its full potential. Grid-Search is one of the common ways: create a "grid" of parameters and try all the possible combinations to see which one works best. Scikit-learn has a built-in capability to implement Grid-Search with GridSearchCV. We have already seen it; it's great and simple, right :)

Important members of GridSearchCV are fit and predict. GridSearchCV takes a model instance and a grid of parameters, defined as a dictionary: the keys are the parameter names and the values are the settings to be tested.

We have already discussed most of the parameters and their importance in the SVMs theory lecture. A quick overview of C and gamma:

>> The C parameter trades off misclassification of training examples against simplicity of the decision surface. A low C makes the decision surface smooth, while a high C aims at classifying all training examples correctly by giving the model freedom to select more samples as support vectors.

>> The gamma parameter defines how far the influence of a single training example reaches, with low values meaning ‘far’ and high values meaning ‘close’. The gamma parameters can be seen as the inverse of the radius of influence of samples selected by the model as support vectors.

The behavior of the model is very sensitive to the gamma parameter. If gamma is too large, the radius of the area of influence of the support vectors only includes the support vector itself and no amount of regularization with C will be able to prevent overfitting.

When gamma is very small, the model is too constrained and cannot capture the complexity or “shape” of the data. The region of influence of any selected support vector would include the whole training set. The resulting model will behave similarly to a linear model with a set of hyperplanes that separate the centers of high density of any pair of two classes.

Want to know more about the C and gamma parameters? See the official documentation at scikit-learn, with examples.

==> Please go through the recommended reading to understand the mathematical concepts behind these parameters.
If you are interested in the more practical use of Support Vector Machines (SVMs), the take-away message is that the C and gamma parameters can be tuned using Grid-Search.

Ok, we need to find the best values of C and gamma. Let's create a param_grid to run the Grid-Search; along with C and gamma, I am going to add kernel as well.
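
A typical grid (the exact values searched in the original notebook may differ):

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf', 'linear']}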

Let’s create an instance and pass the following parameters:

  • estimator : estimator object SVC() in our case
  • param_grid : dictionary or list of dictionaries. Dictionary with parameters names (string) as keys and lists of parameter settings to try as values, or a list of such dictionaries, in which case the grids spanned by each dictionary in the list are explored. This enables searching over any sequence of parameter settings.
  • verbose : integer — controls the verbosity: the higher, the more messages. If you don't pass any value, you will not see any messages. Just to see whether the Grid-Search process is working, it's always a good idea to get some output, hence it is good to pass some small number! Depending upon the number of parameters and their values, Grid-Search can take a long time.
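
Putting these together (probability=True is an extra setting so that we can call predict_proba for the ROC curve later):

from sklearn.model_selection import GridSearchCV
# probability=True enables predict_proba, needed for the ROC curve in section 5
grid = GridSearchCV(estimator=SVC(probability=True), param_grid=param_grid, verbose=1)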

==> Another important thing to know about GridSearchCV: it is a meta-estimator. It takes an estimator like SVC() and creates a new estimator that behaves exactly the same; in this case, grid will behave the same as an SVC() classifier.

Just a recap on grid search:

  • As grid behaves the same as SVC(), just like with any other model, let's call fit on grid and pass in the training data.

The fit in this case does a little more than the usual fit of any other model. First, fit finds the best parameter combination by running the same loop with cross-validation. After getting the best parameters, it runs fit again on all the data passed to it (this time without cross-validation) and builds a single new model using the best parameter setting.
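
In code:

grid.fit(X_train, y_train) # cross-validated search, then a final re-fit with the best parameters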

Note: The default cross validation, cv, is 5 (5 folds). I have used verbose=1 and did not get much text in the output, try 2 or 3 and see the difference! I actually did not want to populate the entire notebook!

With the grid search done, we have the best combination of parameters in the object grid; let's see what that combination is, and the best score.

We can always explore other attributes of our trained model in grid. Let's see what the best estimator looks like.
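
The relevant attributes:

print(grid.best_params_) # the winning parameter combination
print(grid.best_score_) # its mean cross-validated score
grid.best_estimator_ # the single re-fitted model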

4.4: Predictions and Evaluation — GridSearch

We do the predictions in the same way as other models!

grid_pred = grid.predict(X_test)

And now the confusion matrix and the classification report!
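
As before:

print(confusion_matrix(y_test, grid_pred))
print(classification_report(y_test, grid_pred))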

So, the Grid-Search was very helpful, the model has shown improvements.

  • What else can we do? Can we further improve the predictions?
  • Can we reduce the computation time for Grid-Search?

4.5: Feature Scaling

We can look at the summary statistics of our data once again and see how different the scales of our features are. The quickest way is to grab the mean, min and max values of all the features and see how their ranges vary.

Let’s use describe() on the features dataframe along with a transpose, and grab the required columns for the plot below showing the max value of each feature. (Notice the log scale along the y-axis.)
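
A sketch of such a plot:

features.describe().T['max'].plot(kind='bar', figsize=(18, 5), logy=True)
plt.ylabel('max value (log scale)')
plt.show()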

From the above bar plot, we can see that there is significant variation in the ranges of the features. Some values are significantly larger than others. We know the importance of feature scaling and have seen the improvements in the KNN lecture.

Let’s get the scaled features and re-train our SVM model. (Code reference: KNN lecture)

I am going to put a couple of steps in a single cell; that should be easier for you at this stage.
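
A sketch, mirroring the KNN lecture (StandardScaler standardizes each feature to zero mean and unit variance):

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(features)
scaled_features = scaler.transform(features) # standardized copy of the features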

4.6: Model re-training and evaluation using scaled features

Let’s split the data, train the model, get the predictions and print the confusion matrix, all in one cell of code.

X_train, X_test, y_train, y_test = train_test_split(scaled_features, target, test_size=0.33, random_state=101)
grid.fit(X_train, y_train) # re-running the grid search on the scaled training data
grid_pred = grid.predict(X_test)
print(confusion_matrix(y_test, grid_pred)); print(classification_report(y_test, grid_pred))

Excellent, the performance of our model has further improved. Another thing you want to notice: the computation time for the Grid-Search is much less for scaled features as compared to the unscaled ones. Think, what if you were working with a large number of features and with 10 or even 100 times more data points? You want to save time!

A rule of thumb: scale the features regardless of which model you are working with. If scaling has no effect on the model, you will get the same results; if the model is sensitive to scaling, you will get improved results. In both cases, you will significantly reduce the computation time.

5. ROC Curve — Final model

I hope you can understand the code below at this stage!

So, it looks like the scaled features are giving us the best results. This is our final model at the moment. Let's plot the ROC curve and later on save the final model for use.

# Required imports from scikit-learn
from sklearn.metrics import roc_curve, roc_auc_score
# Area Under the ROC Curve
grid_auc = roc_auc_score(y_test, grid.predict_proba(X_test)[:,1])
# setting the figure size
plt.figure(figsize = (18, 6))
# Computing Receiver operating characteristic (ROC)
fpr_grid, tpr_grid, thresholds_grid = roc_curve(y_test, grid.predict_proba(X_test)[:,1])
# plot no skill -- a line for random guess
plt.plot([0, 1], [0, 1], linestyle='--', label = 'Random guess')

# plotting ROC Curve for the skilled svm model
plt.plot(fpr_grid, tpr_grid, marker='.', label = 'ROC AUC - Grid-Search SVC: %.3f' % grid_auc)
# good to put title and labels
plt.title('SVM results after Grid-Search on scaled features')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')

# putting the legends
plt.legend();

The ROC curve looks great; the area under the curve is 0.986, so the model is skilled!

During this SVM project, we have discussed lots of important concepts along with ways to improve the performance of our model and to make the computation efficient. You can imagine the importance of Grid-Search; you may not be able to get an idea of the best values of your parameters (e.g. C and gamma here) without the Grid-Search process. Keep in mind that Grid-Search can take a long time, especially for large datasets and a big bunch of parameters. It also depends upon your computer.

The best way in real-life data projects is to set things up after cleaning your dataset. Do the Grid-Search on a small grid first (e.g. 1, 2 or 3 values per parameter) to make sure that everything is working correctly. Once you see things working, run it on the full set of parameters with as many values as you want in the list for each parameter. This way, you will save time: you know it worked on the small set, so it will most likely work on the full set of parameters as well. In the meantime, you can do something different!

6: Saving the model

Same stuff, you can definitely save and load the model now!

Notice the difference: we are saving the grid-searched model here!
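
A sketch using joblib (the file name here is hypothetical):

import joblib
joblib.dump(grid, 'svm_breast_cancer_grid.pkl') # saving the grid-searched model
loaded_model = joblib.load('svm_breast_cancer_grid.pkl')
loaded_model.score(X_test, y_test) # quick sanity check on the reloaded model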

Well, rather than saving the final model from the grid-searched object, the more appropriate way is to get the set of best parameters and save them (e.g. as a text/csv file), then use the saved set of optimized parameters to train your final model on the full dataset. Once trained, save it for deployment purposes.

All done at the moment for this notebook, can you think about improving your model further?

7: To Do

  • Use subset of features that we have identified using chi2 and/or ANOVA and train your model. Compare your results.
  • Repeat your modelling with Recursive Feature Elimination (RFE) from scikit-learn. It's super easy to implement; RFE selects features by recursively considering smaller and smaller sets of features (see the sketch after this list).
  • First, the model is trained on the initial set of features and the importance of each feature is obtained either through a coef_ attribute or through a feature_importances_ attribute. Then, the least important features are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
  • Repeat with all of the algorithms that you have learned in the previous lectures; after finding the best subset of features in different ways, can you improve the performance of your final trained models?
  • Remember, the goal is to finalize a well-generalized and skillful model that can efficiently predict on unseen data.
  • Re-train all the models that you have learned in the previous lectures using scaled features, create a summary report on your learning from this practice.
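
As promised in the RFE bullet above, here is a minimal sketch (reusing variable names from this notebook; a linear-kernel SVC is used because RFE needs an estimator that exposes coef_ or feature_importances_):

from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rfe = RFE(estimator=SVC(kernel='linear'), n_features_to_select=10)
rfe.fit(scaled_features, target)
rfe_features = features.columns[rfe.get_support()].tolist() # names of the surviving features
print(rfe_features)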

Now you can see where we spend most of our time: in the end, it's not about writing lots and lots of code, it's a matter of finding the best available option!

Good luck and Keep practicing!

*******************************************************************

💐Click here to FOLLOW ME for new contents💐

🌹Keep practicing to brush-up & add new skills🌹

✅🌹💐💐💐🌹✅ Please clap and share >> you can help us to reach to someone who is struggling to learn these concepts.✅🌹💐💐💐🌹✅

Good luck!

See you in the next lecture on “A44: Support Vector Machines (SVMs) vs Logistic Regression — Practice & Comparisons [complete project with code]”.

Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:

**************************************************************************************************************************************

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.

**************************************************************************************************************************************


Junaid Qazi, PhD

We offer professional development, corporate training, consulting, curriculum and content development in Data Science, Machine Learning and Blockchain.