A37: Importance of feature scaling in KNN — hands-on implementation
K-Nearest Neighbors, feature scaling/standardization, the elbow method, and coded data.
This article is a part of “Data Science from Scratch — Can I to I Can”, A Lecture Notes Book Series. (click here to get your copy today!)
⚠️ We will be using a synthetically created, coded dataset in this project.
✅ A suggestion: Open a new jupyter notebook and type the code while reading this article. Doing is learning, and yes, “PLEASE read the comments, they are very useful!”
🧘🏻♂️ 👉🎯
Hello guys,
Welcome back to this lecture notes series on Data Science from Scratch. In the previous lectures, we learned how KNN works. KNN mainly stores all the data and works by computing distances, right? Now, what could be the effect on those distance computations if the features are on very different scales? Well, let’s explore this and start by creating a scenario for our project.
So the project is >> We have been contacted by a client whose data is highly confidential. For privacy reasons, our client does not want to disclose the names of the features, and they are already coded for confidentiality (the client’s own preference). The data is growing, and our client wants to develop a machine learning algorithm that can automate their key business process and support their decisions.
So, we are given the task of using the coded data to develop a machine learning model. 'Result' is the target column in the data, and it is a classification problem. We want to use the KNN algorithm. Great, let’s see how it works.
First things first, let’s import the required libraries!
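A minimal sketch of the imports used throughout this notebook (add any others you prefer):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter magic to show plots inside the notebook
%matplotlib inline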
In this article, we will mainly explore the effects of feature scaling in KNN. We will also learn about methods to find the best value of k, the number of neighbors. This article is organized around the following points!
1. The dataset and exploratory data analysis
2. Baseline accuracy
3. Model training on unscaled data
>>3.1: Predictions and evaluations — unscaled data
4. Effect of feature scaling on KNN
>>4.1: Saving scaling transformation
>>4.2: OPTIONAL — DataFrame for scaled features
5. Model training using scaled features
>>5.1: Predictions and evaluations — scaled features
6. Elbow method to choose the k value
>>6.1: Plotting accuracy — alternative way to find k
7. Saving and loading the trained model — Same old story
8. ROC curve — model comparisons
*******************************************************************
1. The Dataset and exploratory data analysis
Let’s read the dataset and do some exploratory data analysis.
Notice the Unnamed: 0 column; it is an index stored in the given data file, and we know this from the data description. Let's pass index_col=0 while reading the data.
A recall >> pandas automatically creates an index if one is not provided while reading the data. set_index() is a useful function if we want to set any column as the index after reading the data.
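A minimal sketch of the reading step; the file name coded_data.csv is only a placeholder for the client’s file, so use your own path:
# reading the coded data; the stored index goes into index_col=0
df = pd.read_csv('coded_data.csv', index_col=0)
# quick look at the first few rows
df.head()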
Looks good now; let’s have a quick overview of the data using info().
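For example:
# column dtypes, non-null counts and memory usage in one shot
df.info()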
There is no missing data, and all features are numeric. We can call describe() on our dataframe to get a statistical summary!
From the above summary statistics, we can see that the features (Cd_1, Cd_5, Cd_6, Cd_9, ...) are on very different scales. The columns are coded, and we don't even know what they represent!
Let’s grab the means and standard deviations (std) of all the features and plot them to see how they look. We can call transpose() on describe() and grab the mean and standard deviation only.
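For example, keeping only the two columns we care about (the name stats_summary is my choice):
# transpose describe() and keep the mean and std columns only
stats_summary = df.describe().T[['mean', 'std']]
stats_summary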
Visualizations are easier on the eye; we can get a bar plot for the above statistical measures. I am going to use a log scale along the y-axis for the means.
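A quick sketch using pandas plotting (the figure size and titles are arbitrary choices):
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# means on a log scale along y, standard deviations on a linear scale
stats_summary['mean'].plot(kind='bar', ax=axes[0], logy=True, title='mean (log scale)')
stats_summary['std'].plot(kind='bar', ax=axes[1], title='std')
plt.tight_layout();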
From the above bar plot, we can clearly see significant variations in the scales. Notice, Cd_1 vs Cd_9!
Moving forward, let’s get a pairplot and quickly see what the data looks like.
We can do more visualization using a range of plotting options to understand the data. However, in this lecture, machine learning is the focus, so we will get a quick overview using a pairplot and move on to the KNN algorithm.
sns.pairplot(df, hue='Result');
The above pairplot may not be very useful, it is very crowded! However, it provides a quick overview of the complete dataset. Look at the distributions, and see if you can find any correlations, or features that have more predictive power and separate the classes better than others. Spend some time here; knowing your data is key!
Well, we can also explore the features individually or in smaller subsets. It’s time-consuming, but very important for understanding the data. Try more plots to improve your understanding of the data. (To Do)
*******************************************************************
2. Baseline accuracy
Let’s look at the class distribution and find out the baseline accuracy.
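For example:
# class distribution of the target column
print(df['Result'].value_counts())
# normalized counts give the baseline accuracy directly
print(df['Result'].value_counts(normalize=True))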
So, we have an equal number of instances of each class. In the extreme case, if we have a model that predicts all Y or all N, the accuracy (baseline) of the model will be 0.5 (50%), an equal chance. We want a model that can perform much better than the baseline.
Let’s move on and separate the features and the target into (X, y), and then split the data into train (X_train, y_train) and test (X_test, y_test) sets using train_test_split().
==> Keep the test_size and random_state the same if you want the same results as given in this notebook!
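A sketch of the split; test_size=0.3 and random_state=42 are assumed values here, just keep whatever you choose consistent throughout:
from sklearn.model_selection import train_test_split
# separating the features (X) and the target (y)
X = df.drop('Result', axis=1)
y = df['Result']
# train/test split -- assumed test_size and random_state
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)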
*******************************************************************
3. Model training on unscaled data
Our focus is to come up with a model that can predict the class in the Result column for any unseen data point. For the KNN algorithm, k, the number of neighbors, is important. However, at the moment, we don't know the best number to use. Well, we can start with k = 3 (you can try 1, 2 or any number); we will try to optimize the value of k later on. (Another important thing to note: we have not scaled the features and are using them as-is!)
We need to import the KNeighborsClassifier from sklearn.neighbors.
from sklearn.neighbors import KNeighborsClassifier
Next, we need to create a KNN model instance with n_neighbors=3 (our choice).
# creating a model instance "knn"
n_neighbors = 3
knn = KNeighborsClassifier(n_neighbors=n_neighbors)
# shift+tab to see other available parameters; we will only use n_neighbors
Let’s fit the model instance, knn, to the training dataset.
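For example:
# fitting the classifier on the (unscaled) training data
knn.fit(X_train, y_train)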
3.1: Predictions and evaluations — unscaled data
Once the model is fitted on the training dataset, we can get predictions for the test part of the data using our trained model.
predictions = knn.predict(X_test)
Evaluation is very important to know whether our model is working well or not! Let’s see what the confusion matrix and classification report look like!
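For example:
from sklearn.metrics import confusion_matrix, classification_report
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))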
We have already computed the baseline accuracy, which is a 50% chance for either class. The model we have developed is not good at all. Its accuracy on the test data is even below the baseline.
👉 Recall the theory of KNN: it is a distance-based algorithm, and feature scaling is a very important step that we skipped intentionally. Let’s see if we can improve the model performance with feature scaling!
*******************************************************************
4. Effect of feature scaling on KNN
KNN makes decisions by collecting votes from the training data points that are nearest to the test data point (majority voting based on the k value). In such a situation, the scale of the features does matter. Variables on a larger scale will have a larger effect on the distance between observations, and hence on the KNN classifier, compared to variables on a smaller scale.
To deal with this issue, we need to do feature scaling first, so that all the features are on the same scale!
We also know that Scikit-learn has built-in functionality to do feature scaling. We need to import StandardScaler from Scikit-learn and also need to create a StandardScaler() object (e.g. scaler). (Try MinMaxScaler yourself and compare your results.)
Let’s do this!
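Something like:
from sklearn.preprocessing import StandardScaler
# creating a StandardScaler object
scaler = StandardScaler()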
This link on StackExchange has nice plots that may help you visually understand the scaling process using a very simple example.
Let’s split the data into new variables (features & target), and then fit the scaler instance to the features only!
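Something like this (following the lecture, the scaler is fitted on all the features):
# separating the features and the target into new variables
X = df.drop('Result', axis=1)
y = df['Result']
# fitting the scaler on the features only
scaler.fit(X)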
So, the object scaler is now fitted on the features. We can use this scaler object to transform all the features in the dataset. Scikit-learn provides the .transform() method to do the standardization job by centering and scaling the features.
But wait, before we move on, I want to introduce another important and simple thing here — saving the transformation.
4.1: Saving scaling transformation
As we have already discussed in our previous lectures, it is good practice to save/serialize the scaler transformation and use it when we need it for new data. Let’s accomplish this task of saving the scaler transformation using joblib here. You can use the pickle module as well (the way we did previously); however, it's good to know a little more!
Once we have saved/serialized the transformation, we can load it for an unknown/test dataset and get the transformed features for that data.
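A sketch using joblib; the file name scaler.joblib is an arbitrary choice:
import joblib
# saving/serializing the fitted scaler
joblib.dump(scaler, 'scaler.joblib')
# loading it back whenever we need it (e.g. for new/unseen data)
scaler = joblib.load('scaler.joblib')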
Ok, now pass the features to scaler.transform() to get the standardized features in a new variable, scaled_features!
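For example:
# centering and scaling all the features
scaled_features = scaler.transform(X)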
4.2: OPTIONAL — DataFrame for scaled features
(It’s easier to work with pandas, so we can just create a dataframe for the scaled features.)
The scaled_features object is a NumPy array; let's convert it to a pandas DataFrame! We can use df.columns to get the column names and pass them to DataFrame() along with scaled_features. We don't need the Result column, and [:-1] will do the job ([:-1] means everything but the last one)!
Creating a DataFrame for the scaled_features:
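A sketch (df_scaled_features is my choice of name):
# all column names except the target (the last column)
df_scaled_features = pd.DataFrame(scaled_features, columns=df.columns[:-1])
df_scaled_features.head()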
Our standardized/scaled data is ready for Machine Learning again!
Let’s split the data into train and test parts again. We can either use our newly created dataframe df_scaled_features or the NumPy array scaled_features. Let's try the NumPy array scaled_features for now.
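A sketch; the names X_train_scaled/X_test_scaled are my choice, and using the same (assumed) test_size and random_state as before keeps the test rows comparable:
# splitting the scaled features with the same settings as before
X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(
    scaled_features, y, test_size=0.3, random_state=42)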
*******************************************************************
5. Model training using scaled features
Let’s create a new instance with a different name, fit it on the training dataset, and do the prediction in a single cell. If you want, you can use the old name knn as well; however, it will be overwritten. For direct comparison, to see the effect of scaling, we will keep n_neighbors = 3, the same as without scaling.
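For example (knn_scaled is the new name):
# new instance on scaled features, same k for a fair comparison
knn_scaled = KNeighborsClassifier(n_neighbors=3)
knn_scaled.fit(X_train_scaled, y_train)
predictions_scaled = knn_scaled.predict(X_test_scaled)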
5.1: Predictions and evaluations — scaled features
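Using the same metrics as before:
print(confusion_matrix(y_test, predictions_scaled))
print(classification_report(y_test, predictions_scaled))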
Just look at the accuracy: a significant improvement using scaled features. We can clearly see the importance of feature scaling in distance-based algorithms such as KNN.
Well, we are not done yet; there is another thing that we need to check to see if we can further improve the performance of our classifier. (What about the k value? Are we happy with the number we just selected randomly?)
🧘🏻♂️ Please note that finding k is not trivial: for a small value of k, the effect of noise will be higher, while a much higher value will be computationally expensive. Typically, an odd value is selected, and values above sqrt(n_observations) are avoided; however, this is not a standard. 🧘🏻♂️
*******************************************************************
6. Elbow method to choose the k value
Let’s explore whether we can improve the model using a better value of k. To do this, we can use the elbow method to find a good value for k. In the elbow method, we iterate over a range of KNN models using different k values. Let's try using odd values of k!
==> We get the error rate for every k value and plot the error rate against k to find its optimum value. (reference: KNN theory lecture)
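A sketch of the loop; the range of k values (odd values from 1 to 39) is my choice:
err_rate = []
k_values = range(1, 40, 2)  # odd values of k

for k in k_values:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train_scaled, y_train)
    pred_k = knn_k.predict(X_test_scaled)
    # fraction of wrong predictions = error rate for this k
    err_rate.append(np.mean(pred_k != y_test))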
Now we have all the error-rate values in a list, err_rate. A visualization would be helpful here to figure out which value of k gives the lowest error rate.
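For example:
plt.figure(figsize=(10, 5))
plt.plot(k_values, err_rate, marker='o')
plt.xlabel('k - number of neighbors')
plt.ylabel('error rate')
plt.title('Error rate vs k');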
6.1: Plotting accuracy — alternative way to find k
As an alternative to the error rate, we can compute the accuracy score for each value of k and get the plot below; in both cases, we look for the point of the sharpest change.
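A sketch of the accuracy-based alternative, reusing the same k_values range as above:
from sklearn.metrics import accuracy_score

acc = []
for k in k_values:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    knn_k.fit(X_train_scaled, y_train)
    acc.append(accuracy_score(y_test, knn_k.predict(X_test_scaled)))

plt.figure(figsize=(10, 5))
plt.plot(k_values, acc, marker='o')
plt.xlabel('k - number of neighbors')
plt.ylabel('accuracy')
plt.title('Accuracy vs k');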
Looks like k=9 is where we get the elbow point (sharp bend). We can select k = 9 and see if our model improves.
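For example (knn_elbow is my choice of name):
# retraining with the elbow value k = 9
knn_elbow = KNeighborsClassifier(n_neighbors=9)
knn_elbow.fit(X_train_scaled, y_train)
predictions_elbow = knn_elbow.predict(X_test_scaled)

print(confusion_matrix(y_test, predictions_elbow))
print(classification_report(y_test, predictions_elbow))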
So, we can see further improvement in our model. It was worth finding the optimum value of k using the elbow method.
*******************************************************************
7. Saving and loading the trained model — Same old story
I hope you can save and load a trained model on your own now! Try it yourself before you look at the code below!
See the linear and logistic regression lectures and copy the code from those notebooks!
We need to save our best model, which is actually trained on scaled features in this case.
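A sketch with joblib; the file name is arbitrary:
# saving the best model (trained on scaled features, k = 9)
joblib.dump(knn_elbow, 'knn_elbow_model.joblib')

# loading it back and using it for predictions
loaded_model = joblib.load('knn_elbow_model.joblib')
loaded_model.predict(X_test_scaled)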
*******************************************************************
8. ROC curve — model comparisons
We have y_test as Y/N; let's convert Y to 1 and N to 0. We need binary targets (1/0) for the ROC curve.
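A sketch of the binarization, along with the AUC values (knn_auc, knn_scaled_auc, knn_elbow_auc) used as labels in the plot below:
from sklearn.metrics import roc_curve, roc_auc_score

# binary targets: 1 for 'Y', 0 for 'N'
y_test_b = (y_test == 'Y').astype(int)

# AUC for each of the three models
knn_auc = roc_auc_score(y_test_b, knn.predict_proba(X_test)[:, 1])
knn_scaled_auc = roc_auc_score(y_test_b, knn_scaled.predict_proba(X_test_scaled)[:, 1])
knn_elbow_auc = roc_auc_score(y_test_b, knn_elbow.predict_proba(X_test_scaled)[:, 1])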
Let’s create the ROC curve; the code below should be easy to understand and follow at this stage!
# I hope you can understand the code below at this stage!
# setting the figure size
plt.figure(figsize=(18, 8))

# Plot NO SKILL - a line for a random guess
plt.plot([0, 1], [0, 1], linestyle='--', label='Random guess', lw=8)

# For AS-IS data (unscaled features)
fpr_knn, tpr_knn, thresholds_knn = roc_curve(y_test_b, knn.predict_proba(X_test)[:, 1])
plt.plot(fpr_knn, tpr_knn, marker='.', label='ROC-AUC-KNN AS-IS data:%.3f' % knn_auc)

# For SCALED data
fpr_knn_scaled, tpr_knn_scaled, thresholds_knn_scaled = roc_curve(y_test_b, knn_scaled.predict_proba(X_test_scaled)[:, 1])
plt.plot(fpr_knn_scaled, tpr_knn_scaled, marker='.', label='ROC-AUC-KNN Scaled data:%.3f' % knn_scaled_auc)

# At ELBOW point with SCALED data
fpr_knn_elbow, tpr_knn_elbow, thresholds_knn_elbow = roc_curve(y_test_b, knn_elbow.predict_proba(X_test_scaled)[:, 1])
plt.plot(fpr_knn_elbow, tpr_knn_elbow, marker='.', label='ROC-AUC-KNN Elbow-Scaled data:%.3f' % knn_elbow_auc)

# Let's set the limits, title, labels, legend ...etc
plt.xlim([0, 1])
plt.ylim([0, 1.01])
plt.title('ROC for KNN (AS-IS data, Scaled data, Scaled data with Elbow point)')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend();
This was all about KNN at the moment. We will do a quick comparison between logistic regression and KNN in the next lecture using breast cancer data!
*******************************************************************
💐Click here to FOLLOW ME for new contents💐
🌹Keep practicing to brush-up & add new skills🌹
✅🌹💐💐💐🌹✅ Please clap and share >> you can help us to reach to someone who is struggling to learn these concepts.✅🌹💐💐💐🌹✅
Good luck!
See you in the next lecture on “A38: Logistic regression vs KNN!”.
Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:
**************************************************************************************************************************************
Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.
**************************************************************************************************************************************