A36: K-Nearest Neighbors (KNN) — Understand With Hands-on Code!

Euclidean, Manhattan & Minkowski distance, Breaking ties in KNN, Parametric & nonparametric models, Advantages & disadvantages

Junaid Qazi, PhD
10 min read · Jan 30, 2022

This article is a part of “Data Science from Scratch — Can I to I Can”, a Lecture Notes Book Series. (Click here to get your copy today!)

Click here for the previous article/lecture on “A35: K-Nearest-Neighbors (KNN)-Behind The Scene!”

💐Click here to FOLLOW ME for new content💐

⚠️ We will be doing hands-on coding to understand the “ins and outs” of the KNN model, and we will talk about ties, advantages and disadvantages ….. and much more! 🧘🏻‍♂️Doing is Learning.🧘🏻‍♂️

✅ A Suggestion: Open a new Jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful…..!”

🧘🏻‍♂️ 👉🎯

Welcome back guys, let’s explore a little more and learn KNN by doing!

So far, we have learned how regression (linear and logistic) algorithms can be used in supervised machine learning.

In the previous lectures, we explored the evaluation methods for both linear regression (continuous target) and logistic regression (class prediction). In a classification problem, we are generally concerned with how often observations are identified incorrectly. We want our algorithm to predict as correctly as possible on the test (unseen) dataset for good generalization. KNN is another very simple and widely used algorithm, typically for classification; however, it can be used for regression tasks as well.

1. KNN review and distance functions

  • 1.1: Euclidean distance
  • 1.2: Manhattan distance
  • 1.3: Minkowski distance

2. Visualize KNN working — hands-on

  • 2.1: Visualize training and the test data
  • 2.2: Computing distances from test point
  • 2.3: Plotting distances and selected k value

3. Advantages and disadvantages

4. Readings

5. A note on parametric and nonparametric models — good to know

6. Breaking ties

1. KNN review and distance functions

As discussed in the slides, KNN counts how many observations belong to each class within the selected k (number of neighbors) and makes a decision from there, assigning the test point the class with more votes. The algorithm stores all available data points and computes their distances from the test data for its classification task. It is also important to remember that the physical units of the features matter here, and they all have to be on a similar scale. Feature scaling (standardization or normalization) can drastically improve the model performance.

🧘🏻‍♂️Distance functions for KNN🧘🏻‍♂️

Typically, the KNN algorithm uses the Euclidean or Manhattan distance function; it is rare that other distance metrics are used. We can also create our own distance function to use in the algorithm (depending on our custom needs).

A few common distance functions for KNN:

1.1: Euclidean distance

Euclidean distance is the straight-line distance between two points. It is the most common choice and can be calculated using the Pythagorean theorem from the Cartesian coordinates of the two points. Sometimes the Euclidean distance is also called the Pythagorean distance.
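In a plane with “𝑎” at (𝑎1,𝑎2) and “𝑏” at (𝑏1,𝑏2), the Euclidean distance between “𝑎” and “𝑏” is √((𝑎1−𝑏1)² + (𝑎2−𝑏2)²).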

The figure below describes (on the left) the Pythagorean theorem and (on the right) the computation of the Euclidean distance between two points in two-dimensional Euclidean space.

1.2: Manhattan distance

Manhattan distance is the distance between two points measured along axes at right angles.

On the left, in a grid of unit squares, the black lines show Manhattan distances: any number of paths along the grid will give an equal distance (10 units) between 𝑎 and 𝑏. The Euclidean distance (5×√2 ∼7.07 units) is shown with the green line between the points. On the right, real maps between two different points near Times Square, Manhattan, NYC are shown; the distances are equal in each individual map.

In a plane with “𝑎” at (𝑎1,𝑎2) and “𝑏” at (𝑏1,𝑏2), the Manhattan distance between “𝑎” and “𝑏” is |𝑎1−𝑏1|+|𝑎2−𝑏2|.

1.3: Minkowski distance

Minkowski distance can be considered a generalization of the Euclidean and Manhattan distances. Between two points 𝑎 at (𝑎1,𝑎2) and 𝑏 at (𝑏1,𝑏2), it is defined as (|𝑎1−𝑏1|^𝑝 + |𝑎2−𝑏2|^𝑝)^(1/𝑝), so 𝑝=1 gives the Manhattan distance and 𝑝=2 gives the Euclidean distance.

𝑝 is typically set to a value between 1 and 2; however, it can be any real value. (Please note that for values of 𝑝<1, the formula above does not define a valid distance metric, since the triangle inequality is not satisfied.)
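As a small illustration (my own sketch, not taken from the original notebook), we can compute all three distances with NumPy for two example points 𝑎 = (1, 2) and 𝑏 = (4, 6):

import numpy as np
a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])
euclidean = np.sqrt(np.sum((a - b) ** 2))          # straight-line distance -> 5.0
manhattan = np.sum(np.abs(a - b))                   # distance along the axes -> 7.0
p = 1.5                                             # any value between 1 and 2
minkowski = np.sum(np.abs(a - b) ** p) ** (1 / p)   # generalized Minkowski distance
print(euclidean, manhattan, minkowski)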

Let’s move on and do this practically.

As always, we need to import some basic libraries…!

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set(font_scale=1.5)     # setting font size for the whole notebook
sns.set_style('whitegrid')  # setting the style (optional)
# Retina display to see better quality images.
%config InlineBackend.figure_format = 'retina'
# Lines below are just to ignore warnings
import warnings
warnings.filterwarnings('ignore')
Library versions in use!

*******************************************************************

2. Visualize KNN working — hands-on

Let’s generate a dataset with two classes and see how the KNN algorithm works in practice when assigning a class to a new data point.

🧘🏻‍♂️The dataset🧘🏻‍♂️

We can use make_biclusters() from scikit-learn to create a simple dataset with 2 features (columns) and 50 observations (data points). We can also add some Gaussian noise while creating the clusters and assign each cluster a class.

Let’s do this!
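The original code is in the notebook; a minimal sketch of how such a dataset could be built (the noise level and random_state here are my own choices, not the author’s) looks like this:

from sklearn.datasets import make_biclusters
import pandas as pd
# 50 observations, 2 features, 2 clusters, with some Gaussian noise added
data, rows, _ = make_biclusters(shape=(50, 2), n_clusters=2, noise=20, random_state=101)
# rows is a boolean mask of shape (n_clusters, n_samples);
# argmax along axis 0 gives a class label (0 or 1) for every observation
df = pd.DataFrame(data, columns=['feature_1', 'feature_2'])
df['target'] = rows.argmax(axis=0)
df['target'].value_counts()  # the classes are typically not balanced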

So, we have created a dataset using make_biclusters with two features, “feature_1” and “feature_2”. The target class is not balanced!

2.1: Visualize training and the test data

So, we have the data with two features and a target column. Let’s create a scatter plot and visualize the distribution of the data points. We can use the hue parameter to show the classes in different colors. In another plot (right side), we can add a test point whose class is not known and which we want KNN to predict.

The code below should be self-explanatory at this stage; it’s our 36th lecture in the series!

# Plotting the data on two axes, side by side
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(18, 8))
sns.scatterplot(x='feature_1', y='feature_2', data=df, hue='target', ax=ax1, s=150)
ax1.set_title("The data — two classes")
# Plot our new point -- the star in the figure on the right
test_point = [[10, 50]]
sns.scatterplot(x='feature_1', y='feature_2', data=df, hue='target', ax=ax2, s=150)
ax2.scatter(x=test_point[0][0], y=test_point[0][1], color="red", marker="*", s=1000)
ax2.set_title('Red star is a test (unknown) point')
plt.tight_layout();
The star in red is a new data point; we need to find its class label using KNN!

The red star is a new, unknown data point whose class we want our KNN algorithm to predict, and for this purpose we need to perform the following steps:

👉 Step-1: Compute the distances (using the chosen metric) between the test point and all the data points.
👉 Step-2: Select 𝑘, the number of nearest neighbors that claim ownership of the test point by voting.
👉 Step-3: Majority wins, so the class with more votes is assigned to the test point.

You might be thinking about a valid question: what happens when two classes have the same number of votes? Resolving ties! No worries, we will discuss this in a while!

2.2: Computing distances from test point

To make a decision, we need to know the distances from the test point to all other data points. We can simply use the pairwise_distances function from scikit-learn for this purpose. By default, it uses the Euclidean distance metric; explore the documentation to learn more about the available options.
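As a sketch of this step (the exact code is in the notebook; this version assumes the df and test_point defined above):

from sklearn.metrics import pairwise_distances
# distances from every training point to the single test point (Euclidean by default)
df['distance'] = pairwise_distances(df[['feature_1', 'feature_2']], test_point).ravel()
df.sort_values('distance').head()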

A new column “distance” is added with computed pairwise distances.

2.3: Plotting distances and selected k value

So, we have computed the distances and added them to our original dataframe. It is good to visualize them and see how the k value affects the predicted class for the test data.

  • Let’s write a simple function to get the plots. The function only takes the value of k and is valid for binary classification. However, it can easily be customized for a multi-class classification problem (try it yourself by making three or four clusters — a good exercise!)

Please read the comments; we are also dealing with possible ties in the implementation.
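The original function lives in the notebook; a possible sketch of it (assuming the df with the 'distance' and 'target' columns created above, and a closest-neighbor tie-break of my own choosing) could look like this:

import matplotlib.pyplot as plt

def plot_knn_distances(k):
    """Plot sorted pairwise distances per class and report the vote for the test point."""
    fig, ax = plt.subplots(figsize=(12, 6))
    nearest = df.nsmallest(k, 'distance')   # the k nearest neighbors of the test point
    votes = nearest['target'].value_counts()
    # plot the sorted distances for each class so the k-th neighbor is easy to spot
    for label, group in df.groupby('target'):
        ax.plot(group['distance'].sort_values().values, marker='o', label=f'class {label}')
    ax.axhline(nearest['distance'].max(), ls='--', color='gray', label=f'distance of neighbor #{k}')
    ax.set_xlabel('neighbor rank within class')
    ax.set_ylabel('distance from test point')
    ax.legend()
    # majority vote; a possible tie is broken in favor of the single closest neighbor
    if len(votes) > 1 and votes.iloc[0] == votes.iloc[1]:
        winner = nearest.nsmallest(1, 'distance')['target'].iloc[0]
        print(f'Tie for k={k}; assigning the class of the closest neighbor: {winner}')
    else:
        winner = votes.idxmax()
        print(f'Predicted class for the test point with k={k}: {winner}')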

Let’s call the function above and get our plots with plotted pairwise distances for both classes in the target.

# Function call with the selected k value
plot_knn_distances(k=20)
# To Do: Try different numbers for k and see what class you are getting!
# Try modifying the function to store the output class for the test point with different k values

Now we have a complete understanding of the KNN algorithm and its working principle. It is one of the simplest algorithms for classification.

Let’s see how scikit-learn implements this algorithm for us!

We will explore and implement a KNN project using scikit-learn in the next lecture. It’s only for comparison here!
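As a minimal sketch of that comparison (not the original notebook code, and reusing the df and test_point from above), the same prediction with scikit-learn could look like this:

from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=20)
knn.fit(df[['feature_1', 'feature_2']], df['target'])
knn.predict(test_point)  # class predicted for the red-star test point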

We actually know there are ties for 20 neighbors, right? It looks like scikit-learn deals with ties internally. It is always recommended to read the documentation of the version in use to see how ties are handled.

==> A rule of thumb is to select an odd value for k in binary classification; this eliminates the root cause of ties between the two classes.

*******************************************************************

3. Advantages and disadvantages

🧘🏻‍♂️KNN — Advantages🧘🏻‍♂️

  • Simple to understand and explain
  • Model training is fast
  • KNN is a lazy learner: the algorithm does not learn a model from the training data, it simply memorizes the training dataset.
  • Can be used for classification and regression
  • A non-linear algorithm.
  • The decision boundaries of KNN are locally linear segments, but in general have a complex shape that is not equivalent to a line in 2D or a hyperplane in higher dimensions.

🧘🏻‍♂️KNN — Disadvantages🧘🏻‍♂️

  • All training data must be stored
  • The prediction phase can be slow for large datasets, since distances to all stored training points must be computed for every test point
  • Sensitive to irrelevant features
  • Sensitive to feature scaling
  • Accuracy is (generally) not competitive with the best supervised learning methods

We also know that k is very important; we will find the best value of k for the selected dataset in the next lecture. That is enough for now!

*******************************************************************

4. Readings

*******************************************************************

5. A note on parametric & nonparametric models — good to know

Thus far, all of our tests and methods have been parametric. That is, we have assumed a certain distribution for our data. In linear regression our parameters are the coefficients in our model, and our estimate of the target is calculated from these parameters.

  • There are alternatives for the case where we cannot assume a particular distribution for our data, or choose not to. These methods are nonparametric.

=> When we make no assumptions about the distribution of our data, we call our methods nonparametric. For nearly every parametric test, there is a nonparametric analog available.

The KNN model is an example of a nonparametric model. You can see that there are no coefficients for the different predictors and our estimate is not represented by a formula of our predictor variables.

*******************************************************************

6. Breaking ties

We already raised this question and implemented a way to tackle the situation; let’s revisit it!

👉 What if we have an equal number of votes per class when predicting a point?

  • We can weight the samples that are closer to the point so they have more value in the voting process (see the short sketch after this list).
  • Using an odd value for 𝑘 can help avoid classes having the same score for a predicted point.
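As a hedged sketch (my own example, reusing the df and test_point from above), scikit-learn’s KNeighborsClassifier supports distance-weighted voting directly:

from sklearn.neighbors import KNeighborsClassifier
# weights='distance' makes closer neighbors count more, reducing the chance of ties
weighted_knn = KNeighborsClassifier(n_neighbors=20, weights='distance')
weighted_knn.fit(df[['feature_1', 'feature_2']], df['target'])
weighted_knn.predict(test_point)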

👉 KNN seems pretty simple. Is it?

  • With uniform weights (no distance weighting), computing the distances from a test point to all known points becomes very inefficient for larger datasets. Even with weighted distances, KNN is still not very efficient. Not a big deal if things are working with our data!

*******************************************************************

💐Click here to FOLLOW ME for new content💐

🌹Keep practicing to brush up & add new skills🌹

✅🌹💐💐💐🌹✅ Please clap and share >> you can help us reach someone who is struggling to learn these concepts.✅🌹💐💐💐🌹✅

Good luck!

See you in the next lecture on A37: Importance of feature scaling in KNN — hands-on implementation.

Note: This complete course, including video lectures and Jupyter notebooks, is available at the following links:

**************************************************************************************************************************************

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.

**************************************************************************************************************************************

