A35: K-Nearest-Neighbors (KNN) — Behind the Scenes!

KNN, K-Nearest-Neighbors, Curse of Dimensionality, effect of k (number of neighbors)

Junaid Qazi, PhD
5 min read · Jan 27, 2022

This article is a part of “Data Science from Scratch — Can I to I Can”, A Lecture Notes Book Series. (click here to get your copy today!)

Click here for the previous article/lecture on “A34: Handling Missing Data, Interaction Terms, Grid-Search, Model Training and Evaluation.”

💐Click here to FOLLOW ME for new content💐

🧘🏻‍♂️Welcome guys to the K-Nearest-Neighbors lecture!🧘🏻‍♂️

K-Nearest-Neighbors (also known as KNN) is another very important and widely used machine learning algorithm, and one of the simplest of all, for classification (it is also used for regression problems).

  • The principle behind the KNN method is to find a pre-defined number of training samples, closest in distance to the test point, and predict the label from these.

In simple words, KNN stores the entire dataset as its training step and categorizes a test data point using the stored information about its nearest neighbors, where k is the number of neighbors.

Let’s learn with an example dataset!

Consider that we are given the dataset in the picture below:

  • We have a dataset with two classes, ‘A’ & ‘B’, and the entire data, along with its features f1 & f2, is stored as the training step.
  • We want to predict the class for the red, green & blue test data points.
The code used to generate this dataset is from scikit-learn (a minimal sketch is shown after this list).
  • Based on the association with their neighboring points, it is straightforward to predict the class for red and green.
  • The nearest neighboring data points for red are in class ‘B’, so KNN will predict class B for red.
  • The nearest neighboring data points for green are in class ‘A’, so KNN will predict class A for green.
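As a rough sketch of how such a two-class dataset could be generated with scikit-learn (the sample size, cluster centers, and random seed below are assumptions for illustration, not the exact code behind the figure):

```python
# A minimal sketch: generate and plot a two-class toy dataset with features f1 & f2.
# The sample size, centers, and random seed are assumptions for illustration only.
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=100, centers=2, n_features=2, random_state=42)

plt.scatter(X[:, 0], X[:, 1], c=y, cmap="coolwarm", edgecolor="k")
plt.xlabel("f1")
plt.ylabel("f2")
plt.title("Two-class dataset: 'A' (0) vs 'B' (1)")
plt.show()
```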

⚠️ What about the blue data point in the figure above?

To predict the class for the blue data point, let’s discuss KNN with another, simpler example with fewer data points.

We already know that in KNN, the training step is actually just storing the entire dataset, right?

For predictions, KNN simply looks at the “k” points in the training dataset that are nearest to the test input, counts how many members of each class are in that set, and returns the class with the highest frequency as the prediction. So, the class with the most votes is taken as the prediction.

In other words, the algorithm calculates the distance from the test point to all data points in the training dataset, sorts the points by increasing distance from the test input, and predicts the majority class among the k closest points.
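Here is a minimal from-scratch sketch of these distance–sort–vote steps (the function name, the tiny toy data, and the choice of Euclidean distance are illustrative assumptions, not the code we will use in the hands-on lecture):

```python
# A minimal from-scratch sketch of KNN prediction (names and data are illustrative).
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    # 1. Compute the distance from the test point to every training point.
    distances = np.linalg.norm(X_train - x_test, axis=1)
    # 2. Sort by increasing distance and keep the indices of the k closest points.
    nearest_idx = np.argsort(distances)[:k]
    # 3. Predict the majority class among those k neighbors.
    votes = Counter(y_train[nearest_idx])
    return votes.most_common(1)[0][0]

# Tiny example: two classes 'A' and 'B' in a 2D feature space (f1, f2).
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [8.5, 9.0]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_predict(X_train, y_train, np.array([1.2, 1.5]), k=3))  # -> 'A'
```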

Well, the value of k does matter!

Did you notice that the value of ‘k’ will affect the predictions?

👉 Selection of k is very important, and we will learn how to select k for our dataset in the hands-on lecture!
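Just to get an early feel for the effect of k before the hands-on lecture, here is a small sketch using scikit-learn’s KNeighborsClassifier; the dataset, the query point, and the two values of k are assumptions for illustration.

```python
# A small sketch: the same query point may receive different labels for different k.
# The data, the query point, and the values of k are assumptions for illustration.
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
query = [[0.0, 0.0]]

for k in (1, 15):
    model = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(f"k={k:2d} -> predicted class for {query[0]}: {model.predict(query)[0]}")
```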

KNN is simple and works quite well with a small number of input variables (features or dimensions), provided it is given a good distance metric and has enough labeled training data. However, the main problem with KNN is that it does not work well with high-dimensional inputs. The poor performance in high-dimensional settings is due to the Curse of Dimensionality.

👉 Let’s understand the concept of the Curse of Dimensionality with a very simple example. Consider that you have a straight line which is 100 yards long, and you dropped a penny somewhere on it. It would not be too hard to find the penny along this 100-yard-long line: you walk along the line and it takes two minutes. This one line is a single dimension!

Curse of Dimensionality! 1D example using a straight line (left), 2D example using a square (middle), 3D example using a cube (right). What about increasing the number of dimensions/features in your data?

Now let’s say you have a square with 100 yards on each side, so you have 2 dimensions here, and you dropped a penny somewhere on it. It would be pretty hard to find, like searching across two football fields stuck together. What do you think, it might take a day to find the lost penny!

👉 What if you have a cube with 100 yards along each side, and you lost your penny? How difficult would it be to find the lost penny?

👉 This is an example of only three features/dimensions. It’s like searching a 30-story building the size of a football stadium.

👉 The difficulty of searching through the space gets “a lot” harder as you add more dimensions.

When the number of features or dimensions is large, there tends to be a deterioration in the performance of KNN and other local approaches that perform prediction using only observations that are near the test observation for which a prediction must be made. This phenomenon is known as the curse of dimensionality.
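A quick way to see why “nearest” loses its meaning in high dimensions is to simulate random points and compare the distance to the nearest neighbor with the distance to the farthest one. The sketch below (uniform data in a unit hypercube, 1,000 points) is an assumed setup for illustration.

```python
# A rough simulation: as the number of dimensions grows, the nearest and farthest
# training points end up at almost the same distance from a random query point,
# so the notion of a "nearest" neighbor becomes much less informative.
import numpy as np

rng = np.random.default_rng(0)
n_points = 1000
for d in (1, 2, 10, 100, 1000):
    X = rng.random((n_points, d))   # training points in the unit hypercube
    q = rng.random(d)               # a random query point
    dist = np.linalg.norm(X - q, axis=1)
    print(f"d={d:4d}  nearest/farthest distance ratio: {dist.min() / dist.max():.3f}")
```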

🧘🏻‍♂️Let’s try to understand the curse of dimensionality with another example diagram!

Illustration re-created from p. 35 of Pattern Recognition and Machine Learning (2006) by Bishop.

The illustration above depicts a simple example of the curse of dimensionality. From left to right, we can see how the number of regions of a regular grid grows exponentially with the dimensionality of the space.
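A tiny numeric sketch of that exponential growth (the resolution of 10 cells per axis is an assumption): with d features, a regular grid with 10 divisions along each axis has 10^d cells, so the amount of data needed to keep every region populated explodes.

```python
# Number of cells in a regular grid with 10 divisions per axis, for d dimensions.
divisions = 10  # assumed grid resolution per feature
for d in (1, 2, 3, 10):
    print(f"d={d:2d} -> {divisions ** d:,} cells")
# d= 1 -> 10 cells
# d= 2 -> 100 cells
# d= 3 -> 1,000 cells
# d=10 -> 10,000,000,000 cells
```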

👉 Typically, the performance of a KNN model (classifier) is related to the number of features (dimensions). In the beginning, as the dimensionality increases, the KNN classifier’s performance increases (see the illustration given below).

👉 The model’s performance is at its maximum once we reach the optimal number of features. If we further increase the dimensionality without increasing the number of training samples, the model’s performance starts decreasing.

So, this was just a warm-up lecture on KNN; we will discuss the types of distances used in KNN, how to break ties, advantages, disadvantages, and much more with code examples in the next lecture. See you there!

💐Click here to FOLLOW ME for new content💐

🌹Keep practicing to brush up & add new skills🌹

✅🌹💐💐💐🌹✅ Please clap and share >> you can help us reach someone who is struggling to learn these concepts.✅🌹💐💐💐🌹✅

Good luck!

See you in the next lecture on “A36: KNN — Understand With Hands-on Code!”.

Note: This complete course, including video lectures and Jupyter notebooks, is available via the following links:

**************************************************************************************************************************************

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.

**************************************************************************************************************************************
