A38: Logistic regression vs KNN — breast cancer dataset

Logistic regression vs KNN, breast cancer data, odds ratio

Junaid Qazi, PhD
8 min read · Feb 9, 2022

This article is a part of “Data Science from Scratch — Can I to I Can”, A Lecture Notes Book Series. (click here to get your copy today!)

Click here for the previous article/lecture on “A37: Importance of feature scaling in KNN — hands-on implementation”

💐Click here to FOLLOW ME for new content💐

⚠️ We will be using the breast cancer dataset in this project.

✅ A Suggestion: Open a new Jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful!”

🧘🏻‍♂️ 👉🎯

Hello guys,

So far, we have learned two models for classification: logistic regression and KNN. Recall the no free lunch theorem, we need to find the best model for our data, right?

Let’s move on and see which one works better for the given data (logistic regression or KNN). This lecture is distributed as follows:

1. The breast cancer dataset
2. Basic imports
3. Loading data and EDA
4. Baseline model accuracy
5. Machine Learning

  • 5.1: Logistic regression
  • 5.2: KNN

6. Model Selection
7. Final model
8. To do

1. The breast cancer dataset

Most of the time, benign tumors are not dangerous (though benign brain tumors can be life-threatening) as they lack the ability to spread throughout the body. They also lack the ability to invade neighboring tissue and can be removed with a high chance of not growing back. Still, benign tumors can have other negative health effects, and through the process of tumor progression, many of their types can turn malignant (cancerous). Breast cancer is one of the biggest health challenges in women, which, along with serious health complications, can end up in mastectomy (partial or complete removal of one or both breasts by surgery). Hence, early detection of the type of breast tumor, and planning the treatment accordingly, is extremely important.

Here is the link to the original breast cancer dataset, where we can find all the details and the related research papers as well. The dataset has 569 observations and 30 features (all numeric).

🎯 The target classes are M (Malignant) and B (Benign) types of breast cancer, and the class distribution is:

🎯 212 — Malignant (represented by 0)
🎯 357 — Benign (represented by 1)

In the dataset, given below are the ten real-valued features that are computed for each cell nucleus:

👉 radius (mean of distances from center to points on the perimeter)
👉 texture (standard deviation of gray-scale values)
👉 perimeter
👉 area
👉 smoothness (local variation in radius lengths)
👉 compactness (perimeter² / area − 1.0)
👉 concavity (severity of concave portions of the contour)
👉 concave points (number of concave portions of the contour)
👉 symmetry
👉 fractal dimension (“coastline approximation” − 1)

==> The mean, standard error, and worst or largest (mean of the three worst/largest values) of the above features were computed for each image, resulting in 30 features. For example, field 0 is Mean Radius, field 10 is Radius SE, field 20 is Worst Radius (please see the data columns and/or summary).

Let’s try to find out our best model for the breast cancer data, KNN or logistic regression!

*******************************************************************

2. Basic imports

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
sns.set(font_scale=1.3) # setting font size for the whole notebook
sns.set_style("whitegrid") # setting the style
# Retina display to see better quality images.
%config InlineBackend.figure_format = 'retina'
# Lines below are just to ignore warnings
import warnings
warnings.filterwarnings('ignore')

*******************************************************************

3. Loading data and EDA

As usual, a copy of the dataset is stored on GitHub; we are going to read it using the raw data URL from there.

data_url="https://raw.githubusercontent.com/junaidqazi/DataSets_Practice_ScienceAcademy/master/Breast_Cancer_Diagnostic_Wisconsin.csv"
df=pd.read_csv(data_url)
# Try and see how the data look like
#df.head(2)
#df.info()
#df.describe() # good to look at the statistical summary
# Let's see the class distribution
# df.target.value_counts()

Well, there is no missing data and all the columns are numeric, which saves a lot of hassle. The class distribution looks fine as well; it is not equal but workable.

Moving forward, we can quickly look at the distributions of all or selected features, separated by class, using the groupby() function.
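A minimal sketch of such a cell, assuming the feature columns are named "mean radius", "mean texture", etc. (as in scikit-learn's copy of this dataset) and the target column is called "target" (please adjust to your actual column names):

# Histograms of two selected features, separated by target class (0 = malignant, 1 = benign)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for ax, col in zip(axes, ["mean radius", "mean texture"]):
    for label, group in df.groupby("target"):  # one group per class
        ax.hist(group[col], bins=25, alpha=0.6,
                label="Malignant (0)" if label == 0 else "Benign (1)")
    ax.set_xlabel(col.title())
    ax.set_ylabel("Count")
    ax.legend()
plt.tight_layout()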

The histograms for mean radius and mean texture are quite informative.

From the plot on the left, we see that the mean radius has higher predictive power compared to the mean texture. Supposing we only know the mean radius, a value around 10 units suggests benign, whereas around 20 units suggests malignant. The overlapping region is where either type is possible.

From the plot on the right, the mean texture has quite a lot of overlap; still, we expect lower values of mean texture for the benign type of tumor.

We can actually plot all the features separated by the target class using a for loop; that would help us look at all the distributions!

A simple for loop to get the plots below! Spend some time to understand the code…!
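A sketch of what that loop could look like, again assuming the target is the last column (the coefficient-plot code at the end of the article makes the same assumption):

# One histogram per feature, colored by target class -- a quick visual scan
# of which features separate the two classes well.
features = df.columns[:-1]  # all columns except the target
fig, axes = plt.subplots(6, 5, figsize=(20, 22))
for ax, col in zip(axes.ravel(), features):
    for label, group in df.groupby("target"):
        ax.hist(group[col], bins=25, alpha=0.6, label=str(label))
    ax.set_title(col, fontsize=11)
axes.ravel()[0].legend(title="target")
plt.tight_layout()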

Look at the distributions; the features for which the class separation is clearer are the strong predictors. Get more plots, look at the correlations, and so on. EDA is important.

*******************************************************************

4. Baseline (model) accuracy

If we think about two situations where our model blindly predicts one of the classes in the target column for all the data points, our preferred model would be the one that predicts the most frequently occurring class. This would be the baseline model.

Let’s find out the baseline accuracy score for the given dataset under such circumstances.
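A minimal sketch of that computation:

# Baseline accuracy: the score of a "model" that always predicts
# the most frequent class in the target column.
baseline_accuracy = df.target.value_counts(normalize=True).max()
print(f"Baseline accuracy: {baseline_accuracy:.2f}")  # 357 benign out of 569 => ~0.63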

So, the baseline score is ~0.63 (~63%)

So, if the model always predicts class 1 (Benign), its accuracy will be ~0.63 (~63%) -- the baseline accuracy. Our trained model must perform better than this.

==> By the way, in such a screening project (especially with this data), it is very dangerous to have a model that only predicts class 1; we could put real patients at much higher risk (and vice versa)!

*******************************************************************

5. Machine Learning

We will train KNN and logistic regression and compare their performance; the final model will be the better of the two!

>>Starting with some imports<<

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix

>>Separating features and the target, feature scaling and train-test split<<

We know the function of random_state; let's try to get that number in a different way, fun to know a little more!
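The playful way of picking the random_state is not reproduced here; a minimal sketch with a plain fixed value (and an assumed 30% test split) could look like this:

# Separate features (X) and target (y) -- target is assumed to be the last column
X = df.drop("target", axis=1)
y = df["target"]

# Scale the features (KNN is distance based, so scaling matters!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Train-test split; any fixed random_state keeps the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42)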

So, we have scaled the features and separated the data into (X_train, y_train) and (X_test, y_test) groups.

5.1: Logistic regression

Let’s start with logistic regression.
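A sketch of the training cell; the variable name logR matches the coefficient code at the end of the article, and the default settings are an assumption:

# Train logistic regression and check accuracy on the train and test sets
logR = LogisticRegression()
logR.fit(X_train, y_train)

print("Train accuracy:", logR.score(X_train, y_train))
print("Test accuracy :", logR.score(X_test, y_test))

# 5-fold cross-validation on the training data
cv_logR = cross_val_score(logR, X_train, y_train, cv=5).mean()
print("CV accuracy   :", cv_logR)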

Nice and colorful visualizations have greater impact; let's write a function to get a pretty confusion matrix this time!
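One possible version of such a helper, built on seaborn's heatmap (the function name and labels are made up here):

def pretty_confusion_matrix(y_true, y_pred, labels=("Malignant (0)", "Benign (1)")):
    """Plot the confusion matrix as an annotated seaborn heatmap."""
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(5, 4))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=labels, yticklabels=labels)
    plt.xlabel("Predicted class")
    plt.ylabel("True class")
    plt.show()

pretty_confusion_matrix(y_test, logR.predict(X_test))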

5.2: KNN

Let’s see how KNN works with the data; we can use the default parameters, k = 5 in the installed version of scikit-learn.
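A sketch of the KNN cell, reusing the helper above:

# KNN with the default number of neighbors (k = 5)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

print("Train accuracy:", knn.score(X_train, y_train))
print("Test accuracy :", knn.score(X_test, y_test))
cv_knn = cross_val_score(knn, X_train, y_train, cv=5).mean()
print("CV accuracy   :", cv_knn)

pretty_confusion_matrix(y_test, knn.predict(X_test))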

*******************************************************************

6. Model Selection

Let’s summarize the results (accuracy scores here) in a table to compare.
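One way to build such a table from the scores computed above (a sketch, not necessarily the article's exact layout):

# Collect the scores in a small dataframe for a side-by-side comparison
scores = pd.DataFrame({
    "train accuracy": [logR.score(X_train, y_train), knn.score(X_train, y_train)],
    "test accuracy": [logR.score(X_test, y_test), knn.score(X_test, y_test)],
    "CV accuracy": [cv_logR, cv_knn],
}, index=["Logistic regression", "KNN"])
scores.round(3)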

In this simple test, we see that the logistic regression model performs better than KNN on both the train and test datasets. The cross-validation scores on the training data are slightly lower; however, the difference is not large enough to be considered strong evidence of over-fitting.

*******************************************************************

7. Final model

So, we agreed on the procedural pipeline and on logistic regression as the best choice; we need to train the final model on the complete dataset before we save it.
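A sketch of what that final cell could look like; joblib and the file names are assumptions, since the article does not show which tool it uses to save the model:

# Re-fit the scaler and the chosen model on the complete dataset
final_scaler = StandardScaler()
X_all = final_scaler.fit_transform(X)

final_model = LogisticRegression()
final_model.fit(X_all, y)

# Persist both objects so they can be reused later
import joblib
joblib.dump(final_scaler, "breast_cancer_scaler.joblib")
joblib.dump(final_model, "breast_cancer_logreg.joblib")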

Questions:

  • Do you think you can further improve your model?
  • Do you think the default probability cut-off (0.5) is the best option? (recall the logistic regression lecture)

*******************************************************************

8. To do

  • GridSearch to find the best combination of parameters for the logistic regression model
  • Elbow method to get the best value for k (a rough starting sketch for both items is given below)
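A starting-point sketch; the parameter grid and the range of k values are arbitrary choices, not recommendations:

from sklearn.model_selection import GridSearchCV

# Grid search over the regularization strength of logistic regression
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)

# Elbow method: error rate on the test set for a range of k values
error_rate = []
for k in range(1, 31):
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    error_rate.append(1 - knn_k.score(X_test, y_test))
plt.plot(range(1, 31), error_rate, marker="o")
plt.xlabel("k")
plt.ylabel("Error rate")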

>>(Optional here) Visualizing coefficients of logistic regression (the selected model)<<

In case you want to visualize the coefficients of your final logistic regression model (recall the previous lectures on logistic regression).

# creating dataframe just for visualization
coeffs=pd.DataFrame() # empty dataframe
coeffs["coef"]=logR.coef_[0] # column with coefficients
coeffs["exp_coef"] = np.exp(coeffs["coef"]) # column with exp(coefficients)
coeffs["feature"]=df.columns[:-1].str.title() # converting strings to titlecase
coeffs = coeffs.sort_values(“exp_coef”, ascending = False) # sorting
# getting only top and bottom 5
top_5=coeffs.iloc[:5, :]
bottom_5=coeffs.iloc[-5:, :]
coeffs_for_plot = pd.concat([top_5,bottom_5])

So, we have the top 5 and bottom 5 features in the "coeffs_for_plot" dataframe; let's get bar plots for the coefficients and their exponents (odds ratios), recall the lecture on logistic regression behind the scenes!
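A possible plotting cell (the colors and layout are just one choice):

# Horizontal bar plots: raw coefficients (log-odds) and their exponents (odds ratios)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
axes[0].barh(coeffs_for_plot["feature"], coeffs_for_plot["coef"], color="steelblue")
axes[0].set_title("Coefficients (log-odds)")
axes[1].barh(coeffs_for_plot["feature"], coeffs_for_plot["exp_coef"], color="seagreen")
axes[1].axvline(1, color="red", linestyle="--")  # odds ratio of 1 means no effect
axes[1].set_title("exp(coefficients), i.e. odds ratios")
plt.tight_layout()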

Please consult the previous lectures on logistic regression to explain the coefficients!

Interpretation review (a tiny worked example follows this list):

  • (most common) All else held equal, for a 1 unit increase in a certain feature 𝑥1, an observation is 𝑒^𝛽1 = Some_value TIMES AS LIKELY to be the relevant class.
  • (less friendly but still correct) All else held equal, for a 1 unit increase in a certain feature 𝑥1, the log-odds of being the relevant class increase by 𝛽1 units (or decrease by −𝛽1 units).
  • (as a percentage) All else held equal, for a 1 unit increase in a certain feature 𝑥1, an observation is (𝑒^𝛽1 − 1)×100 = Some_value_% more likely (or Some_value_% less likely) to be the relevant class.
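A tiny worked example with a made-up coefficient (not taken from the fitted model):

beta1 = 0.7                 # hypothetical coefficient, for illustration only
odds_ratio = np.exp(beta1)  # ~2.01
pct_change = (odds_ratio - 1) * 100
print(f"+1 unit in x1 -> {odds_ratio:.2f} times as likely ({pct_change:.0f}% more likely)")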

Please remember, the model estimates all coefficients by taking all features into account. Keep this in mind when looking at the coefficients individually. We are only visualizing the top and bottom 5 in the plots above.

*******************************************************************

💐Click here to FOLLOW ME for new content💐

🌹Keep practicing to brush-up & add new skills🌹

✅🌹💐💐💐🌹✅ Please clap and share >> you can help us reach someone who is struggling to learn these concepts.✅🌹💐💐💐🌹✅

Good luck!

See you in the next lecture on “A39: Decision Tree and Random Forests — Theory”.

Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:

**************************************************************************************************************************************

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.

**************************************************************************************************************************************
