A33: Handling imbalanced classes in the dataset.

Imbalanced data, oversampling, SMOTE — Synthetic Minority Over-sampling Technique, Cohen's kappa, model performance and much more!

Junaid Qazi, PhD
13 min read · Jan 20, 2022

This article is a part of “Data Science from Scratch — Can I to I Can”, A Lecture Notes Book Series. (Click here to get your copy today!)

Click here for the previous article/lecture on “A32: Multi-class Classification using Logistic Regression.”


⚠️ A highly imbalanced public dataset on “bioassay” is used in this hands-on project.

✅ A Suggestion: Open a new jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful!”

🧘🏻‍♂️Topics to be covered:

  1. Imbalanced datasets and techniques to handle them
  2. The Bioassay Dataset
  • 2.1: Machine learning — imbalanced data
  • 2.2: Machine Learning — oversampled data
  • 2.3: Machine Learning — oversampled using SMOTE
  • 👉 Accuracy Score
  • 👉 Area under ROC
  • 👉 Cohen Kappa

3. Performance of the trained models on unseen data

4. Additional — Finding the right parameter

5. To Do

1. Imbalanced datasets and techniques to handle them

Class imbalance is a common problem in classification datasets: the number of datapoints or observations is not the same across the classes present in the target column.

Small differences are not a problem; however, there are cases where the dataset has extreme class imbalance, e.g.:

  • Disease screening => We got a dataset to develop a machine learning model that can screen CoVID-19 patients, and the class distribution is 5:95. Say we have 1000 observations: 50 CoVID-19 positive cases against 950 CoVID-19 negative cases.

Suppose we train our model on such a CoVID-19 dataset and are happy to see the classifier’s accuracy above 95% with minimal effort. Do you think we can trust a model trained on a dataset with a class distribution of 5:95? It’s an accuracy paradox, where the numbers are actually reflecting the underlying class distribution in the imbalanced dataset. Well, think about it: the baseline accuracy in this case is actually 95%!

Class imbalance in the dataset can cause a lot of frustration and needs to be treated. There are several options that we can think of to handle this issue.

  • 👉 Can we collect more data — sometimes, or even most of the time, not very easy, yet still one of the best solutions in the long run.
  • 👉 Can we generate synthetic data — relatively easy and cost-effective compared to collecting more data, however a little tricky. One of the most common techniques is SMOTE — Synthetic Minority Over-sampling Technique (2002), which creates synthetic data from the minority class instead of simply copying its instances.
  • 👉 Re-sampling, over- & under-sampling — we can think about creating copies of the minority class (over-sampling) or deleting instances of the majority class (under-sampling). It is important to remember that under-sampling loses information; one should consider this option carefully, most likely when we have thousands and thousands of class instances. Both strategies need to be compared along with different ratios of class representation in the data; the ideal is 1:1 for a binary classification problem. A minimal sketch of both follows after the note below.

🧘🏻‍♂️ Stay calm >> A very useful python library, imbalanced-learn, provides a range of re-sampling techniques that can be easily implemented and tested on imbalanced datasets. 🧘🏻‍♂️
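As a quick taste, here is a minimal, hedged sketch of random over- and under-sampling with imbalanced-learn; the toy dataset is mine, generated just for illustration:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# A toy imbalanced dataset: roughly 95% class 0, 5% class 1
X_toy, y_toy = make_classification(n_samples=1000, weights=[0.95], random_state=42)
print("Original:", Counter(y_toy))

# Over-sampling: duplicate minority-class rows until classes are 1:1
X_ros, y_ros = RandomOverSampler(random_state=42).fit_resample(X_toy, y_toy)
print("Over-sampled:", Counter(y_ros))

# Under-sampling: drop majority-class rows (this loses information!)
X_rus, y_rus = RandomUnderSampler(random_state=42).fit_resample(X_toy, y_toy)
print("Under-sampled:", Counter(y_rus))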

Along with the above techniques, we can think about:

  • testing different algorithms,
  • decomposing the majority class into smaller datasets with random sub-sampling and training several subsets using ensemble methods!

Well, we must try all possible and creative options!

People have their own experiences and they usually share them publicly, learn from them and see if you can get your work done!

Let’s get things done and do the required imports!
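A minimal set of imports for this notebook; these are the usual suspects in this series, add more as you need them:

# Required imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline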

2. The Bioassay Dataset

The drug-development process is both time-consuming and expensive; it takes an average of 15 years and over 800 million dollars to bring a drug to the market.

BioAssay is an analytical method to determine the concentration or potency of a substance by its effect on living animals/cells (in-vivo) or tissue/cell culture systems (in-vitro). This Bioassay dataset is a highly imbalanced dataset from the UCI machine learning repository. Here is a link to the original published article.

Here is a link to BioAssay and several other datasets >> github

Always good to understand your data well!

info() is a useful function to get an overview!
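A sketch to load the data and get an overview; the file name bioassay.csv is a placeholder, grab the actual file from the github link above:

# Reading the data - the file name below is a placeholder,
# use the actual BioAssay file from the github link above
df = pd.read_csv('bioassay.csv')

# Overview: number of rows, columns, dtypes and memory usage
df.info()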

So, we actually have 145 columns.

We can run a for loop to quickly see if there is any column with the missing data.

Let’s write a function for this purpose; you can reuse this function for any other dataset.
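A reusable sketch for such a function; the name missing_report is mine:

# A reusable helper (the name is mine) to report columns with missing data
def missing_report(data):
    """Print the columns that contain missing values, if any."""
    missing = data.isnull().sum()
    missing = missing[missing > 0]
    if missing.empty:
        print("No column with missing data!")
    else:
        print(missing)

missing_report(df)

# Class distribution in the target column
print(df['Outcome'].value_counts())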

No columns with missing data in the dataset >> 3375 inactive & 48 active >> extreme class imbalance indeed!

Should we visualize the class imbalance? I think yes!

We can grab any two features from the dataset and create a scatter plot showing active (red) and inactive (dark blue) in different colours. Another important thing: use alpha (the transparency parameter), since several datapoints might be overlapping!
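A sketch of such a scatter plot, using the MW and PSA columns (the same two features used later in this article):

# Scatter plot of two features, coloured by class, with transparency
plt.figure(figsize=(18, 6))
# Inactive class in dark blue
sns.scatterplot(data=df[df.Outcome == 'Inactive'], x='MW', y='PSA',
                s=200, alpha=0.4, color='DarkBlue')
# Active class in red
sns.scatterplot(data=df[df.Outcome == 'Active'], x='MW', y='PSA',
                s=200, alpha=0.4, color='Red')
plt.title("Red is active, dark blue is inactive");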

Red is active and blue is inactive >> visualizing the class imbalance makes it much easier to grasp!

Let’s work with the data as is and see how this class imbalance is impacting our machine learning algorithm!

2.1: Machine learning — imbalanced data

Let’s separate the features into X and the target into y, and then use our typical train_test_split function for data splitting!

# Separating features and the target
X = df.drop('Outcome', axis=1)
y = df.Outcome

Scaling the features is an important step. I will leave it to you to see the difference; try yourself and compare! A few lines to rescale the data are included (commented out) in the sketch below.

Splitting the data into train and test: out of the 48 active cases, 36 go into the training part with the settings used in the train_test_split sketch below.
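Here is a minimal sketch; test_size=0.25, stratify=y and random_state=42 are my assumptions that reproduce the 36-of-48 split:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Optional re-scaling - try yourself and compare the results
# scaler = StandardScaler()
# X = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)

# Splitting the data - settings are assumptions that send 36 of the
# 48 active cases into the training part
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)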

Let’s quickly check the percentage of the minority class in the training part of our dataset (X_train, y_train).
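One quick way to check:

# Percentage of each class in the training target
print(y_train.value_counts(normalize=True) * 100)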

1.5% active and 98.5% inactive class, right?

Try yourself for the (X_test, y_test) and see the class distribution!

#Try yourself
#y_test.value_counts() # see what is the situation in the test set!

>>🧘🏻‍♂️Model training on (X_train, y_train) 🧘🏻‍♂️<<

So, we decided to work with the imbalanced dataset at first. Let’s train a classical logistic regression model and see what the accuracy score looks like!

from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Creating a model instance
logR = LogisticRegression(max_iter=10000)

# Fitting the model
logR.fit(X_train, y_train)

# Accuracy Score
print("Accuracy Score for (X_train, y_train):", logR.score(X_train, y_train))
What do you think? The accuracy score is very high! Remember, our minority class is ~1.5%.

The numbers look amazing with accuracy ~98%. But the minority class is only 1.5%, which makes the baseline accuracy above 98%!

Can we trust the model or this is an Accuracy Paradox?

>>🧘🏻‍♂️Model evaluation on test dataset (X_test, y_test)🧘🏻‍♂️ <<

Moving forward, let’s make predictions from the trained model and see what the confusion matrix and classification report look like.

# Predictions
print("For: (X_test, y_test)\n")
pred = logR.predict(X_test)

# Confusion matrix and classification report
print("The confusion matrix:")
print(metrics.confusion_matrix(y_test, pred))
print("\nThe classification report:")
print(metrics.classification_report(y_test, pred))
WOW >> a model with over 98% training accuracy is not able to predict a single active case! Do you want to trust such a machine learning model?

==> Just check the recall and f1-score! >> (Please note, we have discussed these statistical measures in detail in our previous lecture, click here to refresh!)

You can clearly see the impact of class imbalance: the model has not generalized, and it looks like it has ONLY learned to predict the majority class!

This is not what we want! We actually want to predict the active cases, they are the real culprits.

2.2: Machine Learning — oversampled data

Let’s start with the simplest method of over-sampling the minority class: simply creating copies of observations from the selected class, sampling with replacement.

==> You can try under-sampling the majority class; however, it loses data (try yourself and compare the results).

# Over-sampling the minority class >> creating 3000 copies
over_sampled = df[df['Outcome'] == 'Active'].sample(3000, replace=True, random_state=42)

# Concatenating the over-sampled data
over_sampled = pd.concat([df, over_sampled])

So, we have a new data frame, over_sampled.
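A quick check on the new class distribution:

# Class distribution after over-sampling
print(over_sampled['Outcome'].value_counts())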

Well, nice balance in the class now!

Let’s visualize the class balance using a scatter plot (reusing the same code as in the scatter plot above).

>> alpha=0.4 << The red points still stand out; they are actually several overlapping data points, as we have created 3000 copies of the data points from the minority class.

Well, it’s creating copies of the same observations for the minority class, so we will not see more points in the scatter plot as they overlap; however, the overlapping points appear darker in their respective colour!

Let’s separate features and the target, as always for the machine learning.

Try scaling the features and re-run your model.
Getting training and the test datasets from our over-sampled dataframe!
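A sketch of both steps; the split settings are my assumptions, mirroring the earlier split:

# Separating features and the target from the over-sampled dataframe
X_os = over_sampled.drop('Outcome', axis=1)
y_os = over_sampled.Outcome

# Train/test split - settings are assumptions, mirroring the earlier split
X_os_train, X_os_test, y_os_train, y_os_test = train_test_split(
    X_os, y_os, test_size=0.25, stratify=y_os, random_state=42)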

>>🧘🏻‍♂️Machine learning model training (X_os_train, y_os_train)🧘🏻‍♂️<<

Let’s train another logistic regression model using the over-sampled dataset and see if there are any improvements.
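A minimal training sketch, mirroring the earlier model:

# A new model instance, trained on the over-sampled data
logR_os = LogisticRegression(max_iter=10000)
logR_os.fit(X_os_train, y_os_train)
print("Accuracy Score for (X_os_train, y_os_train):",
      logR_os.score(X_os_train, y_os_train))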

The accuracy score is lower this time. Remember your class distribution after over-sampling.

Anyhow, let’s evaluate the model.

>>🧘🏻‍♂️Model evaluation on test dataset (X_os_test, y_os_test)🧘🏻‍♂️ <<

Let’s get the predictions for the part of our dataset which is unseen to our developed machine learning model. We will get the confusion matrix and classification report as well!

# Predictions
print("Data is: (X_os_test, y_os_test)")
print("Over-sampled - copies of the data\n")
pred_os_test = logR_os.predict(X_os_test)

# Confusion matrix and classification report
print("The confusion matrix:")
print(metrics.confusion_matrix(y_os_test, pred_os_test))
print("\nThe classification report:")
print(metrics.classification_report(y_os_test, pred_os_test))
Compare the results with the model that was trained on imbalanced data.

The model looks much better and improved for the minority class; it has generalized. Also look at the recall and f1-score!

Can we try a different re-sampling approach that adds some variance along with new data points for the minority class? Well, of course! Let’s try SMOTE to create synthetic data!

2.3: Machine Learning — oversampled using SMOTE

If you have not installed imbalanced-learn yet, you can install it now.
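From a terminal, or prefixed with ! in a notebook cell:

pip install imbalanced-learn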

Earlier, we separated the features into X and the target into y from the original dataframe. Let’s import SMOTE() from the newly installed library and create an instance to generate synthetic data!
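A minimal sketch; the random_state is my assumption for reproducibility:

from imblearn.over_sampling import SMOTE

# Creating a SMOTE instance and generating synthetic minority-class data
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)

# New class distribution
print(pd.Series(y_smote).value_counts())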

X_smote, y_smote is our new dataset with synthetically created data for the minority class.
Looks good, we have balanced classes now!

As above, let’s visualize the class balance; the code to get the scatter plot is the same as for the scatter plots above!

# Let's visualize the balance! - Any two features to get a scatter plot
# Creating an intermediate dataframe for this visualization only
df_smote = X_smote.copy()
df_smote['Outcome'] = y_smote

plt.figure(figsize=(18, 6))
# Class 1 ==> Inactive
sns.scatterplot(data=df_smote[df_smote.Outcome == 'Inactive'], x='MW', y='PSA',
                s=200, alpha=0.4, color='DarkBlue')
# Class 2 ==> Active
sns.scatterplot(data=df_smote[df_smote.Outcome == 'Active'], x='MW', y='PSA',
                s=200, alpha=0.4, color='Red')
plt.title("Red is the over-sampled class using SMOTE");
Well, the data is not simply copies now; there is some added variance >> read the beginning of this article to understand how SMOTE works, a link to the original article is also given.

We can see that the synthetic samples are not just copies of the same observations anymore!

Once again, feature scaling and retraining the model is your To Do. We can do the data split 😊
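A sketch of the split, with the same assumed settings as before:

# Train/test split on the SMOTE data - settings are assumptions
X_smote_train, X_smote_test, y_smote_train, y_smote_test = train_test_split(
    X_smote, y_smote, test_size=0.25, stratify=y_smote, random_state=42)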

>>🧘🏻‍♂️Model training on (X_smote_train, y_smote_train) 🧘🏻‍♂️<<

Let’s get the model using oversampled data using SMOTE!
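A minimal training sketch, mirroring the earlier models:

# Training on the synthetic (SMOTE) data
logR_smote = LogisticRegression(max_iter=10000)
logR_smote.fit(X_smote_train, y_smote_train)
print("Accuracy Score for (X_smote_train, y_smote_train):",
      logR_smote.score(X_smote_train, y_smote_train))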

Much higher accuracy score using synthetic data!

>>🧘🏻‍♂️Model evaluation on test dataset (X_smote_test, y_smote_test)🧘🏻‍♂️ <<

Time to evaluate our model that is trained on synthetically over-sampled data using SMOTE.
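A sketch of the evaluation, reusing the metrics calls from above:

# Predictions on the held-out SMOTE test set
pred_smote_test = logR_smote.predict(X_smote_test)

print("The confusion matrix:")
print(metrics.confusion_matrix(y_smote_test, pred_smote_test))
print("\nThe classification report:")
print(metrics.classification_report(y_smote_test, pred_smote_test))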

This is a much better model; it has generalized, and the statistical metrics look good!

So, we have three models trained and tested on training and test parts of the data:

  • logR — trained on imbalanced data
  • logR_os — trained on oversampled (copies of the data points) data
  • logR_smote — trained on synthetically created data using SMOTE

Let’s move on and use these models to make predictions on another set of unseen data.

3. Performance of the trained models on UNSEEN data

🧘🏻‍♂️ Just to remind you, in reality we don’t have the labels available for unseen data.🧘🏻‍♂️

So, we also have a separate UNSEEN dataset that we can use to evaluate our trained models.

Let’s read the UNSEEN data and see the performance of our trained models.

👉 <<NOTE>> Remember, all the models that we have developed are not using scaled data, so you don’t need to scale the unseen data. However, this is not good practice; we should always scale the features and save the scaling transformation for later use (see the previous lectures).

So, we have trained three models:

  • logR on the original data
  • logR_os on the data after oversampling (creating copies) the minority class
  • logR_smote using SMOTE to create synthetic data

Let’s check their performance on the UNSEEN data.

>>🧘🏻‍♂️Accuracy score🧘🏻‍♂️<<

Re-create your notebook; writing code is a nice way to learn!
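Here is a sketch of the comparison; the file name bioassay_unseen.csv is a placeholder for the actual unseen-data file:

# Reading the UNSEEN data - the file name is a placeholder
df_unseen = pd.read_csv('bioassay_unseen.csv')
X_unseen = df_unseen.drop('Outcome', axis=1)
y_unseen = df_unseen.Outcome

# Accuracy of each trained model on the unseen data
for name, model in [('logR', logR), ('logR_os', logR_os),
                    ('logR_smote', logR_smote)]:
    print(name, "accuracy on unseen data:", model.score(X_unseen, y_unseen))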

👉 Results for the unseen data using the model trained on imbalanced dataset >>logR<<

👉 Results for the unseen data using the model trained on balanced dataset (oversampled) >>logR_os<<

👉 Results for the unseen data using the model trained on balanced dataset (SMOTE) >>logR_smote<<

If you compare the above results, you can see that the performance of the model trained on synthetically created (over-sampled) data using SMOTE is much better than the others. You can try feature scaling and re-run all the models, compare the results, and see if you can further improve your models.

>>🧘🏻‍♂️Area under ROC🧘🏻‍♂️<<

Changing the performance metric is helpful, and for general purposes, AUC-ROC is useful. (You can plot ROC curves as explained in the previous lectures.)
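A hedged sketch to compute the AUC-ROC of each model on the unseen data; treating 'Active' as the positive class is my choice:

# Binarize the true labels: 'Active' (the minority class) as positive
y_true = (y_unseen == 'Active').astype(int)

for name, model in [('logR', logR), ('logR_os', logR_os),
                    ('logR_smote', logR_smote)]:
    # Probability of the 'Active' class, located via model.classes_
    active_col = list(model.classes_).index('Active')
    proba = model.predict_proba(X_unseen)[:, active_col]
    print(name, "AUC-ROC:", metrics.roc_auc_score(y_true, proba))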

>>🧘🏻‍♂️Cohen Kappa🧘🏻‍♂️<<

Cohen’s kappa statistic is a useful measure that shows the agreement between two raters who each classify N items into C mutually exclusive categories. This is a score that expresses the level of agreement between two annotators (y_true and predictions in our case) on a classification problem.

Want to explore the formulas? See the definition section here.
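For reference, the standard definition is:

k = (p_o − p_e) / (1 − p_e)

where p_o is the observed agreement between the two raters and p_e is the agreement expected by chance.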

𝑘=1 shows complete agreement, whereas 𝑘=0 suggests there is no agreement other than what would be expected by chance. Kappa can even be negative, which shows the agreement is worse than random.

kappa is a useful but under-utilized metric that can handle imbalanced class problems. It basically tells you how much better your classifier is performing over the performance of a classifier that simply guesses at random according to the frequency of each class. Reference to the original article

Let’s compute the kappa metric from scikit-learn.
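A sketch using cohen_kappa_score from sklearn.metrics:

# Cohen's kappa for each model on the unseen data
for name, model in [('logR', logR), ('logR_os', logR_os),
                    ('logR_smote', logR_smote)]:
    pred_unseen = model.predict(X_unseen)
    print(name, "Cohen kappa:", metrics.cohen_kappa_score(y_unseen, pred_unseen))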

Well, the Kappa is zero for the model trained on imbalanced data!

👉 It is important to remember that imbalanced data is not easy to work with.

It is also tricky to decide on a specific evaluation metric for our machine learning model on such datasets; however, these metrics provide good guidelines on the reliability of our machine learning model.

  • An in-depth analysis of the confusion matrix, to know the correct and incorrect predictions, is also helpful.
  • Precision (the model’s exactness) and recall (the fraction of relevant outcomes that are successfully predicted) are also important metrics for guidance; the harmonic mean of both gives the f1-score.

The results on the unseen data show that re-sampling has helped to improve the model. Which sampling technique is the best? Well, I don’t have a straight answer to this question. ==> We need to try all possible techniques and tricks to see what works for the dataset under analysis.

We can try looking for different parameters along with using extensions of SMOTE given in To Do section!

Tree-based algorithms (Random Forests, Gradient Boosted Trees, …) work well on imbalanced datasets, as their hierarchical structure helps them learn the patterns from both classes. You can try them once we cover these topics in the coming lectures.

4. (Additional)-Finding the right parameter

(K and the sampling strategy in SMOTE)

k — the number of neighbours — and the sampling strategy are important parameters in SMOTE; we can try to optimize these parameters to see if we can improve the model performance. For this purpose, let’s try a nested for loop at the moment; we will learn GridSearch later on, this is just a heads-up!
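A minimal sketch of such a nested loop; the parameter grids below are my assumptions:

# Nested loop over SMOTE parameters - the grids are assumptions
for k in [1, 2, 3, 4, 5]:
    for strategy in [0.5, 0.75, 1.0]:  # minority:majority ratio after SMOTE
        sm = SMOTE(k_neighbors=k, sampling_strategy=strategy, random_state=42)
        X_sm, y_sm = sm.fit_resample(X, y)
        Xtr, Xte, ytr, yte = train_test_split(
            X_sm, y_sm, test_size=0.25, stratify=y_sm, random_state=42)
        model = LogisticRegression(max_iter=10000).fit(Xtr, ytr)
        # Score on the held-out part, 'Active' as the positive class
        active_col = list(model.classes_).index('Active')
        proba = model.predict_proba(Xte)[:, active_col]
        auc = metrics.roc_auc_score((yte == 'Active').astype(int), proba)
        print(f"k_neighbors={k}, sampling_strategy={strategy}, roc_auc={auc:.3f}")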

In the above example, we don’t see much improvement in the roc_auc score. You can explore other parameters and try different strategies (given in the To Do section) to improve your model; it’s a continuous process!

5. To Do

Try exploring other sampling options; most of them are extensions of SMOTE and can be easily implemented using the imbalanced-learn library. The links below will be helpful!

All done at the moment!


🌹Keep practicing to brush-up & add new skills🌹


Good luck!

See you in the next lecture on “A34: Handling Missing Data, Interaction Terms, Grid-Search, Model Training and Evaluation!”.

Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.

