A33: Handling imbalanced classes in the dataset.
Imbalanced data, oversampling, SMOTE (Synthetic Minority Over-sampling Technique), Cohen's kappa, model performance, and much more!
This article is a part of "Data Science from Scratch — Can I to I Can", a Lecture Notes Book Series. (Click here to get your copy today!)
⚠️ A highly imbalanced public dataset on "bioassay" is used in this hands-on project.
✅ A suggestion: Open a new Jupyter notebook and type the code while reading this article; doing is learning. And yes, please read the comments, they are very useful!
🧘🏻♂️Topics to be covered:
1. Imbalanced datasets and techniques to handle them
2. The Bioassay Dataset
- 2.1: Machine learning — imbalanced data
- 2.2: Machine learning — oversampled data
- 2.3: Machine learning — oversampled using SMOTE
- 👉 Accuracy score
- 👉 Area under ROC
- 👉 Cohen's kappa
3. Performance of the trained models on unseen data
4. Additional: Finding the right parameters
5. To Do
1. Imbalanced datasets and techniques to handle them
Class imbalance is a common problem in classification datasets: the number of observations is not the same across the classes in the target column.
Small differences are not a problem; however, there are cases where a dataset has extreme class imbalance, e.g.:
- Disease screening
Suppose we are given a dataset to develop a machine learning model that can screen CoVID-19 patients, and only 5% of the cases are CoVID-19 positive against 95% negative. Say we have 1000 observations (50 positive cases and 950 negative cases).
We train our model on this CoVID-19 dataset and are happy to see the classifier's accuracy above 95% with minimal effort. Do you think we can trust a model trained on a dataset with a class distribution of 5:95? It's an accuracy paradox, where the numbers merely reflect the underlying class distribution of the imbalanced dataset. Well, think about it: the baseline accuracy in this case is already 95%!
- Fraud detection
Another very practical example, where only a small fraction of fraudulent cases exists against the fair ones, sometimes even 1:1000 or 1:5000 (see: Learning from imbalanced data: open challenges and future directions).
Class imbalance in a dataset can cause a lot of frustration and needs to be treated. There are several options we can think of to handle this issue.
- 👉 Can we collect more data? Sometimes, or even most of the time, this is not very easy; still, it is one of the best solutions in the long run.
- 👉 Can we generate synthetic data? Relatively easy and cost-effective compared to collecting more data, but a little tricky. One of the most common techniques is SMOTE — Synthetic Minority Over-sampling Technique (2002), which creates synthetic data from the minority class instead of simply copying its instances.
- 👉 Re-sampling: over- & under-sampling. We can create copies of the minority class (over-sampling) or delete instances of the majority class (under-sampling). It is important to remember that under-sampling loses information; one should consider this option carefully, most likely when we have thousands and thousands of class instances. Both strategies need to be compared, along with different ratios of class representation in the data; the ideal is 1:1 for a binary classification problem.
🧘🏻♂️ Stay calm >> A very useful Python library, imbalanced-learn, provides a range of re-sampling techniques that can be easily implemented and tested on imbalanced datasets. 🧘🏻♂️
Along with the above techniques, we can think about:
- testing different algorithms,
- decomposing the majority class into smaller datasets with random sub-sampling and training several subsets using ensemble methods!
Well, we must try all possible and creative options!
People have their own experiences and usually share them publicly; learn from them and see if you can get your work done!
Let's get things done and do the required imports!
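A minimal import cell for this notebook; a sketch of the typical stack (pandas, numpy, matplotlib, and seaborn cover everything below, adjust to your setup):

# Required imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline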
2. The Bioassay Dataset
The drug-development process is both time-consuming and expensive; it takes an average of 15 years and over 800 million dollars to bring a drug to the market.
BioAssay is an analytical method to determine the concentration or potency of a substance by its effect on living animals/cells (in-vivo) or tissue/cell-culture systems (in-vitro). This bioassay dataset is a highly imbalanced dataset from the UCI Machine Learning Repository. Here is a link to the original published article.
Always good to understand your data well!
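A minimal sketch for loading and inspecting the data; the file name bioassay_train.csv is an assumption here, so use the name of the file you downloaded:

# Reading the bioassay data (assumed file name -- adjust to your copy)
df = pd.read_csv('bioassay_train.csv')

# Quick look at the data
print(df.shape)
df.head()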
So, we actually have 145 columns.
We can run a for loop to quickly see if any column has missing data.
Let's write a function for this purpose; you can reuse it for any other dataset.
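One possible version of such a reusable function (a sketch; the function name is my choice):

# Reusable check for missing values in any dataframe
def missing_report(data):
    total_missing = 0
    for col in data.columns:
        n_missing = data[col].isnull().sum()
        if n_missing > 0:
            print(f"{col}: {n_missing} missing values")
            total_missing += n_missing
    if total_missing == 0:
        print("No missing data found!")

missing_report(df)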
Should we visualize the class imbalance? I think yes!
We can grab any two features from the dataset and create a scatter plot showing active (red) and inactive (dark blue) observations in different colours. Another important thing: use alpha (the transparency parameter), as several data points might overlap!
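A sketch of such a plot, using MW and PSA (the same two features used in the SMOTE plot later in this article):

# Visualizing the class imbalance with any two features
plt.figure(figsize=(18,6))

# Majority class ==> Inactive (DarkBlue)
sns.scatterplot(data=df[df.Outcome=='Inactive'], x='MW', y='PSA',
                s=200, alpha=0.4, color='DarkBlue')

# Minority class ==> Active (Red)
sns.scatterplot(data=df[df.Outcome=='Active'], x='MW', y='PSA',
                s=200, alpha=0.4, color='Red')

plt.title("Red is the minority (Active) class");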
Let’s work with the data as is and see how this class imbalance is impacting our machine learning algorithm!
2.1: Machine learning — imbalanced data
Let's separate the features into X and the target into y, and then use our typical train_test_split function for data splitting!
# Separating features and the target
X = df.drop('Outcome', axis=1)
y = df.Outcome
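And the typical split (test_size and random_state here are illustrative choices):

from sklearn.model_selection import train_test_split

# Train/test split on the imbalanced data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)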
Scaling the features is an important step. I will leave it to you to see the difference; try it yourself and compare! The code below will be helpful.
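A sketch using StandardScaler (the choice of scaler is an assumption; any sklearn scaler works similarly):

from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only,
# then apply the same transformation to the test data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)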
Let's quickly check the percentage of the minority class in the training part of our dataset (X_train, y_train).
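One quick way to check (a sketch):

# Class distribution in the training target
print(y_train.value_counts())
print(y_train.value_counts(normalize=True) * 100)  # percentages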
Try it yourself for (X_test, y_test) and see the class distribution!
#Try yourself
#y_test.value_counts() # see what is the situation in the test set!
>>🧘🏻♂️ Model training on (X_train, y_train) 🧘🏻♂️<<
So, we decided to work with the imbalanced dataset first; let's train a classical logistic regression model and see what the accuracy score looks like!
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Creating the model instance
logR = LogisticRegression(max_iter=10000)

# Fitting the model
logR.fit(X_train, y_train)

# Accuracy score
print("Accuracy Score for (X_train, y_train):", logR.score(X_train, y_train))
Well, the numbers look amazing, with accuracy ~98%. But the minority class is only 1.5%, which makes the baseline accuracy above 98%!
Can we trust the model, or is this an accuracy paradox?
>>🧘🏻♂️ Model evaluation on test dataset (X_test, y_test) 🧘🏻♂️<<
Moving forward, let's make predictions with the trained model and see what the confusion matrix and classification report look like.
# Predictions
print("For: (X_test, y_test)\n")
pred = logR.predict(X_test)

# Confusion matrix and classification report
print("The confusion matrix:")
print(metrics.confusion_matrix(y_test, pred))
print("\nThe classification report:")
print(metrics.classification_report(y_test, pred))
Just check the recall and f1-score! (Please note: we discussed these statistical measures in detail in our previous lecture; click here to refresh!)
You can clearly see the impact of the class imbalance: the model has not generalized and looks like it has learned to predict ONLY the majority class!
This is not what we want! We actually want to predict the active cases; they are the real culprits.
2.2: Machine Learning — oversampled data
Let's start with the simplest method: over-sampling the minority class. We simply create copies of observations from the selected class, with replacement.
You can also try under-sampling the majority class; however, that loses data (try it yourself and compare the results).
# Over-sampling minority class >> creating 3000 copies
over_sampled = df[df['Outcome']=='Active'].sample(3000, replace=True, random_state=42)

# Concatenating the over-sampled data
over_sampled = pd.concat([df, over_sampled])
So, we have a new dataframe, over_sampled.
Let's visualize the class balance using a scatter plot (reusing the code from the scatter plot above).
Well, since we created copies of the same observations for the minority class, we will not see more points in the scatter plot; they overlap, but the overlapping points appear darker in their respective colour!
Let's separate the features and the target, as always for machine learning.
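A sketch, mirroring the earlier split (the _os names match the ones used below):

# Separating features and the target from the over-sampled dataframe
X_os = over_sampled.drop('Outcome', axis=1)
y_os = over_sampled.Outcome

# Train/test split on the over-sampled data
X_os_train, X_os_test, y_os_train, y_os_test = train_test_split(
    X_os, y_os, test_size=0.3, random_state=42)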
>>🧘🏻♂️ Machine learning model training (X_os_train, y_os_train) 🧘🏻♂️<<
Let's train another logistic regression model using the over-sampled dataset and see if there are any improvements.
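A sketch, reusing the same LogisticRegression settings as before:

# Creating and fitting a new model instance on the over-sampled data
logR_os = LogisticRegression(max_iter=10000)
logR_os.fit(X_os_train, y_os_train)

# Accuracy score
print("Accuracy Score for (X_os_train, y_os_train):",
      logR_os.score(X_os_train, y_os_train))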
The accuracy score is lower this time. Remember your class distribution after over-sampling.
Anyhow, let’s evaluate the model.
>>🧘🏻♂️ Model evaluation on test dataset (X_os_test, y_os_test) 🧘🏻♂️<<
Let's get predictions for the part of our dataset that is unseen to the developed model. We will get the confusion matrix and classification report as well!
# Predictions
print("For: (X_os_test, y_os_test)")
print("Over-sampled -- copies of the data\n")
pred_os_test = logR_os.predict(X_os_test)

# Confusion matrix and classification report
print("The confusion matrix:")
print(metrics.confusion_matrix(y_os_test, pred_os_test))
print("\nThe classification report:")
print(metrics.classification_report(y_os_test, pred_os_test))
The model looks much better and improved for the minority class; it has generalized. Also look at the recall and f1-score!
Can we try a different re-sampling approach that adds some variance along with new data points for the minority class? Well, of course; let's try SMOTE to create synthetic data!
2.3: Machine Learning — oversampled using SMOTE
If you have not installed imbalanced-learn yet, you can install it now.
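A one-time install from a notebook cell:

!pip install imbalanced-learn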
Earlier, we separated the features into X and the target into y from the original dataframe. Let's import SMOTE from the newly installed library and create an instance to generate synthetic data!
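A minimal sketch (default SMOTE settings; random_state is a choice):

from imblearn.over_sampling import SMOTE

# Creating the SMOTE instance and generating synthetic data
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X, y)

# The classes should be balanced 1:1 now
print(pd.Series(y_smote).value_counts())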
As above, let's visualize the class balance; the code for the scatter plot is the same as for the plots above!
# Let's visualize the balance! -- any two features to get a scatter plot
# Creating an intermediate dataframe for this visualization only
df_smote = X_smote.copy()
df_smote['Outcome'] = y_smote

plt.figure(figsize=(18,6))

# Class 1 ==> Inactive
sns.scatterplot(data=df_smote[df_smote.Outcome=='Inactive'], x='MW', y='PSA',
                s=200, alpha=0.4, color='DarkBlue')

# Class 2 ==> Active
sns.scatterplot(data=df_smote[df_smote.Outcome=='Active'], x='MW', y='PSA',
                s=200, alpha=0.4, color='Red')

plt.title("Red is the over-sampled class using SMOTE");
We can see that the synthetic samples are not just copies of the same observations now!
>>🧘🏻♂️ Model training on (X_smote_train, y_smote_train) 🧘🏻♂️<<
Let's train the model using the data oversampled with SMOTE!
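A sketch, following the same pattern as before:

# Train/test split on the SMOTE data
X_smote_train, X_smote_test, y_smote_train, y_smote_test = train_test_split(
    X_smote, y_smote, test_size=0.3, random_state=42)

# Creating and fitting the model instance
logR_smote = LogisticRegression(max_iter=10000)
logR_smote.fit(X_smote_train, y_smote_train)

# Accuracy score
print("Accuracy Score for (X_smote_train, y_smote_train):",
      logR_smote.score(X_smote_train, y_smote_train))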
>>🧘🏻♂️ Model evaluation on test dataset (X_smote_test, y_smote_test) 🧘🏻♂️<<
Time to evaluate our model that is trained on synthetically over-sampled data using SMOTE.
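A sketch, mirroring the earlier evaluations:

# Predictions
print("For: (X_smote_test, y_smote_test)\n")
pred_smote_test = logR_smote.predict(X_smote_test)

# Confusion matrix and classification report
print("The confusion matrix:")
print(metrics.confusion_matrix(y_smote_test, pred_smote_test))
print("\nThe classification report:")
print(metrics.classification_report(y_smote_test, pred_smote_test))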
So, we have three models trained and tested on training and test parts of the data:
- logR — trained on imbalanced data
- logR_os — trained on oversampled (copies of the data points) data
- logR_smote — trained on synthetically created data using SMOTE
Let's move on and use these models to make predictions on another set of unseen data.
3. Performance of the trained models on UNSEEN data
🧘🏻♂️ Just to remind you: in reality, we don't have labels available for unseen data. 🧘🏻♂️
So, we also have a separate UNSEEN dataset that we can use to evaluate our trained models.
Let’s read the UNSEEN data and see the performance of our trained models.
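A sketch; bioassay_unseen.csv is an assumed file name, so replace it with the name of your unseen data file:

# Reading the UNSEEN data
df_unseen = pd.read_csv('bioassay_unseen.csv')

# Separating features and the target (available here for evaluation only)
X_unseen = df_unseen.drop('Outcome', axis=1)
y_unseen = df_unseen.Outcome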
👉 <<NOTE>> Remember, none of the models we developed use scaled data, so you don't need to scale the unseen data here. However, this is not good practice; we should always scale the features and save the scaling transformation for later use (see the previous lectures).
So, we have trained three models:
- logR — on the original data
- logR_os — on the data after over-sampling (creating copies of) the minority class
- logR_smote — on synthetic data created using SMOTE
Let’s check their performance on the UNSEEN data.
>>🧘🏻♂️Accuracy score🧘🏻♂️<<
👉 Results for the unseen data using the model trained on the imbalanced dataset >>logR<<
👉 Results for the unseen data using the model trained on the balanced (over-sampled) dataset >>logR_os<<
👉 Results for the unseen data using the model trained on the balanced (SMOTE) dataset >>logR_smote<<
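A minimal sketch covering all three models at once:

# Accuracy of each trained model on the UNSEEN data
for name, model in [('logR', logR), ('logR_os', logR_os), ('logR_smote', logR_smote)]:
    print(name, ":", round(model.score(X_unseen, y_unseen), 4))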
If you compare the above results, you can see that the performance of the model trained with synthetically created (SMOTE over-sampled) data is much better than the others. You can try feature scaling, re-run all the models, compare the results, and see if you can further improve your models.
>>🧘🏻♂️Area under ROC🧘🏻♂️<<
Changing the performance metric is helpful, and for general purposes, AUC-ROC is useful. (You can plot ROC curves as explained in the previous lectures.)
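A sketch; roc_auc_score needs scores rather than labels, so we use predicted probabilities (column 1 of predict_proba corresponds to model.classes_[1]):

from sklearn.metrics import roc_auc_score

# AUC-ROC of each trained model on the UNSEEN data
for name, model in [('logR', logR), ('logR_os', logR_os), ('logR_smote', logR_smote)]:
    proba = model.predict_proba(X_unseen)[:, 1]
    print(name, ":", round(roc_auc_score(y_unseen, proba), 4))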
>>🧘🏻♂️Cohen Kappa🧘🏻♂️<<
Cohen's kappa statistic is a useful measure of the agreement between two raters who each classify N items into C mutually exclusive categories. It expresses the level of agreement between two annotators (y_true and the predictions, in our case) on a classification problem.
Want to explore the formulas? See the definition section here.
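For reference, Cohen's kappa is computed as 𝑘 = (pₒ − pₑ) / (1 − pₑ), where pₒ is the observed agreement between the two raters and pₑ is the agreement expected purely by chance.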
𝑘 = 1 shows complete agreement, whereas 𝑘 = 0 suggests no agreement other than what would be expected by chance. Negative values indicate agreement worse than random.
Kappa is a useful but under-utilized metric that can handle imbalanced class problems. It basically tells you how much better your classifier performs than a classifier that simply guesses at random according to the frequency of each class. Reference to the original article.
Let’s compute the kappa metric from scikit-learn.
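A sketch for all three models:

from sklearn.metrics import cohen_kappa_score

# Cohen's kappa of each trained model on the UNSEEN data
for name, model in [('logR', logR), ('logR_os', logR_os), ('logR_smote', logR_smote)]:
    pred_unseen = model.predict(X_unseen)
    print(name, ":", round(cohen_kappa_score(y_unseen, pred_unseen), 4))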
👉 It is important to remember that imbalanced data is not easy to work with.
It is also tricky to decide on a specific evaluation metric for our machine learning model on such datasets; however, these metrics provide good guidance on the reliability of the model.
- An in-depth analysis of the confusion matrix, to know the correct and incorrect predictions, is also helpful.
- Precision (the model's exactness) and recall (the fraction of relevant outcomes that are successfully predicted) are also important metrics for guidance; the harmonic mean of the two gives the f1-score.
The results on the unseen data show that re-sampling has helped to improve the model. Which sampling technique is the best? Well, I don't have a straight answer to this question. We need to try all possible techniques and tricks to see what works for the dataset under analysis. We can also try different parameters, along with the extensions of SMOTE given in the To Do section!
Tree-based algorithms (Random Forests, Gradient Boosted Trees, …) work well on imbalanced datasets, as their hierarchical structure helps them learn patterns from both classes. You can try them once we cover these topics in the coming lectures.
4. (Additional) Finding the right parameters (k and the sampling strategy in SMOTE)
k (the number of neighbours) and the sampling strategy are important parameters in SMOTE; we can try to optimize them to see if we can improve model performance. For this purpose, let's use a nested for loop for the moment; we will learn GridSearch later on, this is just a heads-up!
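A sketch of such a nested loop; the value ranges below are illustrative choices, not tuned recommendations. Each resampled dataset is split and scored with roc_auc:

from imblearn.over_sampling import SMOTE
from sklearn.metrics import roc_auc_score

# Simple grid over SMOTE parameters
for k in [1, 3, 5, 7]:
    for strategy in [0.5, 0.75, 1.0]:
        smote_k = SMOTE(k_neighbors=k, sampling_strategy=strategy, random_state=42)
        X_res, y_res = smote_k.fit_resample(X, y)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_res, y_res, test_size=0.3, random_state=42)
        model = LogisticRegression(max_iter=10000).fit(X_tr, y_tr)
        proba = model.predict_proba(X_te)[:, 1]
        print("k =", k, "| strategy =", strategy,
              "| roc_auc =", round(roc_auc_score(y_te, proba), 4))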
In the above example, we don't see much improvement in the roc_auc score. You can explore other parameters and try different strategies (given in To Do) to improve your model; it's a continuous process!
5. To Do
Try exploring other sampling options; most of them are extensions of SMOTE and can be easily implemented using the imbalanced-learn library. The links below will be helpful!
- Over-sampling using imbalanced-learn
- Over-sampling using the Adaptive Synthetic (ADASYN) algorithm — Original Article
- Over-sampling using Borderline SMOTE — Original Article
- Over-sampling using SVM-SMOTE — Original Article
- Do extensive EDA, learn about the data, and see if you can improve the model.
All done at the moment!
🌹Keep practicing to brush up & add new skills🌹
Good luck!
See you in the next lecture on “A34: Handling Missing Data, Interaction Terms, Grid-Search, Model Training and Evaluation!”.
Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:
Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.