A34: Handling Missing Data, Interaction Terms, Grid-Search, Model Training and Evaluation
Missing data, listwise vs pairwise deletion, single imputation, model-based imputation, complete case analysis, feature engineering, grid-search, machine learning model training and evaluation — A step-by-step tutorial to predict kidney disease!
This article is a part of “Data Science from Scratch — Can I to I Can”, A Lecture Notes Book Series. (click here to get your copy today!)
Click here for the previous article/lecture on “A33: Handling imbalanced classes in the dataset.”
⚠️ We will be using a real research dataset in this project. You will see "To Do" tasks as learning activities.
✅ A suggestion: Open a new Jupyter notebook and type the code while reading this article; doing is learning. And yes, PLEASE read the comments, they are very useful!
Just for fun!
Your objective:
🧘🏻♂️==> Fundraising? ======> It's AI
🧘🏻♂️==> Hiring? ============> It's Machine Learning
🧘🏻♂️==> Implementing? ====> It's Regression (Linear/Logistic)
Well, in most cases, it's TRUE, right….😊?
Predicting Chronic Kidney Disease (CKD)
******************************************************************
- Typically, the process pipeline for any data science project involves:
- 🧘🏻♂️ Problem definition
- 🧘🏻♂️ Data collection
- 🧘🏻♂️ Exploratory data analysis and preprocessing
- 🧘🏻♂️ Model training and evaluation
- 🧘🏻♂️ Communicating the answer — presentation, reports and/or deployment
It's mostly iterative, and we can add more steps and/or sub-steps to the pipeline.
*******************************************************************
Well, this is typical life in the data science business. Data science is a continuous process; a model may not work well on new data and may need major updates somewhere in the pipeline, most likely in preprocessing.
This article is arranged in the following format with the headings given below!
1. Problem definition
2. Obtain the data
3. Exploratory data analysis and preprocessing
- 3.1: Missing data
- 3.2: Few thoughts and possible reasons for missing data
- 3.3: Techniques to deal with the missing data
- >> Listwise deletion <<
- >> Pairwise deletion <<
- >> Single imputation methods <<
- >> Model-based techniques — Advanced <<
- 3.4: Complete case analysis
- 3.5: Data preprocessing
- 3.6: Dealing with the missing data
- 3.7: Creating interaction terms — feature engineering
- 3.8: Creating dummies
4. Model training and evaluation
- 4.1: Grid search
- 4.2: Best model evaluation
- 4.3: Model coefficients
- 4.4: ROC curve — specificity vs sensitivity
5. Communicating the answer — presentation, reports and/or deployment
6. To do
*******************************************************************
1. Problem definition
CKD is one of the leading causes of death around the globe and costs the global health care system a significant amount of money. One of the major challenges with CKD is that it usually doesn't show symptoms and can damage the kidneys silently. People with early kidney disease may not know anything is wrong; they can't feel the damage before kidney function is lost. It happens slowly, and in stages. Early detection with the right treatment can slow kidney disease from getting worse.
So this is a real-world problem, and experts are spending resources to develop a medical diagnostic test that is better than our current diagnosis system for CKD. The existing clinical data from CKD patients could play a vital role in developing a machine learning algorithm that predicts CKD at an early stage in high-risk individuals (diabetes, high blood pressure, family history, and age over 65 years). Most of the time, three simple laboratory tests measuring the amount of waste in the blood, protein in the urine and blood pressure are used for screening.
>>Data science problem could be:<<
Develop a machine learning algorithm that can predict the early stage CKD with high accuracy (reduces both the number of false positives and the number of false negatives).
Let's work with a dataset on CKD; this dataset was shared by Dr. P. Soundara Pandian, a Senior Consultant Nephrologist.
As always, the required imports are a first step while coding!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')  # just optional!
%matplotlib inline
# Retina display to see better quality images.
%config InlineBackend.figure_format = 'retina'
# The lines below are just to ignore warnings
import warnings; warnings.filterwarnings('ignore')
*******************************************************************
2. Obtain the data
Let's read the data directly from GitHub; there are several other datasets there for your practice!
ckd = pd.read_csv("https://raw.githubusercontent.com/junaidqazi/DataSets_Practice_ScienceAcademy/master/chronic_kidney_disease.csv")
There are 25 columns in the dataset (11 numeric, 14 nominal): 24 features and 1 class column (the target).
👉 01: age
- age (numerical) years
👉 02: bp
- blood pressure (numerical) mm/Hg
👉 03: sg
- specific gravity (nominal) (1.005,1.010,1.015,1.020,1.025)
👉 04: al
- albumin (nominal) (0,1,2,3,4,5)
👉 05: su
- sugar (nominal) (0,1,2,3,4,5)
👉 06: rbc
- red blood cells (nominal) (normal,abnormal)
👉 07: pc
- pus cell (nominal) (normal,abnormal)
👉 08: pcc
- pus cell clumps (nominal) (present,notpresent)
👉 09: ba
- bacteria (nominal) (present,notpresent)
👉 10: bgr
- blood glucose random (numerical) mgs/dl
👉 11: bu
- blood urea (numerical) mgs/dl
👉 12: sc
- serum creatinine (numerical) mgs/dl
👉 13: sod
- sodium (numerical) mEq/L
👉 14: pot
- potassium (numerical) mEq/L
👉 15: hemo
- hemoglobin (numerical) gms
👉 16: pcv
- packed cell volume (numerical)
👉 17: wbcc
- white blood cell count (numerical) cells/cubic millimeter
👉 18: rbcc
- red blood cell count (numerical) millions/cubic millimeter
👉 19: htn
- hypertension (nominal) (yes,no)
👉 20: dm
- diabetes mellitus (nominal) (yes,no)
👉 21: cad
- coronary artery disease (nominal) (yes,no)
👉 22: appet
- appetite (nominal) (good,poor)
👉 23: pe
- pedal edema (nominal) (yes,no)
👉 24: ane
- anemia (nominal) (yes,no)
🎯 25: class
- class (nominal) (ckd,notckd)
*******************************************************************
3. Exploratory data analysis (EDA) and preprocessing
At this stage of the course, you must have learned the importance of EDA: always spend time understanding the data and discuss it with domain experts for clarification.
Indeed, exploratory data analysis is key, and we also want to check the distribution of each variable in our dataset. It might be the case that a variable overwhelmingly takes a certain value, which may not be useful for predictions.
Do you think a variable that is all 1s or all 0s has any predictive power?
From the dataset information, we also notice that several variables have "blood" in their name; is there any relationship between them?
If a certain set of features is correlated, we might think about including interaction terms for them, or removing one while building a linear model. (Recall, logistic regression is a class of linear models.)
👉 Anyhow, let's focus on one of the main topics of this lecture and learn ways to handle missing data (of course, the best strategy is to re-collect what is missing; however, that is not easy in most cases, and we end up handling the missing data in some clever way)!
# head and info are useful functions, try yourself!
ckd.head()
ckd.info()  # you will see there is missing data in your ckd dataset!
# Try describe as well
ckd.describe()  # summary statistics, excluding NaN values
3.1: Missing data
There are missing values in the dataset (you will see them using info() and head()). It's better to look at the numbers as a % of missing data and think about strategies based on domain knowledge.
<<Remember — Domain knowledge is important>>
Usually, in clinical datasets, missing data also includes cases where tests are not required for some stated reason, e.g. the patient did not meet a certain threshold.
Patterns in the missing data are useful; it's very important to understand and explore the reasons for such patterns.
✅✅ Wait, this article I wrote sometime ago could be helpful to understand the importance of missing data “Survivorship Bias — A Danger Zone (Think again, do you really want to miss what is missing!)”
In the dataset, we can also see that the classes are not balanced: out of 400 observations, we have 250 ckd and 150 notckd. The ckd:notckd ratio is 5:3, so notckd is around 37.5% of the total data. Well, our focus is to learn to handle missing data; this much imbalance is manageable and should not be a big issue if we can fix the missing data. A good start is to think about splitting the data into train and test parts with a 60:40 ratio.
<<TO DO>> Later, you can think about oversampling the minority class using different algorithms that we learned in the previous lecture and see if we can further improve the model.
Well, let’s move on and look at the percentage of missing data first!
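A minimal sketch of how one could compute those percentages (assuming the dataframe is loaded as ckd, as above):
# Percentage of missing values per column, sorted from most to least missing
missing_pct = 100 * ckd.isnull().sum() / len(ckd)
print(missing_pct.sort_values(ascending=False))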
Each column has some missing data (other than class, which is the target); in particular, the rbc, wbcc, and rbcc columns each have more than 25% missing data.
Typically, any column with over 10% or 15% missing data is a concern; it should be addressed to avoid problems and for better performance and reliability of the machine learning model.
3.2: Few thoughts and possible reasons for missing data
The missing data issue is not something to simply leave as is. We must understand the reasons to know whether we really want to miss what is missing. Sometimes, missing data could be very useful data! — A good read.
There could be several reasons for the missing data:
- Data is intentionally missing as a part of data collection
- Random data collection issues
- Some social and/or a natural processes
- dropout, death, graduation
- Certain pattern
- certain tests need to be done only for those above an age threshold
- questions that are only relevant if the participant is married and female
- Refusal/no-response from the participant to answer
Having said that, domain knowledge is key in the data science process pipeline, especially during data collection. There is a good chance we already have some idea of the probability of certain features being missing. We can sometimes think about filling them in using information from other available features.
Based on the business understanding, if we already know that certain features are likely to be missing, it is always a good idea to also collect the ones that can help overcome this issue (a backup plan!) >> Refine the data collection mechanism …… !
- A typical example: people with high income may not be happy to report the numbers; however, variables such as years of education and/or number of investments could help with an indirect estimate!
3.3: Techniques to deal with the missing data
The rule of thumb is to start with what you know and think about the analysis strategy that yields the least biased estimates from your data. Talk to the domain experts about your strategies and refine them based on their advice.
These are few common strategies to handle the missing data:
>>🧘🏻♂ listwise deletion (complete case analysis)🧘🏻♂️<<
- We drop a complete observation if any value is missing. This assumes the data are MCAR (Missing Completely At Random); otherwise it leaves bias in the dataset. It also reduces statistical power, which relies on sample size.
- It is simple and makes it easy to compare analyses across features.
>>🧘🏻♂️pairwise deletion🧘🏻♂️<<
- We keep all the available cases in which our variable of interest is present.
- Our sample could be different for different variables, which makes it difficult to compare analyses because the sample differs every time.
>>🧘🏻♂️Single imputation methods🧘🏻♂️<<
- mean, median or mode substitution — a kind of complete case analysis. It ignores the relationships between variables, weakens covariance and correlation estimates, and artificially reduces variability in the data.
- dummy variable control (1 if missing, 0 if available) — one advantage is that we use all the available information; however, this can result in biased estimates (if the missingness is a legitimate skip, there is no bias).
- simple regression — we use the information from the available data and a model fit to estimate the missing values. This can overestimate the model fit and correlation estimates while weakening the variance.
>>🧘🏻♂️Model-based techniques — Advanced🧘🏻♂️<<
- Maximum likelihood — estimates the value most likely to be observed by identifying the set of parameters that produces the highest log-likelihood
- Multiple imputation using specified regression models — n repetitions produce a separate dataset each time — more accurate variability — cumbersome coding
- Model based clustering
Want to know more? Search for model-based imputation methods and you will find nice publications.
- Missing data: Our view of the state of the art (2002) by Joseph L. Schafer and John W. Graham is a great read. 2nd link for a pdf copy
- Scikit-learn provides useful modules to impute the missing values as well!
It is also good to search for these terms to understand the terminology around missing data — MCAR (Missing Completely at Random), MAR (Missing at Random), and NMAR (Missing Not at Random)! You can also find explanations in any data mining book!
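For example, scikit-learn's SimpleImputer covers the single-imputation strategies mentioned above. A minimal, standalone sketch (not the per-class strategy we use later; it assumes the numeric columns were parsed as numbers, and X_numeric_imputed is just an illustrative name):
from sklearn.impute import SimpleImputer

# Median imputation for the numeric columns; strategy='most_frequent' would work for categoricals
imp = SimpleImputer(strategy='median')
X_numeric_imputed = imp.fit_transform(ckd.select_dtypes(include=np.number))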
I think, this is enough on missing data and ways to handle it.
3.4: Complete case analysis
Let's look at our ckd data and see what it looks like if we try listwise deletion -- a complete case analysis!
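A hedged sketch of that check (assuming the target column is still named 'class' at this point, with values ckd/notckd):
# Listwise deletion: keep only the rows with no missing values at all
ckd_complete = ckd.dropna()
print("Rows remaining:", ckd_complete.shape[0], "out of", ckd.shape[0])
print(ckd_complete['class'].value_counts(normalize=True))  # class balance after dropping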
Well, the above numbers are not very encouraging. We can't really trust a model trained on complete cases only, as we are losing more than 60% of the data in this situation. The class balance got worse as well: now only ~27.2% of the remaining observations are ckd, compared to (250/400)*100 = 62.5% in the original data. In fact, most of the data loss is from the ckd class (100*(250-43)/250 = 82.8%).
It is never advisable to lose data, and in this situation we would be losing a lot of it. We must consider expert opinion: what if the missing data in certain columns has some meaning!
To Do: Try to train a model on the complete-case data. You will need to create dummy variables for the categorical features before training. You can also try oversampling the minority class in a separate attempt.
3.5: Data preprocessing
We could do extensive EDA to understand the data well. At the moment, let's focus more on preprocessing and think about handling the missing data, feature engineering, interactions, creating dummies, etc.
Let's start by converting the target (the class column) from notckd/ckd to 0/1. We can also change the name of the class column to target; class is a keyword in Python and we don't want any confusion.
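A minimal sketch of those two steps (assuming the raw labels are exactly 'ckd' and 'notckd'; the real file may need a little label cleaning, such as stripping whitespace, first):
# Convert the class labels to 1/0 and rename the column to 'target'
ckd['class'] = ckd['class'].map({'ckd': 1, 'notckd': 0})
ckd = ckd.rename(columns={'class': 'target'})
print(ckd['target'].value_counts())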
3.6: Dealing with the missing data
Well, this can be time consuming and a little complex; let's stay simple and think about filling in the data to create our first prototype. We can always refine our models later for better performance; it's a continuous process!
The quick solution is to use single imputation methods. We know we have two different classes, so we can fill the numeric columns with the mean of each class and the categorical columns with the mode of each class.
We need to separate the data into two dataframes; that keeps things simple, and we can merge them into one again later!
# separating data based on the target class
target_1 = ckd[ckd['target']==1]
target_0 = ckd[ckd['target']==0]
Let’s fill the missing values in all numeric columns with their means!
# This will fill all the numeric columns with the class mean
# (numeric_only=True keeps .mean() from complaining about the categorical columns)
target_1.fillna(target_1.mean(numeric_only=True), inplace=True)
target_0.fillna(target_0.mean(numeric_only=True), inplace=True)
We can re-confirm now and see whether there is still missing data in the numeric columns.
Let's grab the list of all categorical columns and fill their missing data with the most frequent value — the mode.
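A sketch of both steps, assuming the two per-class dataframes created above (cat_cols is just an illustrative name; mode()[0] assumes each column has at least one observed value per class):
# Re-check the numeric columns (there should be no missing values left there)
print(target_1.isnull().sum())

# Fill the categorical (object) columns with the per-class mode
cat_cols = target_1.select_dtypes(include='object').columns
for col in cat_cols:
    target_1[col].fillna(target_1[col].mode()[0], inplace=True)
    target_0[col].fillna(target_0[col].mode()[0], inplace=True)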
So, we have filled all the missing values according to our defined strategy. Let's concatenate the dataframes for both classes now!
ckd_filled = pd.concat([target_0,target_1])
3.7: Creating interaction terms — feature engineering
Considering that red and white blood cell counts (rbcc & wbcc) are related, we can create an interaction term for them. Similarly, there could be a correlation between pus cells and pus cell clumps (pc & pcc), so we can think about creating an interaction term from these two columns as well. (One can at least do this kind of visual scanning based on limited understanding and domain knowledge; however, expert opinion is key for data understanding.)
rbcc and wbcc are both numeric columns, so we can simply create their interaction column by multiplication: rbcc x wbcc.
# interaction for red and white blood cell count
ckd_filled['wbcc_rbcc_interaction'] = ckd_filled['wbcc'] * ckd_filled['rbcc']
pc and pcc are both categorical columns; we can create a new column with a condition so that it is 1 if the pus cells are abnormal and the clumps are present, otherwise 0.
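A minimal sketch of that condition (assuming the categorical values are spelled 'abnormal' and 'present' as in the data description; the column name pc_pcc_interaction is my choice):
# 1 if pus cells are abnormal AND pus cell clumps are present, otherwise 0
ckd_filled['pc_pcc_interaction'] = ((ckd_filled['pc'] == 'abnormal') &
                                    (ckd_filled['pcc'] == 'present')).astype(int)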
3.8: Creating dummies
Let's separate the features into X and the target into y, and then get the dummies for all categorical features.
<<To Do>> Please note, we created the interaction terms above; however, we are not using them in our first prototype, so we drop them while building the features in X. You can try re-training your model with all columns and compare the results, a good practice!
It is recommended NOT to remove any data without consulting a subject matter specialist; as newcomers to the field, we are going to include all the data (other than the interaction terms, keeping only their original columns) for our first prototype. If we see evidence of over-fitting, we can play with the features and re-train the model for better generalization.
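A minimal sketch of this step (the interaction column names are the ones from the sketches above; drop_first is my addition to avoid redundant dummy columns):
# Features in X, dropping the target and the interaction terms for the first prototype
X = ckd_filled.drop(columns=['target', 'wbcc_rbcc_interaction', 'pc_pcc_interaction'])
y = ckd_filled['target']

# One-hot encode all remaining categorical features
X = pd.get_dummies(X, drop_first=True)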
<<To do>> Scaling the features is important, and a good practice is to save the scaling transformation. We learned this in previous lectures, so I will leave it to you as a to-do. See if you can make some improvements in your model's performance.
*******************************************************************
4. Model training and evaluation
We already know that we have class imbalance in the dataset, but it is not very extreme. We can try a 60:40 split for the train and test parts.
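A sketch of that split (test_size=0.4 follows the 60:40 idea above; random_state and stratify are my additions, for reproducibility and to preserve the class balance):
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4,
                                                    random_state=42, stratify=y)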
4.1: Grid Search
In the previous lecture, we tried to find the best values for the selected parameters using a for loop; however, the standard way is to do a grid search. Scikit-learn conveniently provides a module for this purpose — GridSearchCV.
We will learn more about grid search along the way, especially with SVMs. Let's try to find the best parameters for our logistic regression algorithm.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV # need this import
Let’s define parameters for our grid-search in a dictionary.
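For example, a hypothetical grid over the penalty type and regularization strength (the exact grid in the original notebook may differ; logreg_grid is just an illustrative name):
# Parameter grid for logistic regression; liblinear supports both l1 and l2 penalties
logreg_grid = {'penalty': ['l1', 'l2'],
               'C': np.logspace(-3, 3, 7),
               'solver': ['liblinear']}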
Now we need to create an instance of GridSearchCV and pass the estimator (LogisticRegression in this case) along with the grid we defined above. We can either call the typical fit() function at the end of the same line or in a separate line; I am doing everything in one go!
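A sketch of that one-liner, assuming the logreg_grid dictionary above (cv=5, accuracy scoring and max_iter are my assumptions; gs_results is the name used in the evaluation code below):
# Grid search with 5-fold cross-validation, fitting in the same line
gs_results = GridSearchCV(estimator=LogisticRegression(max_iter=1000),
                          param_grid=logreg_grid,
                          cv=5, scoring='accuracy').fit(X_train, y_train)

print("Best parameters:", gs_results.best_params_)
print("Best cross-validated accuracy:", gs_results.best_score_)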
So, we got the best set of parameters for our selected model. The accuracy score can also be seen in the above output.
Let's move on and evaluate the best model that we found in the grid search!
4.2: Best Model evaluation
Let's grab the best estimator and evaluate its performance on the test data.
from sklearn import metrics

# the best model from the grid search
best_model = gs_results.best_estimator_

print("\nThe test dataset:\n")
print("Accuracy score is:", best_model.score(X_test, y_test))
print("Classification report:")
print(metrics.classification_report(y_test, best_model.predict(X_test)))

# For the confusion matrix, let's grab the individual numbers this time.
# Tuple unpacking to get tn, fp, fn, and tp
tn, fp, fn, tp = metrics.confusion_matrix(y_test, best_model.predict(X_test)).ravel()  # ravel() is a useful function, read its documentation!
print("True Negatives: " + str(tn))
print("False Positives: " + str(fp))
print("False Negatives: " + str(fn))
print("True Positives: " + str(tp))
Well, instead of grabbing the best model (as above), we can also use the best parameters and re-train a fresh model instance.
Let's grab the parameters and train a model instance using the best parameters.
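Something along these lines (logR_ is the name the later snippets use; max_iter is my addition to help convergence):
# Re-train a fresh logistic regression instance using the best parameters found above
logR_ = LogisticRegression(**gs_results.best_params_, max_iter=1000)
logR_.fit(X_train, y_train)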
So, the final model is logR_, which we can serialize (how? click here). (Remember, for the production model you must use the complete dataset (X, y), not just the train part!)
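A hedged sketch of that final step using joblib (the filename is hypothetical):
import joblib

# Re-fit on ALL the available data before saving the production model
logR_final = LogisticRegression(**gs_results.best_params_, max_iter=1000).fit(X, y)
joblib.dump(logR_final, 'ckd_logistic_regression.joblib')  # load later with joblib.load(...)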
4.3: Model coefficients
(Reminder from the previous lecture) Despite being a relatively simple model, logistic regression is widely used for real-world problems. The coefficients are interpretable, and we can understand how the features X affect the target y. Another advantage of logistic regression is that it usually does not suffer from high variance, thanks to the large number of simplifying assumptions placed on the model (i.e. the features are "linear in the logit", the errors are independent and follow a Bernoulli distribution, a discrete distribution having two possible outcomes, etc.).
Let’s try to look at the model coefficients.
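One way to inspect them is to turn the log-odds coefficients into odds ratios (a sketch using the names defined above; coef_table is just an illustrative name):
# Coefficients (log-odds) and their exponentials (odds ratios) for each feature
coef_table = pd.DataFrame({'feature': X.columns,
                           'coefficient': logR_.coef_[0],
                           'odds_ratio': np.exp(logR_.coef_[0])})
print(coef_table.sort_values('odds_ratio', ascending=False))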
So, for a unit increase in serum creatinine (sc), an individual is ~1.19 times more likely to be CKD positive, holding all other features constant. Similarly, individuals with higher albumin (al) and blood pressure (bp) are also at higher risk.
4.4: ROC curve — specificity vs sensitivity
Moving forward, we may want to find some combination of specificity and sensitivity. This is ROC analysis, where the trade-off between specificity and sensitivity is explored as a trade-off between TPR and FPR (that is, recall and fall-out).
- Focusing only on sensitivity (probability of positive test given that patient has a disease) means we minimize false negatives. This means there will be few people we incorrectly predict to be healthy, but more people we incorrectly predict to be sick. A negative result in a test with high sensitivity is useful for ruling out disease.
- Focusing only on specificity (probability of negative test given that patient is well) means we minimize false positives. This means there will be few people we incorrectly predict to be sick, but more people we incorrectly predict to be healthy. A positive result in a test with high specificity is useful for ruling in disease.
Let's work with data the model has not seen (the test part in this case).
# predicting probabilities for the test data
prob_test_set = logR_.predict_proba(X_test)
print("Probabilities are computed in variable 'prob_test_set'")

# Computing Area Under the ROC Curve from prediction scores.
# prob_test_set[:, 1] gives the predicted probability of class 1 (ckd) for each observation.
AUC_ROC = metrics.roc_auc_score(y_test, prob_test_set[:, 1])  # <shift + tab> to learn more
print('Area Under ROC Curve: %.3f' % AUC_ROC)

# Computing the Receiver Operating Characteristic (ROC) curve
fpr, tpr, thresholds = metrics.roc_curve(y_true=y_test, y_score=prob_test_set[:, 1])
print("fpr, tpr and probability thresholds are computed...!")

print("\nPlotting ROC curve......!")
plt.figure(figsize=(18, 6))  # setting the figure size

# plot no skill: a dotted line for the random-guess baseline
plt.plot([0, 1], [0, 1], ls='--', lw=3, label='Random guess')

# plot the ROC curve for the model
plt.plot(fpr, tpr, marker='.', label='ROC - Area Under The Curve: %.3f' % AUC_ROC)

# let's set the limits to (0, 1); it is also good to put a title and labels
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.title('ROC curve for the Logistic Regression model (logR_) for test data.', fontsize=20)
plt.ylabel('TPR - Recall - Sensitivity', fontsize=16)
plt.xlabel('FPR - Fall-out - (1 - Specificity)', fontsize=16)

# putting the legend
plt.legend(fontsize=18)
print("ROC plot is ready......!")
I feel that sensitivity is more important here; I would rather tell a patient that they are at high risk for CKD (CKD positive) and be wrong in the end than tell them they are healthy and put them in danger by missing the disease! Expert opinion is advised for this analysis.
Some suggestions might be useful:
- Optimizing sensitivity is better than optimizing specificity
- Optimizing the f1-score is a good option, as it combines precision and sensitivity (recall)
- (a little more advanced) One can optimize a custom metric that weighs sensitivity somewhat more heavily than specificity
- Careful analysis of the ROC curve helps to find a place where sensitivity is very high and (1 - specificity) is fairly low (see the sketch below)!
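For that last point, a small sketch of how one could scan the ROC output for such an operating point (the 0.95 sensitivity floor is an arbitrary example; fpr, tpr and thresholds come from the roc_curve call above):
import numpy as np

# Among thresholds giving sensitivity (TPR) of at least 0.95, pick the one with the lowest FPR
mask = tpr >= 0.95
best_idx = np.argmin(fpr[mask])
print("Chosen threshold:", thresholds[mask][best_idx],
      "TPR:", tpr[mask][best_idx], "FPR:", fpr[mask][best_idx])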
What do you think about the above options?
*******************************************************************
5. Communicating the answer — presentation, reports and/or deployment
Well, plots are ready, you have answers to the questions for your client……….What next?
Indeed, the key and essence of any project is presenting the outcomes to technical and non-technical people for their advice and their business goals. This is the most important skill and needs a good understanding of the whole data science pipeline along with presentation skills. We must train ourselves with test/dummy presentations. One can also improve these skills by attending workshops.
*******************************************************************
6. To do
- Try re-training the algorithm with computed interaction terms (you can drop the involved columns in a separate try to compare)
- Try feature scaling
- Try oversampling the minority class using different techniques
*******************************************************************
Recall the lecture on cross-validation: we use such techniques to find the best pipeline, and once we finalize it, we must use all the available data to get a well-generalized model for production. Want to recall how to save and load the final model? Please read the previous articles!
*******************************************************************
💐Click here to FOLLOW ME for new contents💐
🌹Keep practicing to brush-up & add new skills🌹
✅🌹💐💐💐🌹✅ Please clap and share >> you can help us reach someone who is struggling to learn these concepts.✅🌹💐💐💐🌹✅
Good luck!
See you in the next lecture on “A35: K-Nearest-Neighbors (KNN)-Behind The Scene!”.
Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:
**************************************************************************************************************************************
Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.
**************************************************************************************************************************************