A31: Logistic Regression >> Dead or Alive >> Step-by-step complete machine learning project!
Exploratory data analysis, model training, validation, evaluation, building ROC-Curve, model explanation using LIME & SHAP…and much more … A step-by-step complete tutorial using logistic regression…!
This article is a part of “Data Science from Scratch — Can I to I Can”, A Lecture Notes Book Series. (click here to get your copy today!)
⚠️ A benchmark dataset is used for learning purposes.
✅ A suggestion: Open a new Jupyter notebook and type the code while reading this article; doing is learning. And yes, "PLEASE read the comments, they are very useful…!"
Logistic Regression — Dead or Alive
A complete project, from exploratory data analysis and building a machine learning model to evaluation and explanation.
1. The dataset
2. Exploratory data analysis — EDA
- > 2.1: Visualize the missing data
- > 2.2: Know more about the data — Asking questions
3. Getting data ready for machine learning — Data preprocessing
- > 3.1: Data cleaning
- > 3.2: Dealing with categorical features — Creating dummies
- > 3.3: Good to know (explore yourself) — ColumnTransformer, make_column_selector, Pipeline
4. Train and test datasets
5. Feature scaling — Standardization
6. Building machine learning model
- > 6.1: Model training
- > 6.2: Regularization review
- > 6.3: Predictions and evaluation
- >> Classification report
- >> Confusion matrix
- > 6.4: Predicting probabilities instead of class
- >> Receiver operating characteristic — The ROC-curve
- > 6.5: Saving the model
- > 6.6: Feature importance
- >> Regression coefficients
- >> Coefficients and odds ratios
- >> Permutation feature importance
7. Model Explainability
- > 7.1: LIME
- > 7.2: SHAP
8. To do
- > 8.1: Recommended readings
Welcome back guys,
From the previous lectures, we have a deeper understanding of Logistic Regression, and we also know how to implement this model using the scikit-learn library from the Python data science ecosystem.
Let's move on and explore one of the most famous benchmark datasets in history, the Titanic disaster dataset. This dataset is considered a first step towards classification in machine learning.
1. The Dataset
In the Titanic dataset, we have the following features. As the goal is to predict whether a passenger survived or not, the target variable will be the "Survived" column.
Data Dictionary:
👉 PassengerId
👉 Pclass — Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
👉 Name — Passenger name
👉 Sex — male / female
👉 Age — age in years
👉 SibSp — no. of siblings / spouses aboard the Titanic
👉 Parch — no. of parents / children aboard the Titanic
👉 Ticket — Ticket number
👉 Fare — Passenger fare
👉 Cabin — Cabin number
👉 Embarked — Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton
🎯 Survived — 0 = No, 1 = Yes
Remember, the goal here is to predict whether a passenger survived the sinking of the Titanic or not.
⚠️ Please note, we have two separate data files for this project: the training dataset (X_train, y_train) and the test dataset (X_test, y_test).
It might be a good idea to start with some exploratory data analysis (EDA); understanding the data is very important for any project. After EDA, we will train a Logistic Regression model on the training part of the dataset for classification. We will then use the trained model to make predictions for the test dataset, which will be unseen by the model. A separate file with targets for the test data is also given in the GitHub data repository, so we can then see the model's performance on unseen data!
Let’s start our journey!
First things first, let's import some libraries. At this stage, I am sure these libraries are not new to you!
import pandas as pd; import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')  # just optional!
%matplotlib inline
sns.set_context('paper', font_scale=1.7)  # control the scaling of plot elements
%config InlineBackend.figure_format = 'retina'  # Retina display to see better quality images
# Lines below are just to ignore warnings
import warnings; warnings.filterwarnings('ignore')
Let's read the training part of the dataset into train:
train = pd.read_csv('https://raw.githubusercontent.com/junaidqazi/DataSets_Practice_ScienceAcademy/master/train_titanic_Xy.csv')
2. Exploratory Data Analysis — EDA
Let's overview the dataset using info() first!
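A minimal sketch of that first look (the dataset is already loaded into train above):
# quick overview: column names, non-null counts and dtypes
train.info()
# train.head() is also handy for a first glance at the rows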
2.1: Visualize the missing data
So, we have 891 entries in our train dataset, with the column Name along with other information about the traveler, such as passenger class (Pclass), Fare, Ticket, Cabin etc.
Notice:
- the Age column has 714 non-null values
- Cabin has only 204 non-null values
- Embarked also has 889 non-nulls
So there is some data missing!
Let's do some calculation to find out the % of missing data in each column! Remember, we have the function isnull() for this situation!
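A minimal sketch of that calculation, using the train dataframe loaded above:
# percentage of missing values per column, sorted from most to least missing
missing_percent = train.isnull().sum() / len(train) * 100
print(missing_percent.sort_values(ascending=False).round(1))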
We have the numbers now!
- the Cabin column is missing 77.1% of its data
- the Age column is missing 19.9% of its data
- the Embarked column is missing 0.2% of its data
Recall and refresh your skills in dealing with missing data; we are going to use those skills at a later stage.
isnull() returns True for all places where the data is missing. Our dataset is not very small, so we'd better think about graphical visualization using seaborn's heatmap method to visualize the missing data!
Let’s try!!
# heatmap using seaborn, you can set the figure size if you want!
plt.figure(figsize = (18,6))
sns.heatmap(data = train.isnull());
The above plot might be OK, but the visualization of our heatmap can be improved. The yticklabels are overlapping and the color bar is also not useful in this case. We can set yticklabels and cbar to False and also use cmap='viridis' for a cleaner map (you can use a colormap of your choice)!
Let's try again!
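One possible version of the cleaner heatmap (a sketch; the colormap is just a choice):
plt.figure(figsize = (18,6))
sns.heatmap(data=train.isnull(), yticklabels=False, cbar=False, cmap='viridis');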
The heatmap looks much better now. The yellow places are the Trues that represent missing data in the respective column.
2.2: Know more about the data — Asking questions
Well, we want to know more about the dataset. We can use countplot() to see how many people survived and how many died!
plt.figure(figsize = (18,4))
sns.countplot(x='Survived', data=train);
# try a different palette, such as 'coolwarm' or any other!
It's sad that not many passengers survived! Let's dig a little deeper; we can use the EDA skills that we learned in the previous lectures of this course and pass hue='Sex' to see the female and male ratio among the survived and died passengers.
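A sketch of that countplot with the hue argument (column names as in the data dictionary above):
plt.figure(figsize = (18,4))
sns.countplot(x='Survived', hue='Sex', data=train);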
The plot suggests that, among the dead, most were male. The survival rate for females was much higher, as most of the females survived.
>> We can ask another question here! <<
We know there were three passenger classes on the Titanic; which class survived the most?
nunique() or unique() on Pclass, and hue='Pclass', can be useful to answer this question!
Just a comment for the next countplot: you can use any palette, e.g. palette='coolwarm' or 'rainbow' etc.; it's your choice, I am just trying to keep things simpler using the default one!
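A possible version of this plot, keeping the default palette:
plt.figure(figsize = (18,4))
sns.countplot(x='Survived', hue='Pclass', data=train);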
Well, we got an even better understanding of our data. Now we know that more than half of the class-1 passengers survived, whereas most of the class-3 passengers died.
The rate of survival was higher for the class-1 passengers! Makes sense?
Let's explore more and see what the survival count was based on the Port of Embarkation.
It looks like passengers who embarked from Southampton port (S) had a better chance of survival!
Well, the survival count was higher for the S port; however, the rate of survival was higher for port C.
Here comes another question: we may want to explore further and see the class of the passengers with respect to the port of embarkation.
It's super easy, just pass hue='Pclass' now! This is again a countplot!
plt.figure(figsize = (18,4))
sns.countplot(x='Embarked', data=train, hue='Pclass')
plt.legend(loc=1);
You may notice that more understanding of the data gives better insights!
Now we see that Southampton was actually the busiest port for each class! We can expect more survivors from there in absolute numbers; however, the rate of survival was higher for port C.
Furthermore, we can see how many passengers traveled with siblings/spouses and parents/children, and we can plot a histogram to see how age was distributed among the travelers.
I encourage you to ask questions of yourself and try to apply your EDA skills to learn more about the data you are working with. Even for this Titanic dataset, you may want to use different types of plots, including interactive ones, to get better insights.
Let's get a few more plots (see the sketch below) and then move on to the next step, in which we will be getting our data ready for the machine learning model.
We will do some data cleaning and will convert categorical features to dummy variables using pandas >> recall the previous lecture on dealing with categorical features <<
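For the group-size questions below, a sketch of one such plot using the SibSp column (Parch can be plotted the same way):
plt.figure(figsize = (18,4))
sns.countplot(x='SibSp', hue='Survived', data=train);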
- What do you learn from the plot above?
- Is there any trend with survival with the group size?
Moving forward, I just want to have another plot to see the age distribution of the passengers on the Titanic!
plt.figure(figsize = (16,4))
sns.distplot(train['Age'].dropna(), kde=False, color='green', bins=30);  # try without dropna(), what do you see?
# the line below >> using pandas data visualizations
#train['Age'].hist(bins=30, color='green', alpha=0.5)
That's enough EDA for now, because our goal is to train a machine learning model. Before that, we must transform the data into a format that is acceptable to our algorithm.
Remember, understanding the data is extremely important. Nowadays, our data is rarely a small flat table; most of the time we are working with much higher dimensional data, and we need a range of plotting options to learn more about the relationships among the features.
3. Getting data ready for machine learning — Data preprocessing
3.1: Data Cleaning
So, we know from EDA that some data is missing in our dataset; let's deal with that first.
The Age column is missing ~19.9% of its data.
- A convenient way to fix the 'Age' column is to fill the missing data with the mean or average value of all passengers in that column. We can do even better in this case: because we know that there are three passenger classes, it's better to fill each missing age with the average age of that passenger's own class.
Let's use a boxplot() to visually explore whether there is any relationship between class and passenger age.
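A minimal sketch of that boxplot (class on the x-axis, age on the y-axis):
plt.figure(figsize = (16,6))
sns.boxplot(x='Pclass', y='Age', data=train);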
Yes, Pclass and Age are somehow related. This makes sense: the older the passenger, the higher the class they traveled in!
So our hypothesis, to fill the missing Age with respect to the passenger class, is a better way to fill in the missing data in the Age column!
We can write a function and use apply() from pandas for this task. However, before writing the function, we may want to know the average age of passengers in each class; groupby() could be useful here!
Let's find the average age of passengers in each class first; we only need the Pclass and Age columns for this purpose!
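A sketch of the groupby step, using only the two relevant columns:
# average age of passengers in each passenger class
train[['Pclass', 'Age']].groupby('Pclass').mean()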
Now that we have the average age for each class, let's write a custom function to fill the missing values in the Age column. Super easy, we can use an if-else conditional statement in the function!
Let's apply the above function to our data now. We can use the apply() method and pass axis=1 so the function is applied row by row across the columns. (recall from the pandas section)
Let's re-plot the heatmap now, after fixing the Age column!
So, we got this done, no more yellow color in the Age column. This means we have filled all the missing values in the Age column using the impute_age function.
Now, there is another column, Cabin, with ~77.1% of its data missing.
77% is a lot of missing information! Well, we might be able to analyze the ticket number to see if we can get some information on the Cabin; however, let's leave that for the moment and simply drop this column.
# dropping 'Cabin' column, axis=1 for column and inplace=True for permanent change!
train.drop('Cabin', axis=1, inplace=True)
Let's see how the heatmap looks now!
plt.figure(figsize = (16,6))
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis');
So, we don't have the Cabin column in our data now; the only yellow colour left is for the Embarked column, and that is only 0.2% of the data.
We can either fill the missing Embarked values with the most frequent port or drop the rows. Well, dropping the rows will not hurt here, it's only a very small fraction.
Let’s recreate the heatmap and see how it looks like now!
Great! We don’t have any missing data in our dataset now!
But, we are not done yet, we need to deal with the Categorical Features now!
3.2: Dealing with Categorical Features — Creating dummies.
From the previous lecture, we already know that we can use the pandas built-in method pandas.get_dummies to convert categorical variables into dummy/indicator variables.
This is important because:
- In the case of string variables (e.g. gender/sex: {male, female}), a machine learning model cannot take them directly as input to work with.
- For an ordinal variable (e.g. star_rating: {1, 2, 3, 4, 5}), the number represents a category, like 1 for poor, 2 for good, 3 for fair, 4 for better and 5 for excellent; these must also be dummified.
We already know how the process works, so let's start with the Sex column.
In the Sex column we have female/male; they are strings and we can't do mathematical operations on them. We can actually create a column representing 0 for female and 1 for male; this process is creating dummies.
==> GOOD TO KNOW >> Please explore OneHotEncoder() from scikit-learn; it encodes categorical features as a one-hot (aka 'one-of-K' or 'dummy' encoding scheme) numeric array. It's a good idea to explore the differences and usage of pandas' get_dummies and scikit-learn's OneHotEncoder. The ultimate goal is to handle categorical features for model training.
Notice that we get a column for every single category, with 0 or 1 as the indicator value (a column for female and a column for male here). Actually, one column is a perfect predictor for the other in this case, right?
For example, if female is 0 then it is obvious that male is 1, and vice versa. Passing both columns to the algorithm is not useful as they are redundant. Our algorithm will immediately know that if there is a "0" for female then it can perfectly predict it's going to be a "1" for male.
- Recall the concept of Multi-Collinearity; this will mess up our Machine Learning algorithm, because some columns will be perfect predictors of other columns. What if we have 20 categorical predictors and we need to create dummies for all of them? We will end up with lots of redundant data!
We know that we can avoid these redundant columns, and hence Multi-Collinearity: just set drop_first=True in get_dummies() and we're all done!
Let's perform these two operations now (see the sketch below):
- Create dummies for the Sex, Embarked and Pclass columns
- Add them to the dataframe using a concatenate operation (a good review of pandas essentials)
So, we got the needed dummy columns as a replacement for Sex, Embarked and Pclass, and we don't need the original columns anymore. We have also decided not to use the Ticket and Name columns at the moment, so we can drop the ['Sex','Embarked','Name','Ticket','Pclass'] columns.
HINT: Feature Engineering is a very useful process; we can actually create new features from the existing columns. For example, in this data, we could grab the first letter of the Ticket, the prefix of the Name (Mr., Mrs. etc.) or the last name, and so on. We could also explore whether there is any clue about the cabin in the ticket number!…
Let's drop the Sex, Embarked, Ticket, Name and Pclass columns from our dataframe/dataset.
Well, the dataframe looks good, all columns are numerical now!
Actually, we can drop the PassengerId column as well. This column is essentially just an index that starts at 1. Although it is a numerical column, it is not very useful for predicting whether a passenger survived or not.
Let's do it!
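A sketch of the drop step (the same columns we drop later for the test data):
# dropping the original categorical columns and the PassengerId index-like column
train.drop(['Sex','Embarked','Name','Ticket','PassengerId','Pclass'], axis=1, inplace=True)
train.head(2)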
3.3 Good to know
In the above few cells, we converted categorical features into dummies and then concatenated them into a final dataframe. Scikit-learn provides a module, ColumnTransformer(), that can be very helpful for the above implementation and also for scaling the features independently, in one go. The module allows different columns or column subsets of the input to be transformed separately, and the features generated by each transformer are concatenated to form a single feature space. This link from scikit-learn would be helpful: Column Transformer with Mixed Types.
We can actually build a pipeline with steps to preprocess the selected columns, such as dummies/one-hot-encoding, scaling, a data-imputation strategy and much more, including the model that we want to train on the data. Anyhow, the most important thing at the moment is understanding the individual steps to build solid foundations; you can always explore ways to combine strategies with less coding effort once you know how things work. We can even write our own custom modules or pipelines to combine several processes according to our project needs, and this is common practice in a professional setup.
These are other useful links to explore: make_column_selector and Pipeline
4. Train and test datasets
Typically, we split the data into train and test parts using train_test_split().
# Importing required method from sklearn
from sklearn.model_selection import train_test_split
# Let's keep the default size and states
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
However, we have separate files for the training and test datasets, so we will work with two separate datasets here. Normally, this is how we work with real life projects.
Remember, you must perform all the preprocessing on the test part of the data (the same as you do with the train part).
>>Train part (X_train, y_train)<<
Let's separate the features into X_train and the target into y_train. Survived is our target column, whereas all the others are features in train (the full training dataset)!
5. Feature scaling — Standardization
So we have separated the features into X_train; it is always good to standardize them. Another good practice is to save the transformation and use it for the test data.
Let's standardize the features in X_train and save the transformation as well, to use for the test data later.
from sklearn.preprocessing import MinMaxScaler  # StandardScaler
import pickle  # need this import
scaler = MinMaxScaler()  # StandardScaler() # Creating instance 'scaler'
scaler.fit(X_train)  # fitting the features
# Saving the transformation
pickle.dump(obj=scaler, file=open(file='transformation.pkl', mode='wb'))
# Loading saved transformation
scaler = pickle.load(file=open(file='transformation.pkl', mode='rb'))
# transforming features
X_train_s = scaler.transform(X_train)
X_train_s is a numpy ndarray; we can convert it into a dataframe if we want, however, it will work as is as well!
Now we have the scaled features for the training data and the targets in (X_train_s, y_train).
Let’s move on and train our model!
All good now!
Our data is ready to train our Machine Learning model for classification.
Just a quick review on what we did:
- fixed the missing data issue
- created dummies for categorical features
- dealt with Multi-Collinearity issue
- dropped the columns we don’t need
At this stage, I am sure that you have realized how important it is to do the Exploratory Data Analysis!
Let’s move on to build and train Logistic Regression Model for our data!
6. Building machine learning model
Our dataset is ready for building our Logistic Regression Model.
6.1: Model training
To train our logistic regression model, we need to import LogisticRegression from the linear_model family in sklearn, and then we need to create its instance! (same as we did in the linear regression part)
n_jobs=-1 means using all processors, and the verbose parameter sets the verbosity!
As its name suggests, max_iter sets the maximum number of iterations; we will talk about the other two parameters that we have used (C and penalty) in a while, they are related to regularization!
If you explore the documentation of logistic regression, you will see a range of parameters. We can actually find the best combination of parameters with grid search; we will talk about it later, and you can come back and do the optimization then.
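A sketch of the import, instance creation and training step; the exact parameter values here (C, max_iter etc.) are just reasonable placeholders, not tuned choices:
from sklearn.linear_model import LogisticRegression
# creating the instance -- penalty and C are the regularization settings discussed below
logR = LogisticRegression(penalty='l2', C=1.0, max_iter=1000, n_jobs=-1, verbose=1)
logR.fit(X_train_s, y_train)  # training on the scaled features and the targets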
So, our model is trained on the given (X_train_s, y_train) data.
6.2: Regularization review
Before we move on to predictions, let's have a quick talk about the parameters "C" and "penalty".
In the figure above, we have examples of under-fitted and over-fitted classification models on the left and right respectively. The middle one is generalized and a good model for unseen data, and this is where the penalty term makes the difference. Recall the Regularization or penalty term, which penalizes, specifically, "large" weight coefficients and discourages learning a more complex or flexible model. This reduces the risk of over-fitting.
It's important to understand that we don't want our model to memorize the training dataset; we want a model that learns the patterns and generalizes well to new and unseen data.
l2 or Ridge is the default regularization in scikit-learn for logistic regression. The other one is l1, the lasso regularization. The difference between l2 and l1 is that the l2 penalty is the sum of the squares of the weights, while l1 is the sum of the absolute values of the weights. (Recall the lecture on regularization)
Now, the C parameter: this is the inverse of the regularization strength; smaller values of C specify stronger regularization.
So, once again in a very simple language:
Regularization is an act of modifying a learning algorithm to favor simpler prediction rules and to avoid over-fitting. Most commonly, regularization refers to modifying the loss function to penalize certain values (specifically the large ones) of the weights our model is learning.
Well, I think this should be enough of a regularization review; if you want to explore more, please revise the previous lecture on regularization and the suggested material.
Let's move on and get the predictions for our test dataset from the trained model logR.
6.3: Predictions and evaluation
Let’s read test data from the provided links.
# Reading test (X_test, y_test) from Git repository
X_test = pd.read_csv('https://raw.githubusercontent.com/junaidqazi/DataSets_Practice_ScienceAcademy/master/test_titanic_X.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/junaidqazi/DataSets_Practice_ScienceAcademy/master/test_titanic_y.csv')
Let's join them; this is useful because, in case we drop a row from X_test for some reason, we must drop the respective row from y_test as well. Joining makes things simpler!
So, to get the predictions, all the preprocessing that we carried out on the training data needs to be done for the test data too, right? (Think why!)
Let's treat the test data in the same way we treated the training data.
# All the preprocessing that we did with the train part -- must be done!
# Impute age
test['Age'] = test[['Age','Pclass']].apply(impute_age, axis=1)
# dropping cabin column
test.drop('Cabin', axis=1, inplace=True)
# dropping any missing row
test.dropna(inplace=True)
# Getting dummies
sex_emb_test = pd.get_dummies(test[['Sex', 'Embarked']], drop_first=True)
p_class_test = pd.get_dummies(test['Pclass'].astype(str), drop_first=True)
# Concatenating dummy cols with test
test = pd.concat([test, sex_emb_test, p_class_test], axis=1)
# Dropping the columns accordingly
test.drop(['Sex','Embarked','Name','Ticket','PassengerId','Pclass'], axis=1, inplace=True)
# Separating features and the target
X_test = test.drop('Survived', axis=1)
y_test = test['Survived']
# (Recall, we saved the transformation for feature standardization; we can load that
# saved transformation and use it to standardize the test data as well)
# Scaling features >> Loading saved transformation
scaler = pickle.load(file=open(file='transformation.pkl', mode='rb'))
X_test_s = scaler.transform(X_test)  # transforming features
# Creating dataframe for test features
X_test_s = pd.DataFrame(X_test_s, columns=X_test.columns)
X_test_s.head(2)
All done in the above code cell, so (X_test_s, y_test) is ready for model evaluation!
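A minimal sketch of the prediction and scoring step; pred_train and pred_test are used again in the classification report below:
# class predictions for both parts of the data
pred_train = logR.predict(X_train_s)
pred_test = logR.predict(X_test_s)
# mean accuracy on train and test
print('Training accuracy:', round(logR.score(X_train_s, y_train), 3))
print('Test accuracy:', round(logR.score(X_test_s, y_test), 3))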
The test score is higher than the training score; considering the amount of effort so far, the results are not bad, however, there is always room for improvement.
>>Classification report<<
The evaluation process has its own importance; we want our model to be as good as possible at making predictions. We have learned that scikit-learn provides a very nice and efficient way to evaluate classification tasks using its classification_report function.
Let's import it and use it for evaluation.
Let’s import this module and use for evaluation.
from sklearn.metrics import classification_report
print("*****************************************************")
print("Report on training data:")
print(classification_report(y_train,pred_train))
print("*****************************************************")
print("*****************************************************\n")
print("Report on test data:")
print(classification_report(y_test,pred_test))
print("*****************************************************")
The classification report tells us about precision, recall, f1-score and support cases for each class along with their averages.
>>Confusion matrix<<
It's all up to us: if, instead of the classification report, we are more interested in the confusion matrix to calculate a specific value, we can get that as well (recall previous lectures).
Let's get the confusion matrix using scikit-learn. We need to do another import.
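A sketch of that import and call (pred_test comes from the prediction step above):
from sklearn.metrics import confusion_matrix
# rows are the true classes, columns are the predicted classes
print(confusion_matrix(y_test, pred_test))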
🧘🏻♂️ Optional: 🧘🏻♂️
It is always nice to present your results in a self-explanatory way. The above confusion matrix can be presented as a nice looking coloured dataframe using simple code.
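One way to do that, as a sketch (the row/column labels are just illustrative):
cm = confusion_matrix(y_test, pred_test)
cm_df = pd.DataFrame(cm, index=['Actual: died/0', 'Actual: survived/1'],
                     columns=['Predicted: died/0', 'Predicted: survived/1'])
cm_df.style.background_gradient(cmap='Blues')  # renders as a coloured table in a notebook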
Note >>> Despite being a relatively simple model, logistic regression is widely used in real world problems. The coefficients are interpretable and we can understand how a feature X affects the target y. Another advantage of logistic regression is that it usually does not suffer from high variance, due to the large number of simplifying assumptions placed on the model (i.e. the features are "linear in the logit", the errors are independent and follow a Bernoulli distribution -- a discrete distribution having two possible outcomes, etc.).
6.4: Predicting probabilities instead of class
Using our trained model (logR), we have predicted the class for each datapoint. However, we know that there is a probability associated with each class.
Much of the time, especially when predicting disease, we may want to look at the probabilities for each class as well. These probability values give us more control over the results, and we can even calibrate the threshold accordingly.
Recall the logistic regression theory: our default cut-off is a probability of 0.5 (e.g. class 0 for a probability value in [0.0 to 0.49] and class 1 for a probability value in [0.5 to 1.0]). What if we want to re-adjust this probability threshold and tune the behavior of our model for a specific problem, say we want 1 only for those with a probability in [0.7 to 1.0]? We can enforce this once we know the probabilities.
Let’s try to look at the probabilities now!
# predicting probabilities for the test data
prob_test_set = logR.predict_proba(X_test_s)
We have the predicted probabilities in prob_test_set and the predicted classes in pred_test.
Let’s create a dataframe with class probabilities and the predicted class.
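A sketch of such a dataframe (the column names are just descriptive labels):
prob_df = pd.DataFrame(prob_test_set, columns=['prob_class_0', 'prob_class_1'])
prob_df['predicted_class'] = pred_test
prob_df.head(10)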
If you don't want to create a dataframe, you can use the code below just to see the first 10 values.
print("Predicted class for the first 5 datapoints in test set:")
print(pred_test[0:10])
print("\nPredicted probabilities for first 5 datapoints in test set:")
print(prob_test_set[0:10])
In the above output, we can clearly see that the class is predicted based on whether the probability is above or below 0.5. (Typically, probabilities above 0.5 are assigned to class 1; whether 0.5 itself goes to class 0 or 1 does not really matter.)
Ok, this is cool, we know how to predict class and get the respective probability value.
Let's move on and see how the cut-off value, the probability threshold, affects the accuracy score and the confusion matrix -- the False Positive Rate (fpr) & True Positive Rate (tpr) values!
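A sketch of that experiment, looping over a few illustrative cut-off values and recomputing accuracy, fpr and tpr each time:
from sklearn.metrics import accuracy_score, confusion_matrix
for cutoff in [0.3, 0.4, 0.5, 0.6, 0.7]:
    pred_at_cutoff = (prob_test_set[:, 1] >= cutoff).astype(int)  # class 1 if prob >= cutoff
    tn, fp, fn, tp = confusion_matrix(y_test, pred_at_cutoff).ravel()
    print('cutoff =', cutoff,
          ' accuracy =', round(accuracy_score(y_test, pred_at_cutoff), 3),
          ' fpr =', round(fp / (fp + tn), 3),
          ' tpr =', round(tp / (tp + fn), 3))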
Look for the lowest value of fpr where tpr is maximum..... in this case, it's already at the default probability cut-off!
I want to move on and introduce another concept here, ROC-Curve. (quick recall on the theory lecture is required)
>>Receiver operating characteristic — The ROC-curve<<
From the above dataframe, let's plot fpr against tpr.
The plot above is called the ROC curve! For different probability cut-offs, we get the confusion matrix and compute fpr & tpr to plot this curve.
The ROC curve is a graphical plot that describes the diagnostic ability of a binary classifier. We create the ROC curve by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings.
- TPR is also known as sensitivity, recall or probability of detection.
- FPR is also known as the fall-out or probability of false alarm and can be calculated as (1 - specificity), where specificity = tn / (tn + fp), which is the same as the TNR -- True Negative Rate.
We can think of the ROC curve as "sensitivity as a function of fall-out".
The ROC curve can be a great tool to compare different models across thresholds, and the area under the ROC curve (AUC) can be used as a summary of model performance: a larger AUC is usually better.
Scikit-learn provides built-in and easy options to get the ROC curve; let's create one for the current logistic regression model!
# Computing Receiver operating characteristic (ROC)
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_true=y_test, y_score=prob_test_set[:,1])
The variable thresholds holds all the probability cut-offs (decision boundaries) for which fpr and tpr are computed.
The first value, thresholds[0], represents no instances being predicted and is arbitrarily set to max(y_score) + 1. Read the documentation to learn about the other parameters, such as pos_label (positive class label) and drop_intermediate (set to True by default to create a lighter ROC curve with fewer datapoints); we usually don't need them, however, it is good to know the available options.
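The plot code below uses an AUC_ROC value in its label; a minimal sketch to compute it, using roc_auc_score from scikit-learn:
from sklearn.metrics import roc_auc_score
# area under the ROC curve, computed from the class-1 probabilities
AUC_ROC = roc_auc_score(y_true=y_test, y_score=prob_test_set[:, 1])
print('AUC:', round(AUC_ROC, 3))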
# Let's get the ROC plot now!
plt.figure(figsize = (16,6))  # setting the figure size
# plot no skill -- a dashed line on the plot for random guess
plt.plot([0,1], [0,1], ls='--', lw=3, label='Random guess')
# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.', label='ROC -- Area Under The Curve: %.3f' % AUC_ROC)
# let's set the limits (0,1)
plt.xlim([0, 1])
plt.ylim([0, 1])
# good to put title and labels
plt.title('ROC curve for the Logistic Regression model (logR) trained on the current dataset.')
plt.ylabel('TPR / Sensitivity / Recall')
plt.xlabel('FPR / Fall-out')
# putting the legends
plt.legend(fontsize=18);
A few good-to-know points on the ROC curve:
- ROC curves are typically used in binary classification to study the output of a classifier or a machine learning model.
- If we want an ROC curve for a multi-class classification problem, we need to binarize the output and draw one ROC curve per label.
- The Ideal Point is the top left corner of the plot, where we have FPR=0 and TPR=1.
- Steeper ROC curves maximize the TPR while minimizing the FPR.
- A well trained and skillful model will, on average, assign higher probabilities to randomly chosen real positive occurrences (unknown datapoints) than to negative occurrences. In general, such models are represented by curves that bow up towards the top left (Ideal Point) of the ROC plot.
Smaller values along the X-axis of the ROC plot indicate lower False Positives and higher True Negatives (don't confuse TPR with True Negatives -- one is a rate and the other is a count). Larger values on the Y-axis of the plot indicate higher True Positives and lower False Negatives.
Sensitivity vs Specificity: (a recap)
- Sensitivity / true positive rate / recall / probability of detection measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).
- Specificity / true negative rate measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).
6.5: Saving the model
We now have our trained model as the logR object. Same as in the previous lecture on linear regression, we want to save the model to disk so that we can load this trained model and use it at any time at a later stage.
Nothing new, this should be straightforward by now!
Let's do this.
Let's load the saved model and make the predictions again; the results should be the same on the same data, as long as the model has not been changed or retrained on new data under different conditions!
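A sketch of the save-and-reload step, following the same pickle pattern we used for the scaler; the file name 'logR_titanic.pkl' is just a placeholder:
# saving the trained model to disk
pickle.dump(obj=logR, file=open(file='logR_titanic.pkl', mode='wb'))
# loading it back and scoring again
logR_loaded = pickle.load(file=open(file='logR_titanic.pkl', mode='rb'))
print('Test accuracy from the reloaded model:', round(logR_loaded.score(X_test_s, y_test), 3))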
Compare the results! Now you know how to save a trained model, load it and use it for unknown data!
We are done with our classification project using Logistic Regression. I hope you enjoyed the journey and everything is clear.
In this project, we performed many of the steps that we usually do as Data Scientists in our real life projects. We cleaned the data and transformed it into an acceptable format so that we could feed it as input to train our machine learning algorithm.
You may have noticed that the machine learning libraries are quite simple and straightforward to use. We spend a significant amount of time on data cleaning and preparation. And yes, understanding the working principle and theory behind the model is very important; optimization takes lots and lots of time in real life projects!
6.6: Feature Importance
🧘🏻♂️>>Regression coefficients<<🧘🏻♂️
If we look at the linear models (linear regression, logistic regression and their regularized extensions), we actually find a set of coefficients (β values) for the features in the training dataset, and then the weighted sum (coefficient_1 x feature_1 + …..) helps us make the predictions.
Technically, these coefficients (the β values) can provide a basis for a crude way of scoring the respective feature importance, which guides us to the usefulness of a certain feature for predicting the target values, provided that the features are on the same scale.
We can get the coefficients of our trained model and create a bar plot; let's get one for our model logR and explore how the model worked while training on (X_train_s, y_train).
In this classification problem, we have two classes, 0/died and 1/survived. The coefficients have positive and negative values: the ones with positive scores indicate that the respective feature pushes the prediction towards class 1/survived, and the ones with negative scores indicate features that push towards class 0/died.
If we look at the fare, it got a positive value; it's a numeric variable and the interpretation is that, all else being equal, a passenger is more likely to have survived if he or she paid a higher fare! -- What do you think about the Pclass of such a traveler?
🧘🏻♂️>>Coefficients and odds ratios<<🧘🏻♂️
So we can get the odds ratios by exponentiating the coefficients of the logistic regression……!
It's easier to exponentiate the logistic regression coefficients and present them as odds ratios…. interpret coefficients — odds ratios in logistic regression
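A sketch of that exponentiation step, shown as a bar plot of odds ratios:
odds_ratios = pd.Series(np.exp(logR.coef_[0]), index=X_train.columns).sort_values()
plt.figure(figsize = (16,4))
odds_ratios.plot(kind='bar', color='green', alpha=0.6)
plt.axhline(y=1, ls='--', lw=1)  # an odds ratio of 1 means no effect
plt.title('Odds ratios (exponentiated logistic regression coefficients)');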
So, from the plot above, a one-unit increase in fare makes it ~1.5 times more likely that the passenger would survive, while all other parameters are held unchanged. Similarly, we can think about survival for male/female.
- If we have multiple features in our logistic regression model with no interaction terms among them, each exponentiated coefficient returns the estimated ratio of two odds. We can interpret this as the multiplicative change in odds for a unit increase in the corresponding feature, while holding all other features at a fixed value.
- On the other hand, if our logistic regression model has interaction terms between features (e.g. X1 x X2), the interpretation of the coefficients becomes a little tricky. An interaction term attempts to describe how the effect of one feature depends on the value of another feature, and this makes the interpretation more involved. This link would be helpful here!
🧘🏻♂️>>Permutation feature importance<<🧘🏻♂️
sklearn.inspection.permutation_importance() -- Permutation feature importance is a model inspection (careful examination) technique that can be used for any fitted estimator when the data is tabular.
==> The concept here is quite simple: we measure the importance of a feature by calculating the increase in the model's prediction error after permuting that feature. This link could be helpful for further details.
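A sketch of the call, computed here on the test set (n_repeats and random_state are arbitrary choices):
from sklearn.inspection import permutation_importance
perm = permutation_importance(logR, X_test_s, y_test, n_repeats=10, random_state=42)
# collecting the results in a dataframe, one row per feature
perm_df = pd.DataFrame({'feature': X_test_s.columns,
                        'importance_mean': perm.importances_mean,
                        'importance_std': perm.importances_std}).sort_values('importance_mean')
perm_df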
It is good to visualize the numbers from the above dataframe as a bar plot!
The permutation importances plot above shows that sex is a significantly important feature for making predictions.
- If we compute permutation importances on the test dataset (the held-out test or validation set), it becomes possible to highlight which features contribute the most to the generalization power of the inspected model. Features that are important on the training set but not on the held-out set might cause the model to overfit. link
Features that are deemed of low importance for a bad model (low cross-validation score) could be very important for a good model. Therefore, it is always important to evaluate the predictive power of a model using a held-out set (or better, with cross-validation) prior to computing importances. Permutation importance does not reflect the intrinsic predictive value of a feature by itself, but how important this feature is for a particular model. Reference
7. Model Explainability
Well, we are already familiar with LIME and SHAP from our previous lectures. Let's try to get some explanation of how the model is making its predictions.
7.1: LIME (Local Interpretable Model-agnostic Explanations)
# Let's see how the model came up to the decision
from lime.lime_tabular import LimeTabularExplainer
explainer = LimeTabularExplainer(training_data=X_train_s, feature_names=X_train_s.columns,
                                 class_names=['died/0', 'survived/1'], discretize_continuous=False)
# X_obs: a single observation (one row of the scaled features) that we want to explain
lime_exp = explainer.explain_instance(X_obs.values[0], logR.predict_proba)  # , num_features=10
lime_exp.show_in_notebook()
7.2: SHAP (SHapley Additive exPlanations)
# Importing SHAP and initializing JavaScript for its visualizations; we can use matplotlib as well!
import shap
shap.initjs()  # JavaScript
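A minimal sketch of one way to get SHAP values for our linear model; explainer_shap and shap_values are the names assumed by the commented force_plot call further below, LinearExplainer is just one reasonable choice here, and X_train_s is assumed to have been wrapped in a dataframe (as we did for X_test_s):
# explaining the linear model, using the scaled training features as background data
explainer_shap = shap.LinearExplainer(logR, X_train_s)
shap_values = explainer_shap.shap_values(X_train_s)
# force plot for a single observation (here simply the first row)
shap.force_plot(base_value=explainer_shap.expected_value,
                shap_values=shap_values[0, :],
                features=X_train_s.iloc[0, :])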
So, the values in red are helping the model to make prediction of class 0 (died) in this observation, whereas the blue are reducing the chances to predict class 0 and helping to predict class 1.
# Try this interactive force plot
#shap.force_plot(base_value=explainer_shap.expected_value,shap_values=shap_values,features=X_train)
8. To do
Considering the amount of data and time we have used in this project, the results are very good. They can be improved with more data and by adding more features.
A few things that you may want to consider while practicing:
- Well, we considered Pclass as a categorical column and created its dummies; try to re-train the model with the original Pclass column, without dummies, and compare your results. What are your findings and why are the results different?
- Do you think you can get any information from the Ticket or any other column?
- Grab the prefix/title (Mr., Mrs., Dr. etc.) from Name as a feature.
The Titanic dataset is very popular for classification problems and there are a number of good kernels on Kaggle. Check the Python ones; you may get different ideas to improve your model.
The kernels in other languages, such as R, are also useful; you can get ideas on data cleaning, plotting and some feature engineering that you can implement in Python as well.
8.1: Recommended Readings
- Amazing explanation on Logistic Regression — Why sigmoid function?
- Why is logistic regression considered a linear model?
- Feature Importance
All done so far! I hope you have created your own notebook and can now use your own data to train a logistic regression classification model……!
💐Click here to FOLLOW ME for new contents💐
🌹Keep practicing to brush-up & add new skills🌹
✅🌹💐💐💐🌹✅ Please clap and share >> you can help us to reach to someone who is struggling to learn these concepts.✅🌹💐💐💐🌹✅
Good luck!
See you in the next lecture on “A32: Multi-class Classification using Logistic Regression!”
Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:
Dr. Junaid Qazi is a subject matter specialist, data science & machine learning consultant, and a team builder. He is a professional development coach, mentor, author, technical writer, and invited speaker. Dr. Qazi can be reached for consulting projects, technical writing and/or professional development trainings via LinkedIn.