A28: Dummy Variables >> Dealing with Categorical Features!
Quantitative vs Qualitative, Creating dummies in pandas, Redundant variables, Interpret model coefficients of dummy variables >> Hands-on with complete working code…!
This article is a part of “Data Science from Scratch — Can I to I Can”, A Lecture Notes Book Series. (click here to get your copy today!)
⚠️ This is a learning lecture, and a benchmark dataset is used for demonstration purposes.
✅ A Suggestion: Open a new jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful…..!”
Categorical features >> Creating dummies
We are going to cover following topics in this lecture/article:
- Quantitative vs Qualitative Data
- The tips data from seaborn
- Creating Dummies
- Redundant Variables
- Machine Learning
- How to interpret the model coefficients of dummy variables
- To Do
- Readings
Let’s start with the important imports and check their versions…!
A quick version check >> good to use the same versions to focus on learning😊
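A minimal sketch of the imports used throughout this lecture (the exact versions on your machine may differ slightly, which is fine):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

# Print library versions so you can match the environment if you want.
print('numpy  :', np.__version__)
print('pandas :', pd.__version__)
print('seaborn:', sns.__version__)
print('sklearn:', sklearn.__version__)
```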
1. Quantitative vs Qualitative Data
(Numerical — discrete and continuous vs Categorical — nominal and ordinal)
So far, we have worked with a couple of datasets, and all of them had numerical feature/predictor variables (X). However, real-world data can include both quantitative and qualitative variables!
Let’s start with a few important concepts regarding data variables:
Quantitative data contains numerical variables, which can be:
- Discrete — can only take certain values (whole numbers, a finite set of possible values) —
>>>>>>students: {10, 20, 30}
>>>>>>deaths: {1, 5, 6}
>>>>>>patients: {100, 400, 1000}
>>>>>> we can’t say 10.5 students or 1.5 deaths…..
- Continuous — this type of data can potentially take infinitely many values (integer or float) —
>>>>>>weight: {1, 1.1, 3.5, 3.5555555}
>>>>>>price: {10, 10.50, 50.25}
Qualitative data, also called categorical data, contains categorical variables which define some characteristic. Categorical variables come in two flavors:
- Nominal — an unordered list of categories —
>>>>>>gender: {male, female}
>>>>>>time: {dinner, lunch}
>>>>>>blood_group: {A, B, AB, O}
- Ordinal — range of ordered values along a scale —
>>>>>>disease_stage: { mild/1, moderate/2, advanced/3}
>>>>>>star_rating: {1, 2, 3, 4, 5}
>>>>>>degree_of_pain:{none/0, mild/1, moderate/2, severe/3}
>>>>>>grade:{poor, fair, good, excellent}
As far as plotting is concerned, we have already explored a range of options for creating categorical plots using seaborn in the first part of this course (Data Science from Scratch — Part 1). However, we can’t perform mathematical operations on categorical data, so we must deal with such variables before we feed them to our machine learning model.
We can create dummies for the categorical variables present in our dataset. Let’s move on and see how to do this!
We can work with the tips dataset, which is a part of seaborn.
2. The tips data from seaborn
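Something like the following loads the dataset and gives a first overview:

```python
# Load the benchmark tips dataset that ships with seaborn.
tips = sns.load_dataset('tips')
print(tips.shape)  # (244, 7)
tips.info()        # 3 numerical + 4 categorical columns
tips.head()
```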
So, we have 244 observations and 7 columns. 4 columns are categorical (sex, smoker, day, time).
Having said that we can’t perform mathematical operations on categorical variables, let’s see what happens if we try to compute correlations on the tips dataset (all features included).
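A minimal attempt looks like this (note that pandas >= 2.0 requires numeric_only=True here; older versions silently dropped the non-numeric columns):

```python
# Correlations over the whole dataframe: only numerical columns take part.
tips.corr(numeric_only=True)
```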
So, the categorical columns are not included; correlations can only be calculated between the numerical variables.
3. Creating Dummies
We can convert categorical variables into dummy/indicator variables, and this can be conveniently achieved using pandas — the pd.get_dummies function!
Let’s start with a single column and create its dummies; we can select day!
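A quick way to inspect the unique days and their frequencies:

```python
# Frequency of each unique day in the data.
tips['day'].value_counts()
# Sat: 87, Sun: 76, Thur: 62, Fri: 19 -- four days in total
```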
So, we have four days.
Rather than having a column day with values {'Sat', 'Sun', 'Thur', 'Fri'}, what if we have four columns (one for each day) and put 1 for the day in the respective observation and 0 for all other day columns....!
We are going to do this using the pandas get_dummies function!
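A minimal sketch (the prefix='day' argument is only there to get column names like day_Fri):

```python
# One dummy column per unique day; exactly one of them is 1 per row.
day_dummies = pd.get_dummies(tips['day'], prefix='day')
day_dummies.head()
day_dummies.sum()  # column totals match value_counts above (e.g. day_Fri = 19)
```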
So, after creating dummies (dummifying) for day, we got 4 day columns, and only one column gets 1 for each individual observation, right?
We have 1 for 19 observations (datapoints) in the day_Fri column, and before creating dummies, Fri appeared 19 times in the day column; this makes sense!
There is another point we need to consider: do we need four columns for the days?
Technically, if the {day_Thur, day_Fri, day_Sat} columns are all 0, it is obvious that {day_Sun} must be 1, and vice versa.
Do we really need four day columns, or will three work?
Well, we can actually drop one column, as having three will serve the purpose. This could be a huge saving in several ways (storage, memory, compute, etc.) if we are working with data with billions of observations, and yes, it is common to have such datasets!
We can get our task done by setting just one parameter, drop_first=True, and it will return k-1 dummies, where k is the number of unique values in the passed categorical column (day in this case)...
Let’s try!
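A sketch with the extra parameter (for this dataset's category ordering, Thur is the dropped category and becomes the implicit all-zeros baseline):

```python
# k categories -> k-1 dummy columns; the first category becomes the baseline.
pd.get_dummies(tips['day'], prefix='day', drop_first=True).head()
```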
Notice that NOW we only have three day columns with dummies, thanks to drop_first=True!
Let’s move on and create dummies for all categorical variables, keeping all the dummy columns; the code follows the note below.
Remember, if we don’t pass a list for the columns parameter in pd.get_dummies(), it will automatically create dummies for all the categorical/object columns (e.g. sex: {male, female}). However, there could be a categorical column with numeric coding (e.g. star_rating: {1, 2, 3}); we MUST EXPLICITLY TELL pandas to transform such columns into dummies as well..... it will not be done automatically!
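A minimal sketch (dtype=int keeps the dummies as 0/1 integers rather than the booleans newer pandas returns by default):

```python
# No columns list: every object/category column is dummified automatically.
tips_dummies = pd.get_dummies(tips, dtype=int)
tips_dummies.head()
```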
Let’s get the correlation heatmap for our dataframe after dummies!
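Something like the following (figure size and colormap are stylistic choices):

```python
# Correlation heatmap of the fully dummified dataframe.
plt.figure(figsize=(10, 8))
sns.heatmap(tips_dummies.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
```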
4. Redundant variables
We have used the drop_first parameter to remove the unnecessary dummy column for day.
If you look at the above heatmap, we can clearly see the anti-correlations for the binary variables (sex, smoker, time).
For the day column, we have more than two categories and only one is true at a time. The heatmap() is computing pairwise correlations, so a similar anti-correlation will not show up for the day column; however, it is obvious that if three days are 0, the fourth must be 1..... We don't need the extra column in any of these cases; these columns are redundant.
- It’s recommended to drop the redundant variables in the first place; otherwise, Lasso reduces them to zero even with a mild regularization strength!
Let’s drop all the redundant variables and look at the heatmap again!
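A sketch that rebuilds the dummies with drop_first=True and replots:

```python
# One redundant dummy per categorical column is dropped.
tips_final = pd.get_dummies(tips, drop_first=True, dtype=int)
plt.figure(figsize=(9, 7))
sns.heatmap(tips_final.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
```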
Well, the redundant columns are no longer in the data. Saturday and Sunday show a strong negative correlation.
5. Machine Learning — predicting the tip amount
Let’s try to predict the tip amount, and compare the coefficients of linear regression, Lasso, and Ridge.
Here is a link to the range of available linear models in sklearn.linear_model!
The code below will give us three regression models trained on the given dataset and plot the model coefficients as a single bar plot for comparison.
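A minimal sketch of that comparison; the alpha values below are illustrative assumptions, not tuned settings:

```python
from sklearn.linear_model import LinearRegression, Lasso, Ridge

X = tips_final.drop('tip', axis=1)  # all predictors after dummification
y = tips_final['tip']               # target: the tip amount

models = {'LinearRegression': LinearRegression(),
          'Lasso': Lasso(alpha=0.1),
          'Ridge': Ridge(alpha=1.0)}

# Fit each model and collect its coefficients for comparison.
coefs = {}
for name, model in models.items():
    model.fit(X, y)
    coefs[name] = model.coef_

coef_df = pd.DataFrame(coefs, index=X.columns)
coef_df.plot(kind='bar', figsize=(10, 5), title='Model coefficients')
plt.tight_layout()
plt.show()
```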
We can see that Lasso reduces the coefficients for many columns to zero, treating them as unimportant. If we had kept all the redundant columns, they would also get zero coefficients from Lasso. As discussed above, it is much better to avoid redundant variables in the first place; if we keep the redundant columns in the data, we have little control over which one is driven to zero, and that can lead to further problems.
Lasso regression excludes the useless variables from the equation, so it could be considered a little better than Ridge at reducing variance in models that contain a lot of useless variables.
Finally, we can see that the total bill and the group size are the most important factors for the amount of the tip. This makes sense: people usually give the tip as a % of their bill..... We have already explored this dataset in the first part of this course; try creating more plots to understand the data well before training any model.
Let’s move on and learn how to interpret the model coefficients in the presence of dummy variables.
6. How to interpret the model coefficients of dummy variables
Let’s look at the coefficients dataframe.
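Assuming the coef_df built in the modelling sketch above:

```python
# The same coefficients as a table rather than bars.
coef_df.round(3)
```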
Each dummy coefficient is the contribution of that category to the outcome variable relative to the dropped baseline category. For example, with drop_first=True the model keeps sex_Female and drops the male category, so the coefficient of sex_Female is the expected change in the tip for a female customer compared to a male one, all other variables held constant.
The main purpose of this lecture was to understand how to deal with categorical variables. I hope this was helpful!
⚠️ <<<<A very important note>>>> You must deal with missing data (NaN) before you create dummies; otherwise, such columns may not be encoded properly!
>>>> By the way, do you really want to miss what is missing? A good read!
7. To Do
There is a dataset auto_data.csv in the practice github repository.
- Look for the columns that need dummies, and think about why they need them.
- See if there is missing data, and deal with it.
If you want to do a little Machine Learning, you can try predicting miles per gallon (mpg) after creating reasonable dummy variables!
So we have 304 unique names; do you think we need that many dummy columns?
<<<A TIP:>>> It does not make sense to create almost as many dummy variables as there are observations in your data. You can instead create a new category (e.g. auto_brand). You may also see some spelling mistakes; this is where your skills in basic python will come in handy: map, split, replace etc........!
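A hypothetical sketch of that tip; the file path and the 'name' column are assumptions from the exercise, and the spelling fixes are only illustrative:

```python
# Collapse hundreds of unique car names into a handful of brands.
auto_data = pd.read_csv('auto_data.csv')
auto_data['brand'] = auto_data['name'].str.split().str[0]  # first word of the name
# Illustrative spelling fixes; inspect value_counts() to find the real ones.
auto_data['brand'] = auto_data['brand'].replace({'toyouta': 'toyota',
                                                 'chevroelt': 'chevrolet'})
auto_data['brand'].value_counts()
```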
8. Readings
- Explore the section “Types of Data” (the full course is a great source to learn from!)
- Different Type of Data
- LabelEncoder from scikit-learn
- scikit-learn’s labelencoder vs. pandas get-dummies
All done so far, Good Luck!
💐💐Click here to FOLLOW ME for new contents💐💐
<<Keep practicing to brush-up and add new skills.>>
Excellent work!
Your clap and share can help us reach someone who is struggling to learn these concepts.
Good luck!
See you in the next lecture on “A29: Logistic Regression (Part-1) >> Theory Slides/lecture”.
Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:
Dr. Junaid Qazi is a Subject Matter Specialist, Data Science & Machine Learning Consultant and a Team Builder. He is a Professional Development Coach, Mentor, Author, and Invited Speaker. He can be reached for consulting projects and/or professional development training via LinkedIn.