A28: Dummy Variables >> Dealing with Categorical Features!

Quantitative vs Qualitative, Creating dummies in pandas, Redundant variables, Interpret model coefficients of dummy variables >> Hands-on with complete working code…!

Junaid Qazi, PhD
9 min readJan 4, 2022

This article is a part of “Data Science from Scratch — Can I to I Can”, a Lecture Notes Book Series. (Click here to get your copy today!)

Click here for the previous article/lecture on “A27: Bias-Variance Trade-off >> Spotting the Sweet-Spot!!”

⚠️ This is a learning lecture, and a benchmark dataset is used for that purpose.

✅ A Suggestion: Open a new jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful…..!”

💐💐Click here to FOLLOW ME for new content💐💐

Categorical features >> Creating dummies

We are going to cover the following topics in this lecture/article:

  1. Quantitative vs Qualitative Data
  2. The tips data from seaborn
  3. Creating Dummies
  4. Redundant Variables
  5. Machine Learning
  6. How to interpret the model coefficients of dummy variables
  7. To Do
  8. Readings

Let’s start with important imports and check their versions…!
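The notebook cell is not reproduced here; a minimal sketch of the imports used throughout this lecture (the exact versions on your machine may differ):

```python
# Core stack for this lecture
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn

# Quick version check
print('numpy  :', np.__version__)
print('pandas :', pd.__version__)
print('seaborn:', sns.__version__)
print('sklearn:', sklearn.__version__)
```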

A quick version check >> good to use the same versions to focus on learning 😊

1. Quantitative vs Qualitative Data

(Numerical — discrete and continuous vs Categorical — nominal and ordinal)

So far, we have worked with a couple of datasets, and all of them had numerical feature/predictor variables (X); however, data can include quantitative and/or qualitative variables as well!

Let’s start with a few important concepts regarding data variables:

Quantitative data contains numerical variables, and they can be:

  • Discrete — can only take certain values (whole numbers / a finite number of possible values) —

>>>>>>students: {10, 20, 30}

>>>>>>deaths: {1, 5, 6}

>>>>>>patients: {100, 400, 1000}

>>>>>> we can’t say 10.5 students or 1.5 deaths…..

  • Continuous — potentially this type of data can have infinitely many possible values (integer or float) —

>>>>>>weight: {1, 1.1, 3.5, 3.5555555}

>>>>>>price: {10, 10.50, 50.25}

Qualitative data, also called categorical data, contains categorical variables which define some characteristic. Categorical variables come in:

  • Nominal — an unordered list of categories —

>>>>>>gender: {male, female}

>>>>>>time: {dinner, lunch}

>>>>>>blood_group: {A, B, AB, O}

  • Ordinal — range of ordered values along a scale —

>>>>>>disease_stage: {mild/1, moderate/2, advanced/3}

>>>>>>star_rating: {1, 2, 3, 4, 5}

>>>>>>degree_of_pain:{none/0, mild/1, moderate/2, severe/3}

>>>>>>grade:{poor, fair, good, excellent}

As far as plotting is concerned, we have already explored a range of options for creating categorical plots using seaborn in the first part of this course (Data Science from Scratch — Part 1). However, we can’t perform mathematical operations on this type of categorical data; we must deal with it before we feed it to our machine learning model.

We can create dummies for the categorical variables present in our dataset. Let’s move on and see how to do this!

We will work with the tips dataset, which is a part of seaborn.

2. The tips data from seaborn
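The original notebook cell is not shown here; a minimal sketch of loading the data:

```python
# tips is a benchmark dataset that ships with seaborn
tips = sns.load_dataset('tips')
print(tips.shape)   # (244, 7)
tips.info()         # columns: total_bill, tip, sex, smoker, day, time, size
```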

So, we have 244 observations and 7 columns. 4 columns are categorical (sex, smoker, day, time).

Having said that we can’t perform mathematical operations on categorical variables, let’s see what happens if we try to compute correlations on the tips dataset (all features included).
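Something like the following (a sketch; the keyword argument is needed on newer pandas):

```python
# Pairwise correlations: only numeric columns can participate.
# Plain tips.corr() works on older pandas (non-numeric columns are silently
# dropped); pandas >= 2.0 requires numeric_only=True instead.
tips.corr(numeric_only=True)
```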

So, do you see the categorical (non-numeric) columns in the correlation dataframe above?

So, the categorical columns are not included; correlations can only be calculated between the numerical variables.

3. Creating Dummies

We can convert categorical variables into dummy/indicator variables, and this can be conveniently achieved using pandas’ pd.get_dummies() function!

Let’s start with a single column and create its dummies; we can select day!
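First, a quick look at the unique values of day:

```python
# Which unique values does the day column contain?
tips['day'].unique()   # four unique days: Sun, Sat, Thur, Fri
```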

So, we have four days.

Rather than having a single day column with values {'Thur', 'Fri', 'Sat', 'Sun'}, what if we have four columns (one for each day) and put 1 for the day in the respective observation and 0 for all other day columns....!

We are going to do this using pandas’ get_dummies function!
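A minimal sketch (prefix='day' gives the day_* column names discussed below; dtype=int keeps the output as 0/1 rather than True/False on newer pandas):

```python
# One indicator column per unique day
day_dummies = pd.get_dummies(tips['day'], prefix='day', dtype=int)
day_dummies.head()
```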

So, after creating dummies (dummifying) for day, we get 4 day columns, and only one of them will be 1 for any individual observation, right?

Newly created dummy column for Friday…..Friday appeared only 19 times in the day column, right?

We have 1 for 19 observations (datapoints) in the day_Fri column, and before creating dummies, Fri appeared 19 times in the day column; this makes sense!
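A quick sanity check along those lines:

```python
# The 1s in day_Fri should match the raw count of 'Fri' in the day column
print(day_dummies['day_Fri'].sum())   # 19
print((tips['day'] == 'Fri').sum())   # 19
```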

There is another point we need to consider: do we need four columns for the days?

Technically, if the {day_Thur, day_Fri, day_Sat} columns are all 0, it is obvious that {day_Sun} must be 1, and vice versa.

Do we really need four day columns, or will three work?

Well, we can actually drop one column, as having three will serve the purpose. This could be a huge saving in several ways (storage, memory, compute….etc) if we are working with data that has billions of observations, and yes, it is common to have such datasets!

We can get our task done by setting just one parameter, drop_first=True, and it will return k-1 dummies, where k is the number of unique values in the passed categorical column (day in this case)...

Let’s try!
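A minimal sketch:

```python
# drop_first=True returns k-1 dummies; the first category becomes the
# implicit baseline (all zeros in the remaining columns)
pd.get_dummies(tips['day'], prefix='day', drop_first=True, dtype=int).head()
```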

Notice that NOW, using drop_first=True, we only have three day dummy columns!

Let’s move on and create dummies for all categorical variables, keeping all the dummy columns.

Try it yourself in your own notebook to see the output…
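For reference, one way to do it:

```python
# Without a columns list, get_dummies encodes every object/category column
tips_dummies = pd.get_dummies(tips, dtype=int)
tips_dummies.columns.tolist()
```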

Remember, if we don’t pass a list of columns to pd.get_dummies(), it will automatically create dummies for all the categorical/object columns (e.g. sex: {male, female}). However, there could be a categorical column with numeric coding (e.g. star_rating: {1, 2, 3}); we MUST EXPLICITLY TELL pandas to transform such a column into dummies as well.....it will not be done automatically!
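A small illustration with a hypothetical numerically coded column (star_rating is not part of the tips data):

```python
# Hypothetical example: 1/2/3 are category codes, not quantities
df = pd.DataFrame({'price': [10.0, 25.5, 7.2],
                   'star_rating': [3, 1, 2]})

pd.get_dummies(df)                            # star_rating is left as-is (numeric)
pd.get_dummies(df, columns=['star_rating'])   # explicitly dummified
```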

Let’s get the correlation heatmap for our dataframe after dummies!
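One possible way to draw it:

```python
# Correlation heatmap with all dummy columns kept
plt.figure(figsize=(12, 8))
sns.heatmap(tips_dummies.corr(), annot=True, cmap='coolwarm')
plt.show()
```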

Correlation heatmap after dummifying categorical columns, can you figure out the redundant columns?

4. Redundant variables

We have used the drop_first parameter to remove the unnecessary dummy column for day.

If you look at the above heatmap, we can clearly see the anti-correlations for the binary variables (sex, smoker, time).

For the day column, we have more than two categories and only one is true at a time. The heatmap is computing pairwise correlations, so a similar anti-correlation will not show up for the day column; however, it is obvious that if three days are 0, the fourth must be 1..... We don't need the extra column in any of these cases; they are redundant.

  • It’s recommended to drop the redundant variables in the first place; otherwise, Lasso will reduce some of them to zero even with a mild regularization strength!

Let’s drop all the redundant variables and look at the heatmap again!
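A minimal sketch:

```python
# Re-create the dummies, dropping the first (redundant) column
# of each categorical variable
tips_final = pd.get_dummies(tips, drop_first=True, dtype=int)

plt.figure(figsize=(10, 6))
sns.heatmap(tips_final.corr(), annot=True, cmap='coolwarm')
plt.show()
```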

What is the difference between this and the above heatmap with all columns? Read the text for an explanation.

Well, the redundant columns are no longer in the data. Saturday and Sunday show a strong negative correlation.

5. Machine Learning — predicting the tip amount

Let’s try to predict the amount of tip, and compare the coefficients of linear regression, lasso and ridge.

Here is a link to the range of available linear models in sklearn.linear_model!

The code below will give us three regression models trained on the given dataset and plot the model coefficients as a single bar plot for comparison.
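Since the notebook cell is not reproduced here, a minimal sketch of how the three models might be trained and compared (the alpha values are illustrative assumptions, not the lecture’s exact settings):

```python
from sklearn.linear_model import LinearRegression, Lasso, Ridge

# Predict the tip amount from everything else
X = tips_final.drop('tip', axis=1)
y = tips_final['tip']

models = {'LinearReg': LinearRegression(),
          'Lasso': Lasso(alpha=0.1),    # illustrative regularization strengths
          'Ridge': Ridge(alpha=1.0)}

# Collect the coefficients of each fitted model into one dataframe
coef_df = pd.DataFrame(index=X.columns)
for name, model in models.items():
    model.fit(X, y)
    coef_df[name + '_coef'] = model.coef_

# Single grouped bar plot for comparison
coef_df.plot(kind='bar', figsize=(10, 5))
plt.ylabel('coefficient')
plt.tight_layout()
plt.show()
```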

For the first two variables on the left (total_bill and size), we can see three bars; the lasso bars do not appear for the remaining features. Guess why?

We can see that Lasso reduces the coefficients of many columns to zero, deciding that they are not important. If we had kept all the redundant columns, they would also get zero coefficients from lasso. As discussed above, it is much better to avoid having redundant variables in the first place; if we keep the redundant columns in the data, we will not have much control over which one is driven to zero, and that may lead to many further problems.

Lasso regression excludes the useless variables from the equation, so it could be considered a little better than Ridge at reducing variance in models that contain a lot of useless variables.

Finally, we can see that the total bill and the group size are the most important factors for the amount of the tip. This makes sense, as people usually give the tip as a % of their bill…..We have already explored this dataset in the first part of this course; try creating more plots to understand the data well before training any model.

Let’s move on and learn how to interpret the model coefficients in the presence of dummy variables.

6. How to interpret the model coefficients of dummy variables

Let’s look at the coefficients dataframe.
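Continuing the sketch above, rounding for readability:

```python
# Side-by-side coefficients of the three models, rounded to 2 decimals
coef_df.round(2)
```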

In the trained model, each dummy coefficient is the contribution of that category to the outcome variable relative to the baseline (dropped) category, holding everything else constant, e.g. for female/male:

0.015539 is rounded to 0.02 for the “LinearReg_coef” in the above dataframe.

The main purpose of this lecture was to understand how to deal with categorical variables; I hope this is helpful!

⚠️ <<<<A very important note>>>> You must deal with missing data (NaN) before you create dummies; otherwise, that column may not be encoded properly!
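A tiny illustration of the problem, on a made-up Series:

```python
# By default, a NaN row gets all-zero dummies, which silently makes it
# indistinguishable from the baseline (dropped) category
s = pd.Series(['male', 'female', None])
print(pd.get_dummies(s, dtype=int))                 # NaN row is all zeros
print(pd.get_dummies(s, dummy_na=True, dtype=int))  # explicit NaN indicator column
```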

>>>> By the way, do you really want to miss what is missing? A good read!

7. To Do

There is a dataset auto_data.csv in the practice github repository.

  • Look for the columns that need dummies, and think about why
  • See if there is missing data and deal with it

If you want to do a little Machine Learning, you can try predicting miles per gallon (mpg) after creating reasonable dummy variables!

So we have 304 unique names. Do you think we need that many dummy columns?

<<<A TIP:>>> It does not make sense to create almost as many dummy variables as there are observations in your data. You can instead create a new category (e.g. auto_brand). You may also see some spelling mistakes; this is where your skills in basic python will work: map, split, replace etc........!
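A hypothetical sketch of that idea (the column name name and the spelling fixes are assumptions about auto_data.csv):

```python
# Load the practice dataset (path is an assumption)
auto_data = pd.read_csv('auto_data.csv')

# The first word of the car name is usually the brand
auto_data['auto_brand'] = auto_data['name'].str.split().str[0]

# Fix a few spelling variants before creating dummies (illustrative mapping)
auto_data['auto_brand'] = auto_data['auto_brand'].replace(
    {'chevy': 'chevrolet', 'vw': 'volkswagen'})

auto_data['auto_brand'].nunique()   # far fewer categories than 304 unique names
```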

8. Readings

All done so far, Good Luck!

💐💐Click here to FOLLOW ME for new content💐💐

<<Keep practicing to brush up and add new skills.>>

Excellent work!

Your clap and share can help us reach someone who is struggling to learn these concepts.

Good luck!

See you in the next lecture on “A29: Logistic Regression (Part-1) >> Theory Slides/lecture”.

Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a Subject Matter Specialist, Data Science & Machine Learning Consultant and a Team Builder. He is a Professional Development Coach, Mentor, Author, and Invited Speaker. He can be reached for consulting projects and/or professional development training via LinkedIn.
