A9: Pandas (Part-2): Hierarchical Indexing (MultiIndex) & Handling Missing Data

Junaid Qazi, PhD
7 min readMar 30, 2020

This article is a part of “Data Science from Scratch — Can I to I Can” series.

Click here for the previous article/lecture on “A8: Series & DataFrames — Index & Slicing”.

✅ A Suggestion: Open a new jupyter notebook and type the code while reading this article, doing is learning, and yes, “PLEASE Read the comment, they are very useful…..!”

Hello guys,
In the previous article, we have learned about pandas data structures. Let’s move on and talk about hierarchical indexing (MultiIndex) and handling missing data in this article.

Hierarchical Indexing (MultiIndex):

Hierarchical or Mulit-level Indexing is another very important feature in pandas, which makes it possible to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way to work with higher dimensional data in a lower dimensional data structures like 1-D Series or 2-D DataFrame.

Let’s start with a simple example for using Series and learn-by-doing:

I am going to start with creating a nested list "multi_index". We will use this list to create a multi-level index for a series.

Let’s create a series "ser" with multi-level index (2 indexes in this example).

With a hierarchically indexed object, so-called partial indexing is possible, which enables the concise selection of subsets of the data. For example, grabbing a for ser which is in level 1 index, will return all the data points under the umbrella of a along with the related level 2 index.
Let's try!

If, we want to grab a single value, we need to go down to the the index level 2. Here is an example!

Well, most of the time we will be working with DataFrame.
Let's create a DataFrame now!, With a DataFrame, either axis (row or column) can have a hierarchical index.

I am going to create a dataframe with multi-level index for rows only. To refresh our knowledge, let’s use arange() and reshape() here.

Now, the question is: How to index the above dataframe, df!

  • on the columns axis, just use normal bracket notation df[].
  • on row axis, we use df.loc[]

Calling one level of the index returns the sub-dataframe, let’s try to grab a sub-dataframe based on row index a.

What if we want to grab a single value, well, the idea is to go from outside to inside
To grab “11”, we need to use loc to grab the sub-dataframe for index b and then loc again to grab the required value, which is 11 in this case!

The hierarchical levels on indexes can have names (as strings or any Python objects). If so, these will show up in the console output:

Looks like df does not have names of its index levels. We can give names to the indexes, say "L_1, L_2" in this case. Its super easy!

Now, the above dataframe looks much better, as it has the respective names of its row indexes.

Good to know!

xs( )

Let me introduce a very useful and built-in method “xs()" to grab data from multilevel index.
xs() has ability to go inside a multilevel index.

If we want to grab all the data in df where index L_2 is "1", it's tricky for loc method, xs will do the magic here!
For Example:
Tell xs() that we want all the rows where L_2 has value/key equal to 1!

Handling Missing Data

Missing data is very common in many data science applications, and pandas has a great ability to handle such situation. Most of the times, missing data is also referred as NA or NaN.
Let's learn some convenient methods to deal with missing data in pandas:

Let’s start by creating a dataframe with missing data, we can use numpy for adding missing data (np.nan)!

To check the missing data in any dataframe, isnull(), notnull() are very useful methods. Let's see how they work.
(Please notice the difference for True and False in the outputs below!)

So, isnull() returns True / False, bool type. Let's confirm using dtypes after isnull() on any column.

Well, our dataframe "df" has only four columns and five rows, we can even count the number of missing values in each column.
What if we have thousands of rows in our dataset and we want to know how much data is missing! We can easily find out number of missing values in each column. We can use sum() after isnull(), and each True/False will be considered as 1/0. This will returns total count of missing values in each column.

Well, knowing number of missing values is fine, however, it’s convenient to think about the missing data in terms of percentages as given below.

If we want to find column or row sum, all NaN's are considered as "0" by sum().

pandas ignore NaN while computing means!

dropna(), fillna() are useful methods to clean and fill the missing data, however it is important to know some points.
By default, dropna() will drop all the rows with any number of NaN value. Default value of axis=0, which is for rows!

We can pass axis=1 to dropna(), this tells the function to drop all the columns with any NaN value!

thresh is a very useful parameter in dropna(). It is int type with default value is None
thresh = 3 means, Keep only the rows with at least 3 non-NA values (please note default axis is rows).

We can pass column axis (axis = 1) along with the thresh as well to put a condition on columns instead.

Real world datasets are messy, and it is common to have missing data. The best way is to find out the ways to fill the missing values in your datasets. pandas provides a function fillna() to fill in the missing data according to our set rules/choice. Let's explore how does this work.
☞ Remember! we need to set inplaced = True if we want a permanent change in our dataframe 'df', I am not passing this parameter as I don't want a permanent change at the moment (recall from the previous lecture).

Let’s fill in the values in column A with it's mean value.

method is another useful parameter, let's try!

method = pad or ffill: Forward fill, the last valid observation will be propagated to the next valid NaN.

method = bfill or backfill -- use NEXT valid observation to fill the gap

Great work guys! So, this was all about hierarchical indexing (MultiIndex) and handling missing data at the moment. We will go through other important concepts in the next articles.

Please clap and share this article in your contacts, someone might be struggling to learn these concepts.

Good luck!

See you in the next lecture on “A10: Data Wrangling: Merging/Joining, Combining/Concatenation”.

Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a Subject Matter Specialist, Data Science & Machine Learning Consultant. He is a Professional Development Coach, Mentor, Author, and Invited Speaker. He can be reached for consulting projects and/or professional development training via LinkedIn or through ScienceAcademy.ca.

--

--

Junaid Qazi, PhD

We offer professional development, corporate training, consulting, curriculum and content development in Data Science, Machine Learning and Blockchain.