A10: Pandas (Part-3): Merging, Combining & GroupBy

Junaid Qazi, PhD
9 min read · Apr 15, 2020

--

This article is a part of the “Data Science from Scratch — Can I to I Can” series.

Click here for the previous article/lecture on “A9: Hierarchical Indexing (MultiIndex) & Handling Missing Data”.

✅ A suggestion: Open a new Jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful…..!”

So, it’s time to talk about data wrangling and aggregation:

Let’s learn how to merge/join and combine/concatenate datasets, and how to use the groupby() method for this purpose.

Merging and combining datasets:

Merging and combining are common operations that we perform on data coming from different tables. Data contained in pandas objects, such as DataFrames, can be combined in a number of ways using pandas built-in methods:

  • merge(): connects rows in DataFrames based on one or more keys. (This will be familiar to users of SQL or other relational databases, as it implements database join operations.)
  • concat(): concatenates or "stacks" together objects along an axis.

Let’s learn these methods with examples; however, before moving forward, you may want to take a quick look at the figure below on join operations (also known as SQL joins).

If you don’t know SQL, don’t worry; the concepts of merging are presented with very simple examples so that you can follow the steps. Our focus here is not to learn SQL; we only want to go through a few widely used and very important inner and outer join operations for data wrangling.
If you have questions, please ask and we are more than happy to help!

An important thing you should know: merge operations may give NaN in the output, and these need to be treated according to the circumstances/requirements during data analysis.

Database-Style DataFrame joins: Merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational databases (e.g., SQL-based).
Let’s import the pandas library so that we can use its built-in methods.

We need data to work with, so let’s create two DataFrames, df1 and df2, from dictionaries; it’s super easy!

It’s always good to see what our data looks like!
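Since the original notebook cells are not shown here, below is a minimal sketch of this setup. The dictionary values are assumptions for illustration; what matters is that the key column of df1 holds a through e while df2 only holds a, b, and c, so the different join types will behave differently.

import pandas as pd

# df1: keys a..e with data columns A1 and B1 (values assumed for illustration)
df1 = pd.DataFrame({'key': ['a', 'b', 'c', 'd', 'e'],
                    'A1': [1, 2, 3, 4, 5],
                    'B1': [6, 7, 8, 9, 10]})

# df2: keys a..c only, with data columns A2 and B2
df2 = pd.DataFrame({'key': ['a', 'b', 'c'],
                    'A2': [11, 12, 13],
                    'B2': [14, 15, 16]})

print(df1)
print(df2)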

The above dataframes, df1 and df2, look great. Before we move on, let's explore the merge() method first.
We can type pd.merge and press <shift+tab> in the jupyter notebook's code cell to see the documentation.
You will notice that there are several parameters we can pass to the merge() method; the most important ones are how and on, which we will discuss here.

  • how tells merge() what type of join operation needs to be done; it could be inner, outer, left, or right. The default value of how is inner, if nothing is provided.
  • on tells merge() the field name(s) to join on, which could be a label or a list.

merge()

Let’s overview the how and on parameters of merge().

how:{inner, outer, left, right}

  • inner: use intersection of keys from both frames, similar to a SQL inner join.
  • outer: use union of keys from both frames, similar to a SQL full outer join.
  • left: use only keys from left frame, similar to a SQL left outer join.
  • right: use only keys from right frame, similar to a SQL right outer join.

on: label or list -- field (column or index level) names to join on, which must be found in both DataFrames.

Let’s explore all possible values of how with on='key', the column that is present in both dataframes, df1 and df2.

how = 'inner'

The key column in our merged dataframe will be the intersection of the key columns from both df1 and df2; in our case, a, b, c along with their associated data.

For comparison, I am using print to see the resultant merged dataframe along with the original dataframes df1, df2.
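A minimal sketch of this step, using the df1 and df2 assumed above:

# inner join: keep only the keys present in BOTH dataframes (a, b, c)
inner_merged = pd.merge(df1, df2, how='inner', on='key')
print(df1)
print(df2)
print(inner_merged)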

☞ In the merged dataframe, d and e did not appear. This is an inner join and it only returns the intersection of the key columns!

how = 'outer'

In this case, the key column in our merged dataframe will be the union of df1['key'] and df2['key']: all the values of the key column found in either table.
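A sketch of the outer join on the same assumed dataframes:

# outer join: keep the UNION of keys from both dataframes (a, b, c, d, e)
outer_merged = pd.merge(df1, df2, how='outer', on='key')
print(outer_merged)  # rows with keys d and e get NaN in A2 and B2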

☞ Now, in the merged dataframe, we got NaN in the A2 and B2 columns for the keys d and e. As this is a union operation, and A2, B2 values do not exist in df2 for the keys d and e, the missing data appears as NaN!

how = 'left'

In this case, only the key column of the left dataframe will be used (similar to a SQL left outer join).
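A sketch of the left join:

# left join: keep every key from the left dataframe, df1
left_merged = pd.merge(df1, df2, how='left', on='key')
print(left_merged)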

☞ We are getting NaN for the keys d and e in the columns A2 and B2, because d and e don't exist in the df2['key'] column!

how = 'right'

The right operation is similar to the left operation; however, this time only the key column of the right dataframe will be used (similar to a SQL right outer join).
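A sketch of the right join:

# right join: keep every key from the right dataframe, df2
right_merged = pd.merge(df1, df2, how='right', on='key')
print(right_merged)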

Good to know!

There are cases when we need to merge our data on more than one key; let's try using two key columns (key1, key2).
For this purpose, I am going to create two new dataframes with key1 and key2 columns in each of them. To make things even simpler, I am going to name them left and right.
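A minimal sketch of these two dataframes; the key combinations and data values are assumptions, chosen so that some (key1, key2) pairs overlap between the frames and some do not:

# two dataframes that share the key1 and key2 columns (values assumed)
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})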

inner is intersection: only the key pairs present in both dataframes will appear in the result.

As we know, outer is union: all key pairs present in either dataframe will appear in the result.

For a left join, only the key pairs in left will be used.

For a right join, only the key pairs in right will be used.
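All four variants on the two key columns, as a sketch using the left and right dataframes assumed above:

# pass a list of columns to on in order to merge on more than one key
print(pd.merge(left, right, how='inner', on=['key1', 'key2']))   # key pairs in both
print(pd.merge(left, right, how='outer', on=['key1', 'key2']))   # key pairs in either
print(pd.merge(left, right, how='left', on=['key1', 'key2']))    # key pairs from left
print(pd.merge(left, right, how='right', on=['key1', 'key2']))   # key pairs from right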

Concatenation:

Concatenation is also interchangeably referred to as binding or stacking. This operation basically glues DataFrames together.

☞ It’s important to remember that dimensions should match along the axis we are concatenating on.

We can use pd.concat and pass in a list of DataFrames to concatenate them together.
To understand the process, let's create two simple dataframes with the given indexes.
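A minimal sketch of such dataframes; the values are assumptions, the important part is that df1 uses indexes 0 to 3 and df2 uses indexes 4 to 7:

# two dataframes with the same columns but different (non-overlapping) indexes
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7']},
                   index=[4, 5, 6, 7])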

Let’s concatenate the dataframes df1 and df2 along the rows (axis=0).
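Using the dataframes assumed above:

# row-wise concatenation (axis=0 is the default): df2 is stacked below df1
pd.concat([df1, df2])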

What if we concatenate along the columns (axis=1)? Notice the missing data in the output below.
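Again with the assumed dataframes:

# column-wise concatenation: indexes that don't line up produce NaN
pd.concat([df1, df2], axis=1)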

Joining — Good to know!

Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.
Once again, let’s create two dataframes for this operation and learn by doing!
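A minimal sketch, assuming two small dataframes with partially overlapping indexes (the values are illustrative only):

# join combines the columns of differently-indexed dataframes using the index
left = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'B': ['B0', 'B2', 'B3']}, index=['K0', 'K2', 'K3'])

print(left.join(right))                 # keeps the index of the calling (left) dataframe
print(left.join(right, how='outer'))    # keeps the union of both indexes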

This concept of joining could be a little tricky for beginners; however, practice will make the difference. Please revise for better understanding.

Groupby()

Groupby is one of the most important and key pieces of functionality in pandas. It allows us to group data together, call aggregate functions, and combine the results in three steps: split-apply-combine.
Before we move on to the hands-on part, let’s try to understand how split-apply-combine works, using data in different colours!

  • Split: In this process, data contained in a pandas object (e.g. Series, DataFrame) is split into groups based on one or more keys that we provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).
  • Apply: Once splitting is done, a function is applied to each group independently, producing a new value.
  • Combine: Finally, the results of all those function applications are combined into a resulting object. The form of the resulting object will usually depend on what’s being done to the data.

Let's explore with some examples. I am going to create a dictionary and convert it into a pandas dataframe for this purpose; a good revision, right!
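A minimal sketch of such a dataframe; the store names, customer names, and sales figures below are assumptions for illustration only:

# a small sales table: one row per customer, with the store they bought from
data = {'Store': ['Walmart', 'Walmart', 'Costco', 'Costco', 'Target', 'Target'],
        'Customer': ['Tim', 'Jeremy', 'Mark', 'Denice', 'Ray', 'Sam'],
        'Sales': [150, 200, 550, 90, 430, 120]}
df = pd.DataFrame(data)
print(df)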

In the DataFrame df, we have a Customer (a unique name), Sales in numbers, and a Store name.
Let's group the data in df based on the "Store" column using the groupby method. This will create a DataFrameGroupBy object.

Grab the dataframe "df", access the "groupby" method using "." (the dot operator), and pass the column that you want to group your data on.
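A sketch of this step:

df.groupby('Store')
# <pandas.core.groupby.generic.DataFrameGroupBy object at 0x.......>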
Notice that we get a groupby object, stored at a memory location 0x.......

Let’s save the created object as a new variable.
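For example:

by_store = df.groupby('Store')   # the DataFrameGroupBy object, saved for reuse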

Now that we have the grouped data in the "by_store" object, we can call aggregation methods on this object (grouped dataframe).
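For example, the mean sales per store:

# newer pandas versions may require numeric_only=True to skip the Customer column
by_store.mean(numeric_only=True)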

Pandas will apply mean() to the numeric column "Sales". It ignores non-numeric columns automatically. The same is true for sum, std, max, and so on......

Notice that the grouped dataframe is a dataframe with "Store" as the index and "Sales" as a column. We can use the loc method to locate any value for a certain store after the aggregation function. This will give us the value (e.g. sales) for a single store.
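For example, using one of the (assumed) store names from the sketch above:

by_store.mean(numeric_only=True).loc['Walmart']   # aggregated value for a single store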

We can perform a whole lot of aggregation operations on the "by_store" object.
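For example (selecting the Sales column keeps newer pandas versions happy with the non-numeric Customer column):

by_store['Sales'].sum()     # total sales per store
by_store['Sales'].std()     # standard deviation per store
by_store['Sales'].max()     # largest single sale per store
by_store['Sales'].count()   # number of rows per store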

describe() is a useful method that gives a bunch of useful information, such as mean, min, quartile values, etc. for each store.

Let’s use transpose() after describe() so that the output looks good!
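A sketch of both steps:

by_store.describe()               # summary statistics per store
by_store.describe().transpose()   # transposed, so each store becomes a column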

Along with describe() and transpose(), we can grab the statistics for any store using its name (which is now a column) from the transposed dataframe as well!
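For instance, with an assumed store name:

by_store.describe().transpose()['Walmart']   # statistics for a single store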

It’s good to know that we can grab individual groups from the grouped data as well. get_group() is a useful function for this purpose!
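For example, with one of the assumed store names:

by_store.get_group('Costco')   # all the rows belonging to a single group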

Just to explore, try the different available options!

Great work, guys! So, this was all about data wrangling (merging/joining, combining/concatenation) and data aggregation using GroupBy for the moment.

Your clap and/or share can help us reach someone who is struggling to learn these concepts.

Good luck!

See you in the next lecture on “A11: Useful Methods”.

Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a Subject Matter Specialist, Data Science & Machine Learning Consultant and a Team Builder. He is a Professional Development Coach, Mentor, Author, and Invited Speaker. He can be reached for consulting projects and/or professional development training via LinkedIn or through ScienceAcademy.ca.


We offer professional development, corporate training, consulting, curriculum and content development in Data Science, Machine Learning and Blockchain.