A10: Pandas (Part-3): Merging, Combining & GroupBy
This article is part of the “Data Science from Scratch — Can I to I Can” series.
✅ A suggestion: Open a new Jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful!”
So, it’s time to talk about data wrangling and aggregation:
Let’s learn how to merge/join and combine/concatenate dataframes, and how to use the groupby method for this purpose.
Merging and combining datasets:
Merging and combining are common operations that we perform on data coming from different tables. Data contained in pandas objects, such as dataframes, can be combined in a number of ways using pandas’ built-in methods:
merge(): connects rows in DataFrames based on one or more keys. (This will be familiar to users of SQL or other relational databases, as it implements database join operations.)
concat(): concatenates or “stacks” together objects along an axis.
Let’s learn these methods with examples. However, before moving forward, you may want to give a quick look at the figure below on join operations (also known as SQL joins).
✅ If you don’t know SQL, don’t worry, the concepts of merging are presented with very simple examples so that you can follow the steps. Although our focus here is not to learn SQL, we only want to go through a few widely used and very important joining operations (inner and outer) for data wrangling.
If you have questions, please ask and we are more than happy to help!
✅ Important thing you should know: Merging operations may give NaN in the output, and these need to be treated according to the circumstances/requirements during data analysis.
Database-Style DataFrame joins: Merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational databases (e.g., SQL-based).
Let’s import the pandas library so that we can use its built-in methods.
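In a notebook cell, that is simply:

```python
import pandas as pd   # pd is the conventional alias for pandas
```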
We need data to work with, so let’s create two DataFrames, df1 and df2, from dictionaries. It’s super easy!
It’s always good to see what our data looks like!
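A minimal sketch of such a cell is below; the column names and values (key, A1, B1, A2, B2) are assumptions chosen to match the outputs described later, so the actual notebook may use different numbers.

```python
# df1: five rows, with keys 'a' through 'e'
df1 = pd.DataFrame({'key': ['a', 'b', 'c', 'd', 'e'],
                    'A1': [1, 2, 3, 4, 5],
                    'B1': [10, 20, 30, 40, 50]})

# df2: three rows, with keys 'a' through 'c' only
df2 = pd.DataFrame({'key': ['a', 'b', 'c'],
                    'A2': [100, 200, 300],
                    'B2': [111, 222, 333]})

print(df1)
print(df2)
```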
The above dataframes, df1 and df2, look great. Before we move on, let's explore the merge() method first.
We can type pd.merge and press <shift+tab> in a Jupyter notebook's code cell to see the documentation. You will notice that there are several parameters that we can pass to the merge() method; the most important ones, which we will discuss here, are how and on.
how tells merge() what type of joining operation needs to be done; it could be inner, outer, left or right. The default value of how is inner, if nothing is provided. on tells the field name to join on, which could be a label or a list.
merge()
Let’s go over the how and on parameters in merge().
how: {inner, outer, left, right}
inner: use the intersection of keys from both frames, similar to a SQL inner join.
outer: use the union of keys from both frames, similar to a SQL full outer join.
left: use only keys from the left frame, similar to a SQL left outer join.
right: use only keys from the right frame, similar to a SQL right outer join.
on: label or list -- a field (column or index level) name to join on, which must be found in both DataFrames.
Let’s explore all possible values of how with on='key', which is present in both dataframes, df1 and df2.
how='inner'
The key column in our merged dataframe will be the intersection of the key columns from both df1 and df2; in our case a, b and c, along with the data associated with them.
☞ For comparison, I am using print to see the resultant merged dataframe along with the original dataframes df1 and df2.
☞ In the merged dataframe, d and e did not appear. This is an inner join and it only returns the intersection of the key columns!
how='outer'
In this case, the key column in our merged dataframe will be the union of df1['key'] and df2['key'], i.e. all the values found in the key column of either table.
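Something along these lines:

```python
outer_merged = pd.merge(df1, df2, how='outer', on='key')
print(outer_merged)   # keys 'd' and 'e' appear, with NaN in the A2 and B2 columns
```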
☞ Now, in the merged dataframe, we got NaN in the A2 and B2 columns for the keys d and e. As it is a union operation, and A2, B2 values do not exist in df2 for the keys d and e, the missing data appears as NaN!
how='left'
In this case, only the key column of the left dataframe will be used (similar to a SQL left outer join).
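For example:

```python
left_merged = pd.merge(df1, df2, how='left', on='key')
print(left_merged)    # all keys from df1; NaN in A2, B2 where df2 has no match
```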
☞ We are getting NaN for d and e in the columns A2 and B2, because d and e don't exist in the df2['key'] column!
how='right'
The right operation is similar to the left operation; however, this time only the key column of the right dataframe will be used (similar to a SQL right outer join).
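For example:

```python
right_merged = pd.merge(df1, df2, how='right', on='key')
print(right_merged)   # only the keys found in df2 are kept
```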
Good to know!
There are cases when we need to merge our data on more than one key, so let's try using two key columns (key1, key2). For this purpose, I am going to create two new dataframes with key1 and key2 columns in each of them. To make things even simpler, I am going to name them left and right.
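A possible pair of dataframes for this exercise is sketched below; the K0/K1/K2 key pairs and the A, B, C, D columns are illustrative assumptions.

```python
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
```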
inner is intersection: only the key pairs present in both dataframes will appear in the result.
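With the assumed left and right dataframes above:

```python
# only rows whose (key1, key2) pair exists in BOTH frames are returned
pd.merge(left, right, how='inner', on=['key1', 'key2'])
```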
As we know, outer is union: all key pairs present in either dataframe will appear in the result.
For a left join, only the key pairs from left will be used.
For a right join, only the key pairs from right will be used.
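The remaining variants follow the same pattern; run each line in its own cell (or wrap it in print) to see the output:

```python
pd.merge(left, right, how='outer', on=['key1', 'key2'])   # union of key pairs, NaN where missing
pd.merge(left, right, how='left',  on=['key1', 'key2'])   # key pairs from the left frame only
pd.merge(left, right, how='right', on=['key1', 'key2'])   # key pairs from the right frame only
```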
Concatenation:
Concatenation is also interchangeably referred to as binding or stacking. This operation basically glues DataFrames together.
☞ It’s important to remember that the dimensions should match along the axis we are concatenating on.
We can use pd.concat and pass in a list of DataFrames to concatenate them together.
To understand the process, let's create two simple dataframes with the given indexes.
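A minimal sketch, re-using the names df1 and df2; the letter values and the 0-7 indexes are assumptions.

```python
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7']},
                   index=[4, 5, 6, 7])
```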
Let’s concatenate both dataframes, df1 and df2, along the rows, axis=0.
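For example:

```python
pd.concat([df1, df2])          # axis=0 is the default: rows are stacked on top of each other
```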
What if we concatenate along the columns, axis=1? Notice the missing data in the output below.
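For example:

```python
pd.concat([df1, df2], axis=1)  # glued side by side; non-overlapping indexes produce NaN
```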
Joining — Good to know!
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.
Once again, let’s create two dataframes for this operation and learn by doing!
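A sketch under similar assumptions as before (the K-indexes and letter values are made up):

```python
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

left.join(right)                 # keep left's index; NaN where right has no matching index
# left.join(right, how='outer')  # union of both indexes
```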
This concept of joining could be a little tricky for beginners; however, practice will make the difference. Please revise for better understanding.
groupby()
Groupby is one of the most important and key functionalities in pandas. It allows us to group data together, call aggregate functions, and combine the results in three steps: split-apply-combine.
Before we move on to the hands-on part, let’s try to understand how this split-apply-combine process works, using data in different colours!
- Split: In this process, data contained in a pandas object (e.g. Series, DataFrame) is split into groups based on one or more keys that we provide. The splitting is performed on a particular axis of an object. For example, a DataFrame can be grouped on its rows (axis=0) or its columns (axis=1).
- Apply: Once splitting is done, a function is applied to all groups independently, producing a new value.
- Combine: Finally, the results of all those function applications are combined into a resultant object. The form of the resulting object will usually depend on what’s being done to the data.
Let’s explore with some examples. I am going to create a dictionary and convert it into a pandas dataframe for this purpose; a good revision, right?
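A minimal sketch of such a dataframe; the store and customer names and the sales figures below are made-up placeholders.

```python
data = {'Store':    ['Walmart', 'Walmart', 'Costco', 'Costco', 'Target', 'Target'],
        'Customer': ['Tim', 'Jeremy', 'Mark', 'Denice', 'Ray', 'Sarah'],
        'Sales':    [150, 200, 340, 124, 243, 350]}

df = pd.DataFrame(data)
df
```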
In the DataFrame df, we have a Customer (a unique name), Sales in numbers, and a store name.
Let's group the data in df based on the "Store" column using the groupby method. This will create a DataFrameGroupBy object.
Grab the dataframe "df", access the "groupby" method using the "." (dot operator), and pass the column that you want to group your data on.
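For example:

```python
df.groupby('Store')   # returns a DataFrameGroupBy object, something like <...DataFrameGroupBy object at 0x...>
```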
Notice that we get a groupby object, stored in memory at 0x.......
Let’s save the created object as a new variable.
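For example:

```python
by_store = df.groupby('Store')
```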
Now that we have grouped data in the "by_store" object, we can call aggregate methods on this object (grouped dataframe).
Pandas will apply mean() on the numeric column "Sales"; it ignores non-numeric columns automatically. The same is true for sum, std, max, and so on.
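A sketch of that aggregation cell; note that recent pandas versions may ask you to be explicit about numeric columns, hence the numeric_only flag below.

```python
by_store.mean(numeric_only=True)   # average Sales per store
```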
Notice that the grouped dataframe is a dataframe with "Store" as the index and "Sales" as a column. We can use the loc method to locate any value for a certain store after the aggregation function. This will give us the value (e.g. sales) for a single store.
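For example, using one of the assumed store names from above:

```python
by_store.sum(numeric_only=True).loc['Walmart']   # total sales for a single store
```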
We can perform a whole lot of aggregation operations on the "by_store" object.
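A few of them, as a sketch (run each in its own cell to see its output):

```python
by_store.sum(numeric_only=True)   # total sales per store
by_store.std(numeric_only=True)   # standard deviation of sales per store
by_store.count()                  # number of rows in each group
by_store.max()                    # column-wise maximum within each group
```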
describe() is a useful method that gives a bunch of useful information, such as mean, min, and quartile values, for each store.
Let’s use transpose() after describe() so that the output looks good!
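For example:

```python
by_store.describe().transpose()
```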
Along with describe() and transpose(), we can grab the statistics for any store using its name (which is a column in the transposed dataframe) as well!
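Using the assumed store name again:

```python
by_store.describe().transpose()['Walmart']
```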
It’s good to know that we can grab individual groups from the grouped data as well. get_group() is a useful function for this purpose!
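For example:

```python
by_store.get_group('Walmart')   # all the original rows belonging to one group
```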
Just to explore, try the different available options!
Great work, guys! So, this was all about data wrangling (merging/joining, combining/concatenation) and data aggregation using GroupBy for the moment.
Your clap and/or share can help us reach someone who is struggling to learn these concepts.
Good luck!
See you in the next lecture on “A11: Useful Methods”.
Note: This complete course, including video lectures and Jupyter notebooks, is available via the following links:
Dr. Junaid Qazi is a Subject Matter Specialist, Data Science & Machine Learning Consultant and a Team Builder. He is a Professional Development Coach, Mentor, Author, and Invited Speaker. He can be reached for consulting projects and/or professional development training via LinkedIn or through ScienceAcademy.ca.