A11: Pandas (Part-4): Useful Methods and Operations
This article is a part of the “Data Science from Scratch — Can I to I Can” series.
Click here for the previous article/lecture on “A10: Pandas (Part-3): Merging, Combining & GroupBy”.
✅ A Suggestion: Open a new Jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful!”
So, let’s have a quick look at some of the most important methods that you will be using all the time. Along with describe(), which gives the basic statistics on your data, we have already covered head(), isnull(), dropna(), fillna(), etc.
In this lecture, we will explore other general purpose operations and revise what we have learned in the previous lectures.
Let’s create a dataframe to get hands-on experience with these operations: one with repeated values along with missing data (NaN).
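The original notebook cell is not shown, so here is a minimal sketch of such a dataframe; the column names follow the article (col_1, col_2, col_3), but the actual values are an assumption:

```python
import numpy as np
import pandas as pd

# Sample dataframe with repeated values (col_2) and missing data (col_3);
# the specific values are assumed, not taken from the original notebook.
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 111, 333, 222, 111],
                   'col_3': ['aaa', 'bbb', 'aaa', 'ccc', np.nan, np.nan]})
print(df)
```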
We can start with what we already know.
info(): Provides a concise summary of a DataFrame. We will use this function very often in the course.
head(n): Returns the first n rows (default is 5). It’s very useful to get an overview of our data. We will use this function very often in the course.
isnull(): Returns a boolean same-sized object indicating whether the values are null.
dropna(): Another useful function that can be used to drop the missing data. The parameter axis defines row or column; its default value is 0, meaning rows, whereas axis=1 is for columns. inplace is another very useful parameter; if set to True, any change to the dataframe will be permanent (its default value is False). Explore the documentation to learn more about the parameters how and thresh as well; they are quite self-explanatory by their names!
fillna(): This function is handy to fill the missing NA/NaN values using a specified method, such as bfill or backfill, which uses the next valid observation to fill the gap, whereas ffill or pad propagates the last valid observation forward to the next valid observation. The parameter axis takes 0/'index' or 1/'columns', and inplace must be set to True for a permanent change.
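The methods above can be sketched on our sample dataframe (values assumed, as before). Note that in recent pandas versions, df.ffill() is the preferred spelling of fillna(method='ffill'):

```python
import numpy as np
import pandas as pd

# Sample dataframe (assumed values, as in the rest of this lecture)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 111, 333, 222, 111],
                   'col_3': ['aaa', 'bbb', 'aaa', 'ccc', np.nan, np.nan]})

df.info()                 # concise summary: index, dtypes, non-null counts
print(df.head(3))         # first 3 rows
print(df.isnull())        # boolean mask, True where a value is missing

print(df.dropna())        # drops the rows containing NaN
print(df.dropna(axis=1))  # drops the column containing NaN (col_3)
print(df.ffill())         # forward fill, same effect as method='ffill'/'pad'
```

None of these calls used inplace=True, so df itself is unchanged afterwards.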
unique(): This function returns all the unique values in your selected data. Let’s see how it works on all the columns in our dataframe.
nunique(): Returns how many unique values exist.
☞ Notice the difference here: NaN is counted as a missing value and will not be included, so you will get 3 for col_3.
value_counts(): This is another elegant function provided by pandas. Think about data with millions of entries where you want to know how many times each value appears in a certain column; value_counts() is going to be your option, as it returns all the values along with their counts!
☞ NaN is counted as a missing value, so nothing appears for it in the output.
✅ unique(), nunique(), and value_counts() are three very useful and frequently used methods, associated with finding unique values in the dataset.
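On our sample dataframe (assumed values), the three methods behave like this:

```python
import numpy as np
import pandas as pd

# Sample dataframe (assumed values, as in the rest of this lecture)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 111, 333, 222, 111],
                   'col_3': ['aaa', 'bbb', 'aaa', 'ccc', np.nan, np.nan]})

print(df['col_2'].unique())        # array of distinct values: 111, 222, 333
print(df['col_2'].nunique())       # 3
print(df['col_3'].nunique())       # 3 as well: the two NaNs are not counted
print(df['col_2'].value_counts())  # each distinct value with its frequency
```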
sort_values(): By default, it rearranges the data in ascending order, and for a permanent change, inplace needs to be set to True. As usual, axis is either 0 or 1, and the default sorting algorithm is quicksort. na_position is another parameter which is sometimes useful (depending upon the work); it is set to 'last' by default, which puts NaNs at the end.
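A quick sketch of both behaviors on the sample dataframe (assumed values):

```python
import numpy as np
import pandas as pd

# Sample dataframe (assumed values, as in the rest of this lecture)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 111, 333, 222, 111],
                   'col_3': ['aaa', 'bbb', 'aaa', 'ccc', np.nan, np.nan]})

print(df.sort_values(by='col_2'))  # ascending order, kind='quicksort' by default
print(df.sort_values(by='col_3', na_position='first'))  # NaNs first, not last
```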
apply(): Indeed, this is one of the most powerful pandas features. Using the apply() method, we can broadcast our customized functions over our data. Let’s practically use this method to calculate the square of the values in col_1: we broadcast our customized function square using the apply method on col_1 in our DataFrame, df.
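A sketch of the square function broadcast over col_1 (sample values assumed):

```python
import numpy as np
import pandas as pd

# Sample dataframe (assumed values, as in the rest of this lecture)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 111, 333, 222, 111],
                   'col_3': ['aaa', 'bbb', 'aaa', 'ccc', np.nan, np.nan]})

def square(value):
    """Our customized function: returns the square of a single value."""
    return value ** 2

print(df['col_1'].apply(square))  # 1, 4, 9, 16, 25, 36
```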
A lambda expression is much more convenient for the above operation, let’s try!
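The same squaring operation as a one-liner, with no named function needed (sample values assumed):

```python
import numpy as np
import pandas as pd

# Sample dataframe (assumed values, as in the rest of this lecture)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 111, 333, 222, 111],
                   'col_3': ['aaa', 'bbb', 'aaa', 'ccc', np.nan, np.nan]})

print(df['col_1'].apply(lambda value: value ** 2))  # same result as square()
```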
Try the code below. What do you see, and why do you get TypeError: object of type 'float' has no len()?
df['col_3'].apply(len)
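The reason is that missing values are stored as NaN, which is a float, and len() is not defined for floats. A minimal sketch (sample values assumed):

```python
import numpy as np
import pandas as pd

# Sample dataframe (assumed values); the last two col_3 entries are NaN
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 111, 333, 222, 111],
                   'col_3': ['aaa', 'bbb', 'aaa', 'ccc', np.nan, np.nan]})

try:
    df['col_3'].apply(len)  # works on the strings, then hits a NaN (a float)
except TypeError as err:
    msg = str(err)
    print(msg)  # object of type 'float' has no len()
```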
Data Selection
Let’s talk about selecting data once again. We have learned to grab data in our previous lectures as well.
- We can grab a column with its name, do conditional selection, and much more.
- We can use loc and iloc to find rows as well.
Let’s revise the conditional selection, this also includes data selection based on the column name.
Let’s do the following steps:
- Task 1: df['col_1'] > 2 returns the data where the condition is True (if you remember, this is just a boolean series).
- Task 2: df['col_2'] == 111 returns the data where the condition is True.
- Task 3: Let’s combine these two conditions with & by putting each of them in ().
- Task 4: Wrap them in df[] and see what it returns!
Our one-line code for the above 4 tasks is df[(df['col_1'] > 2) & (df['col_2'] == 111)]
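The four steps above can be sketched on the sample dataframe (values assumed):

```python
import numpy as np
import pandas as pd

# Sample dataframe (assumed values, as in the rest of this lecture)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 111, 333, 222, 111],
                   'col_3': ['aaa', 'bbb', 'aaa', 'ccc', np.nan, np.nan]})

mask = (df['col_1'] > 2) & (df['col_2'] == 111)  # combined boolean Series
print(df[mask])  # only the rows where both conditions are True
```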
Good to know
Let’s create a copy of df and try inplace=True to see a practical example of this parameter.
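A sketch of copy() together with inplace=True (sample values assumed); note that the change sticks on the copy while the original df is untouched:

```python
import numpy as np
import pandas as pd

# Sample dataframe (assumed values, as in the rest of this lecture)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 111, 333, 222, 111],
                   'col_3': ['aaa', 'bbb', 'aaa', 'ccc', np.nan, np.nan]})

df_copy = df.copy()           # independent copy; edits will not touch df
df_copy.dropna(inplace=True)  # permanent change, but only on df_copy
print(df_copy.shape)          # two NaN rows are gone
print(df.shape)               # the original still has all six rows
```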
pivot_table()
Moving forward, let’s explore one of the most powerful operations that we usually perform in Excel: the pivot table. pivot_table() is indeed one of the key operations conveniently handled in pandas; it creates a spreadsheet-style pivot table as a DataFrame. The function takes three key parameters, values, index, and columns, and creates hierarchical indexes on the given index and columns in the resultant dataframe.
- values: default is None
- index: default is None
- columns: default is None
Let’s create a pivot table from our dataframe df. We want our data points from col_2, so values='col_2'; we want col_1 as the index, index='col_1'; and the columns coming from col_3, columns='col_3'.
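On the sample dataframe (values assumed), the call looks like this:

```python
import numpy as np
import pandas as pd

# Sample dataframe (assumed values); the last two col_3 entries are NaN
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 111, 333, 222, 111],
                   'col_3': ['aaa', 'bbb', 'aaa', 'ccc', np.nan, np.nan]})

pivot = df.pivot_table(values='col_2', index='col_1', columns='col_3')
print(pivot)  # rows whose col_3 is NaN are dropped from the pivot table
```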
☞ Excel users are usually familiar with pivot tables. If you have just learned about this function, don’t worry at this stage; we will be discussing it in the coming sections according to our requirements.
Important notice: NaN appears among the values taken from col_2, but what happened to the NaN in col_3? Well, NaN in col_3 will not be used as a column name in the pivot table; hence, indexes 4 and 5 are skipped.
Let’s have a look at another example for pivot_table(). In our resultant dataframe from foobar, we want:
- data points from D ==> values = 'D'
- a multilevel index from A, B ==> index = ['A', 'B']
- columns from C ==> columns = ['C']
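The foobar dataframe itself is not shown in the article, so the sample data below is an assumption; only the column names A, B, C, D and the pivot_table() call come from the text:

```python
import pandas as pd

# 'foobar' is not shown in the article; this sample data is an assumption
foobar = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                       'B': ['one', 'one', 'two', 'two', 'one', 'one'],
                       'C': ['x', 'y', 'x', 'y', 'x', 'y'],
                       'D': [1, 3, 2, 5, 4, 1]})

result = foobar.pivot_table(values='D', index=['A', 'B'], columns=['C'])
print(result)  # multilevel (A, B) index, columns taken from C
```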
This lecture was a little long, but you did it, amazing!
This was all about the pandas essentials that we wanted to learn at the moment. The concept of a pivot table might look a little complex and needs some practice (if this is your first time); I suggest repeating it with more examples.
Please have a quick recap and try repeating the whole lecture in a new Jupyter notebook. We are going to apply all of these concepts in the coming exercises and also throughout this course! Keep practicing to brush up and add new skills.
Excellent work!
Your clap and share can help us reach someone who is struggling to learn these concepts.
Good luck!
See you in the next lecture on “A12: Pandas (Practice Exercises >> 1: Ecommerce Purchases)”.
Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:
Dr. Junaid Qazi is a Subject Matter Specialist, Data Science & Machine Learning Consultant and a Team Builder. He is a Professional Development Coach, Mentor, Author, and Invited Speaker. He can be reached for consulting projects and/or professional development training via LinkedIn or through ScienceAcademy.ca.