A11: Pandas (Part-4): Useful Methods and Operations

Junaid Qazi, PhD
7 min read · Dec 10, 2020

--

This article is a part of the “Data Science from Scratch — Can I to I Can” series.

Click here for the previous article/lecture on “A10: Pandas (Part-3): Merging, Combining & GroupBy”.

✅ A Suggestion: Open a new Jupyter notebook and type the code while reading this article; doing is learning. And yes, “PLEASE read the comments, they are very useful…..!”

So, let’s have a quick look at some of the most important methods that you will be using all the time. Along with describe(), which gives basic statistics on your data, we have already covered head(), isnull(), dropna(), fillna(), etc.

In this lecture, we will explore other general purpose operations and revise what we have learned in the previous lectures.
Let’s create a dataframe to get hands-on experience on these operations.

Let’s create a dataframe with repeated values along with missing data NaN.
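The article’s original screenshot isn’t reproduced here, so the frame below is a hypothetical sketch with the same characteristics: a repeated value (111) in col_2 and missing NaN entries in col_3. The exact values are an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical data (the exact values are an assumption):
# col_2 repeats 111, col_3 carries strings plus missing NaN entries.
df = pd.DataFrame({
    'col_1': [1, 2, 3, 4, 5, 6],
    'col_2': [111, 222, 333, 111, 555, 666],
    'col_3': ['alpha', 'bravo', 'charlie', 'alpha', np.nan, np.nan],
})
print(df)
```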

We can start with what we already know.

info() : Provides a concise summary of a DataFrame. We will use this function very often in the course.

head(n) : Returns the first n (default is 5) rows. It’s very useful for getting an overview of our data. We will use this function very often in the course.

isnull() : Returns a boolean object of the same shape, indicating which values are null.
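The three calls above can be sketched as follows (re-creating the hypothetical frame from earlier so the snippet runs on its own; the values are assumed):

```python
import numpy as np
import pandas as pd

# Hypothetical frame (assumed values)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 333, 111, 555, 666],
                   'col_3': ['alpha', 'bravo', 'charlie', 'alpha', np.nan, np.nan]})

df.info()            # dtypes, non-null counts, memory usage
print(df.head(3))    # first 3 rows (n defaults to 5)
print(df.isnull())   # True wherever a value is missing
```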

dropna() : Another useful function, used to drop missing data. The parameter axis selects rows or columns; its default value is 0 (rows), whereas axis=1 drops columns. inplace is another very useful parameter: if set to True, the change to the dataframe is permanent; its default value is False. Explore the documentation to learn more about the parameters, including how and thresh; they are quite self-explanatory from their names!
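A quick sketch of the common dropna() variants (using the same hypothetical frame, whose values are assumed):

```python
import numpy as np
import pandas as pd

# Hypothetical frame (assumed values)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 333, 111, 555, 666],
                   'col_3': ['alpha', 'bravo', 'charlie', 'alpha', np.nan, np.nan]})

print(df.dropna())            # axis=0 (default): drop rows containing any NaN
print(df.dropna(axis=1))      # drop columns containing any NaN
print(df.dropna(thresh=3))    # keep rows with at least 3 non-NaN values
# df.dropna(inplace=True)     # would make the change permanent on df
```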

fillna() : This function is handy for filling missing data (NA/NaN values) using a specified method, such as bfill or backfill, which uses the next valid observation to fill the gap, whereas ffill or pad propagates the last valid observation forward to the next valid one.
The parameter axis is 0/'index' for rows or 1/'columns' for columns, and inplace must be set to True for a permanent change.
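A sketch of the filling options on the hypothetical frame (values assumed). Note that in recent pandas versions the forward/backward fills are spelled as the dedicated .ffill() and .bfill() methods:

```python
import numpy as np
import pandas as pd

# Hypothetical frame (assumed values)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 333, 111, 555, 666],
                   'col_3': ['alpha', 'bravo', 'charlie', 'alpha', np.nan, np.nan]})

print(df['col_3'].fillna('FILL'))   # replace every NaN with a constant
print(df['col_3'].ffill())          # forward-fill: propagate last valid value
print(df['col_3'].bfill())          # backward-fill: use next valid value
```

Since the NaNs sit at the end of col_3, bfill() has no “next valid observation” to use, so those entries stay NaN; ffill() fills them with the last valid value.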

unique() : This function returns all the unique values in your selected data. Let's see how it works on all the columns in our dataframe.
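For example, on the hypothetical frame (values assumed):

```python
import numpy as np
import pandas as pd

# Hypothetical frame (assumed values)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 333, 111, 555, 666],
                   'col_3': ['alpha', 'bravo', 'charlie', 'alpha', np.nan, np.nan]})

print(df['col_2'].unique())   # each distinct value, 111 appears once
print(df['col_3'].unique())   # note: NaN IS included in unique()
```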

nunique() : Returns how many unique values exist.
Notice the difference here: NaN is counted as a missing value and will not be included, so you will get "3" for col_3.
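On the hypothetical frame (values assumed), col_3 holds three distinct strings plus NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical frame (assumed values)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 333, 111, 555, 666],
                   'col_3': ['alpha', 'bravo', 'charlie', 'alpha', np.nan, np.nan]})

print(df['col_3'].nunique())               # NaN excluded by default
print(df['col_3'].nunique(dropna=False))   # count NaN as a value too
```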

value_counts() : This is another elegant function provided by pandas. Think about data with millions of entries where you want to know how many times each value appears in a certain column; value_counts() is going to be your option, as it returns every value along with its count!
NaN is counted as a missing value, so it does not appear in the output.
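A sketch on the hypothetical frame (values assumed):

```python
import numpy as np
import pandas as pd

# Hypothetical frame (assumed values)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 333, 111, 555, 666],
                   'col_3': ['alpha', 'bravo', 'charlie', 'alpha', np.nan, np.nan]})

print(df['col_3'].value_counts())              # NaN rows excluded
print(df['col_3'].value_counts(dropna=False))  # include NaN in the counts
```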

unique(), nunique(), and value_counts() are three very useful and frequently used methods associated with finding unique values in a dataset.

sort_values() : By default, it rearranges the data in ascending order; for a permanent change, inplace needs to be set to True. As usual, axis is either 0 or 1, and the default sorting algorithm is quicksort. na_position is another parameter that is sometimes useful (depending on the work); it is set to 'last' by default, which puts NaNs at the end.
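For example, on the hypothetical frame (values assumed):

```python
import numpy as np
import pandas as pd

# Hypothetical frame (assumed values)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 333, 111, 555, 666],
                   'col_3': ['alpha', 'bravo', 'charlie', 'alpha', np.nan, np.nan]})

print(df.sort_values(by='col_2'))                       # ascending by default
print(df.sort_values(by='col_3', na_position='first'))  # NaNs first, not last
# df.sort_values(by='col_2', inplace=True)  # would make the order permanent
```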

apply() : Indeed, this is one of the most powerful pandas features. Using the apply() method, we can broadcast our customized functions over our data.
Let's practically use this method to calculate the square of the values in col_1.

Let’s broadcast our customized function "square" using the "apply" method to calculate squares of col_1 in our DataFrame, df.
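A minimal sketch, using the hypothetical frame from earlier (values assumed):

```python
import numpy as np
import pandas as pd

# Hypothetical frame (assumed values)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 333, 111, 555, 666],
                   'col_3': ['alpha', 'bravo', 'charlie', 'alpha', np.nan, np.nan]})

def square(value):
    """Return the square of a single value."""
    return value ** 2

print(df['col_1'].apply(square))   # square applied element-wise
```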

A lambda expression is much more convenient for the above operation; let's try!
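The same operation as a one-liner (again on the hypothetical frame, values assumed):

```python
import numpy as np
import pandas as pd

# Hypothetical frame (assumed values)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 333, 111, 555, 666],
                   'col_3': ['alpha', 'bravo', 'charlie', 'alpha', np.nan, np.nan]})

# No named helper function needed -- the lambda is defined inline.
print(df['col_1'].apply(lambda value: value ** 2))
```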

Try the code below. What do you see, and why? TypeError: object of type 'float' has no len()

df['col_3'].apply(len)
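The error occurs because missing values are stored as the float NaN, and len() is not defined for floats. A sketch of a workaround, dropping the NaNs first (using the hypothetical frame from earlier, values assumed):

```python
import numpy as np
import pandas as pd

# Hypothetical frame (assumed values)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 333, 111, 555, 666],
                   'col_3': ['alpha', 'bravo', 'charlie', 'alpha', np.nan, np.nan]})

# df['col_3'].apply(len)  # raises TypeError: NaN is a float, len() fails
print(df['col_3'].dropna().apply(len))   # works once the NaNs are removed
```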

Data Selection

Let’s talk about selecting data once again. We have learned how to grab data in our previous lectures as well.

  • We can grab a column with its name, do the conditional selection and much more ….
  • We can use loc and iloc to find rows as well.

Let’s revise the conditional selection, this also includes data selection based on the column name.

Let’s do the following steps:

  • Task 1: df['col_1'] > 2 : returns the data where condition is True (if you remember, this is just a boolean series)
  • Task 2: df['col_2'] == 111 : returns the data where condition is True
  • Task 3: Let’s combine these two conditions with & by putting each in ().
  • Task 4: Wrap the combined condition in df[] and see what it returns!

Our one-line condition for the above tasks is (df['col_1'] > 2) & (df['col_2'] == 111)
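The four tasks can be sketched as follows (using the hypothetical frame from earlier, values assumed):

```python
import numpy as np
import pandas as pd

# Hypothetical frame (assumed values)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 333, 111, 555, 666],
                   'col_3': ['alpha', 'bravo', 'charlie', 'alpha', np.nan, np.nan]})

mask = (df['col_1'] > 2) & (df['col_2'] == 111)   # a boolean Series
print(df[mask])   # only the row(s) where BOTH conditions are True
```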

Good to know

Let’s create a copy of df and try inplace=True to see a practical example of this parameter.
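A sketch (on the hypothetical frame, values assumed): the copy is modified permanently, while the original stays untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical frame (assumed values)
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 333, 111, 555, 666],
                   'col_3': ['alpha', 'bravo', 'charlie', 'alpha', np.nan, np.nan]})

df_copy = df.copy()            # independent copy; changes won't touch df
df_copy.dropna(inplace=True)   # permanent change, but only on df_copy
print(df_copy.shape, df.shape)
```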

pivot_table()

Moving forward, let’s explore one of the most powerful operations that we usually perform in Excel: the pivot table.
pivot_table() is indeed one of the key operations conveniently handled in pandas; it creates a spreadsheet-style pivot table as a DataFrame. The function takes three key parameters, "values, index, columns", and creates hierarchical indexes from the given index and columns in the resultant dataframe.

  • values default is None
  • index default is None
  • columns default is None

Let’s create a pivot table from our dataframe df. We want our data points from col_2, so values='col_2'; we want col_1 as the index, index='col_1'; and the columns come from col_3, columns='col_3'.
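A sketch on the hypothetical frame (values assumed):

```python
import numpy as np
import pandas as pd

# Hypothetical frame (assumed values); col_3 has NaN in its last two rows
df = pd.DataFrame({'col_1': [1, 2, 3, 4, 5, 6],
                   'col_2': [111, 222, 333, 111, 555, 666],
                   'col_3': ['alpha', 'bravo', 'charlie', 'alpha', np.nan, np.nan]})

# Rows where col_3 is NaN can't become a column label, so they are dropped.
pt = df.pivot_table(values='col_2', index='col_1', columns='col_3')
print(pt)
```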

Excel users are usually familiar with pivot_table, if you just learned about this function, don’t worry at this stage as we will be discussing it in the coming sections according to our requirements.

Important Notice : NaNs appear among the values from col_2, but what happened to the NaN entries in col_3?
Well, NaN in col_3 cannot be used as a column name in the pivot table, hence rows 4 and 5 are skipped.

Let’s have a look at another example of pivot_table().

In our resultant dataframe from foobar, we want:

  • data points from D ==> values = 'D'
  • index from A,B in multilevel ==> index = ['A','B']
  • columns from C, ==> columns = ['C'].
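The frame foobar isn’t reproduced in this article, so the data below is a hypothetical stand-in chosen to exercise the multilevel index; only the column names A, B, C, D come from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical 'foobar' data (the actual values are an assumption)
foobar = pd.DataFrame({'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
                       'B': ['one', 'one', 'two', 'two', 'one', 'one'],
                       'C': ['x', 'y', 'x', 'y', 'x', 'y'],
                       'D': [1, 3, 2, 5, 4, 1]})

# MultiIndex (A, B) on the rows, one column per distinct value of C
pt = foobar.pivot_table(values='D', index=['A', 'B'], columns=['C'])
print(pt)
```

Combinations of (A, B, C) that never occur in the data show up as NaN in the pivot table.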

This lecture was a little long, but you did it. Amazing!
This was all about the pandas essentials that we wanted to learn at the moment. The concept of a pivot table might look a little complex and need some understanding (if this is your first time); I suggest repeating it with more examples.

Please have a quick recap and try repeating the whole lecture in a new Jupyter notebook. We are going to apply all of these concepts in the coming exercises and throughout this course! Keep practicing to brush up and add new skills.

Excellent work!

Your claps and shares can help us reach someone who is struggling to learn these concepts.

Good luck!

See you in the next lecture, “A12: Pandas (Practice Exercises >> 1: Ecommerce Purchases)”.

Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a Subject Matter Specialist, Data Science & Machine Learning Consultant and a Team Builder. He is a Professional Development Coach, Mentor, Author, and Invited Speaker. He can be reached for consulting projects and/or professional development training via LinkedIn or through ScienceAcademy.ca.

--


Written by Junaid Qazi, PhD

We offer professional development, corporate training, consulting, curriculum and content development in Data Science, Machine Learning and Blockchain.