A21: Pandas Built-in Data Visualization Capabilities

Junaid Qazi, PhD
8 min read · Dec 13, 2021

This article is a part of Data Science from Scratch — Can I to I Can”, A Lecture Notes Book Series. (click here to get your copy today!)

Click here for the previous article/lecture on “A20: Seaborn (Part-5): Statistical Data Visualization (Controlling Figure Aesthetics)”.

✅ A Suggestion: Open a new Jupyter notebook and type the code while reading this article. Doing is learning, and yes, “PLEASE read the comments, they are very useful!”

💐Click here to FOLLOW ME for new contents💐

Welcome to pandas for data visualization.

After exploring matplotlib and seaborn, let’s have a quick look at pandas for visualization purposes.

Indeed, pandas is one of the most important tools for data scientists today. Beyond analysis, we can get acceptable visualizations with pandas as well. Let’s talk about pandas' built-in data visualization capabilities, which are built on matplotlib and embedded into pandas for quick use.

Let’s take a look!

Let’s create some random datasets to work with and learn by doing: df1 with dates as its index and df2 with a sequential index. Creating them is good practice for your pandas skills as well!

We could use seaborn's built-in datasets too. However, to keep this lecture separate and independent, I am creating new dataframes (you can use any dataset of your own choice to practice).

Generating dataframe df1
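The original notebook cell is not reproduced here, so the following is only a minimal sketch of one way to build df1. The seed, the start date, the number of rows and the column names "A", "B" and "C" are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the numbers are reproducible

# 500 daily timestamps to use as the index (the start date is an arbitrary choice)
dates = pd.date_range("2021-01-01", periods=500, freq="D")

# Three columns of standard-normal random numbers (column names are assumed)
df1 = pd.DataFrame(rng.standard_normal((500, 3)), index=dates, columns=["A", "B", "C"])
df1.head()
```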

Let’s add col_D as a column to df1
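A possible way to add the column is sketched below. What col_D actually contained is not shown in the text, so the uniform random values between 0 and 1 are an assumption, and df1 is rebuilt so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df1 (assumed layout: date index, three random columns)
df1 = pd.DataFrame(rng.standard_normal((500, 3)),
                   index=pd.date_range("2021-01-01", periods=500, freq="D"),
                   columns=["A", "B", "C"])

# Assigning to a new label creates the column; the values here are assumed
df1["col_D"] = rng.random(len(df1))
df1.head()
```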

Generating dataframe df2
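Again as a sketch only: df2 could be a small dataframe of non-negative random numbers with the default sequential index. The ten rows and the column names "X", "Y" and "Z" are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# No index is passed, so pandas creates a sequential RangeIndex 0, 1, 2, ...
df2 = pd.DataFrame(rng.random((10, 3)), columns=["X", "Y", "Z"])
df2.head()
```

Non-negative values are convenient here because the area and pie plots later on expect them.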

Now that we have the dataframes, let’s start with a simple histogram. We can get the hist plot in two ways:

  • DataFrame.hist()
  • DataFrame.plot(kind = 'hist')

Let’s try both, one by one. Later on, we will talk about style sheets and discuss the range of plotting options in detail!
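Both routes could look like the sketch below. The stand-in df1 and the bins, figsize and alpha values are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df1 (assumed layout: date index, three random columns)
df1 = pd.DataFrame(rng.standard_normal((500, 3)),
                   index=pd.date_range("2021-01-01", periods=500, freq="D"),
                   columns=["A", "B", "C"])

# Way 1: DataFrame.hist() draws one subplot per column and returns an array of axes
axes = df1.hist(bins=30, figsize=(10, 6))

# Way 2: DataFrame.plot(kind='hist') draws all columns on a single set of axes
ax = df1.plot(kind="hist", bins=30, alpha=0.5)
```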

Style Sheets

Matplotlib has style sheets. We can use them to make our plots look a little nicer. These style sheets include bmh, fivethirtyeight, ggplot and more. They basically create a set of style rules that our plots follow. It's good to use them because they give all our plots the same look and make them feel more professional.
Let's call the style first!
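A minimal sketch of selecting a style is shown below. plt.style.available lists the style names shipped with your matplotlib version.

```python
import matplotlib.pyplot as plt

# See which style sheets this matplotlib installation provides
print(plt.style.available)

# Apply the ggplot style to every plot created from now on
plt.style.use("ggplot")
```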

Let’s stick with the ggplot style and explore how to utilize pandas built-in plotting capabilities!

Plot Types

There are several built-in plot types (given below) in pandas, most of them are statistical plots by nature:

  • df.plot.area
  • df.plot.bar
  • df.plot.barh
  • df.plot.hist
  • df.plot.line
  • df.plot.scatter
  • df.plot.box
  • df.plot.hexbin
  • df.plot.kde
  • df.plot.density
  • df.plot.pie

We can also just call df.plot(kind='hist') and replace the kind argument with any of the key terms shown in the list above (e.g. 'box', 'barh', etc.).
Let's go through these plots one by one using our dataframes df1 and df2!

Area plot
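An area plot could be drawn as in the sketch below. The stand-in df2 (non-negative random values, assumed columns "X", "Y", "Z") and the alpha value are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df2 (assumed layout: sequential index, non-negative values)
df2 = pd.DataFrame(rng.random((10, 3)), columns=["X", "Y", "Z"])

# Areas are stacked by default; alpha makes them semi-transparent
ax = df2.plot.area(alpha=0.4)
```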

Bar & barh plots
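The sketch below shows vertical, stacked and horizontal bars on a stand-in df2 (assumed columns "X", "Y", "Z"). Each row of the dataframe becomes one group of bars.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df2 (assumed layout: sequential index, non-negative values)
df2 = pd.DataFrame(rng.random((10, 3)), columns=["X", "Y", "Z"])

ax_bar = df2.plot.bar()                  # one group of vertical bars per row
ax_stacked = df2.plot.bar(stacked=True)  # columns stacked on top of each other
ax_barh = df2.plot.barh()                # same data as horizontal bars
```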

Histograms

We can get all the columns on the same plot!
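For example, calling plot.hist on the whole dataframe overlays every column on the same axes. The stand-in df1 and the bins and alpha values are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df1 (assumed layout: date index, three random columns)
df1 = pd.DataFrame(rng.standard_normal((500, 3)),
                   index=pd.date_range("2021-01-01", periods=500, freq="D"),
                   columns=["A", "B", "C"])

# All columns share one set of axes; alpha keeps overlapping bars visible
ax = df1.plot.hist(bins=25, alpha=0.5)
```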

A histogram can be stacked using stacked=True. This is not what we use often but it is good to know!
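A stacked histogram might look like the sketch below (stand-in df1, assumed bin count).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df1 (assumed layout: date index, three random columns)
df1 = pd.DataFrame(rng.standard_normal((500, 3)),
                   index=pd.date_range("2021-01-01", periods=500, freq="D"),
                   columns=["A", "B", "C"])

# stacked=True piles the bars of each column on top of one another
ax = df1.plot.hist(stacked=True, bins=25)
```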

We can pass other keywords supported by matplotlib's hist. For example, horizontal and cumulative histograms can be drawn with orientation='horizontal' and cumulative=True.
Let's pass orientation='horizontal'.
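Both keywords are sketched below on a stand-in df1. The bin count and alpha are assumptions, and the cumulative example is an extra illustration of the second keyword mentioned above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df1 (assumed layout: date index, three random columns)
df1 = pd.DataFrame(rng.standard_normal((500, 3)),
                   index=pd.date_range("2021-01-01", periods=500, freq="D"),
                   columns=["A", "B", "C"])

# Horizontal histogram: the counts now run along the x-axis
ax_h = df1.plot.hist(orientation="horizontal", bins=25, alpha=0.5)

# Cumulative histogram of one column: the last bar reaches the number of rows
ax_c = df1[["A"]].plot.hist(cumulative=True, bins=25)
```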

Line plots

Notice that we don’t need to pass anything if we want the index along “x”. Because our index consists of dates, pandas tries to format the x-axis nicely.
To Do: Try matplotlib's gcf().autofmt_xdate() and see the difference!
A quick note: df1.plot.line(x = df1.index, y='B',figsize=(12,3),lw=1) worked with older versions of pandas. In newer versions, we can skip x when we want the index on the x-axis!
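With a recent pandas version, the call from the note above can be written without x, as in this sketch (stand-in df1 with an assumed column "B").

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df1 (assumed layout: date index, three random columns)
df1 = pd.DataFrame(rng.standard_normal((500, 3)),
                   index=pd.date_range("2021-01-01", periods=500, freq="D"),
                   columns=["A", "B", "C"])

# No x is passed, so the date index is used for the x-axis
ax = df1.plot.line(y="B", figsize=(12, 3), lw=1)
```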

Scatter plots

A scatter plot can be drawn with the DataFrame.plot.scatter() method. It requires numeric columns for the "x" and "y" axes, which are specified with the "x" and "y" keywords.
Let's try a scatter plot with df1 here!
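A basic scatter plot could look like this sketch (stand-in df1; the choice of columns "A" and "B" is an assumption).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df1 (assumed layout: date index, three random columns)
df1 = pd.DataFrame(rng.standard_normal((500, 3)),
                   index=pd.date_range("2021-01-01", periods=500, freq="D"),
                   columns=["A", "B", "C"])

# x and y are the names of numeric columns
ax = df1.plot.scatter(x="A", y="B")
```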

To plot multiple column groups on a single axes, we repeat the plot method and pass the target axes as ax. It is recommended to specify the color and label keywords to distinguish each group.
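The idea can be sketched as follows. The stand-in df1, the column pairs, the colors and the labels are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df1 (assumed layout: date index, three random columns)
df1 = pd.DataFrame(rng.standard_normal((500, 3)),
                   index=pd.date_range("2021-01-01", periods=500, freq="D"),
                   columns=["A", "B", "C"])

# The first call creates the axes, the second call draws on the same axes via ax=
ax = df1.plot.scatter(x="A", y="B", color="black", label="A vs B")
df1.plot.scatter(x="A", y="C", color="red", label="A vs C", ax=ax)
```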

The keyword c may be given as the name of a column to provide a color for each point. In this way, we get the information from a third column in the form of color. cmap can be used to select the colormap in this case. It's handy!
The full list of colormaps is available in the matplotlib documentation.
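Here is a sketch of coloring by a third column (stand-in df1; the column "C" and the "coolwarm" colormap are assumptions).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df1 (assumed layout: date index, three random columns)
df1 = pd.DataFrame(rng.standard_normal((500, 3)),
                   index=pd.date_range("2021-01-01", periods=500, freq="D"),
                   columns=["A", "B", "C"])

# Each point is colored by its value in column "C"; a colorbar is added automatically
ax = df1.plot.scatter(x="A", y="B", c="C", cmap="coolwarm")
```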

We can use s to set the size of each bubble based on another column.
Passing s as an array of sizes works across pandas versions (older versions did not accept just the name of a column; recent ones do):
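A sketch of sizing the bubbles from another column is given below. The stand-in df1, the contents of col_D and the scaling factor of 200 are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df1 (assumed layout: date index, three random columns)
df1 = pd.DataFrame(rng.standard_normal((500, 3)),
                   index=pd.date_range("2021-01-01", periods=500, freq="D"),
                   columns=["A", "B", "C"])
df1["col_D"] = rng.random(len(df1))  # assumed non-negative values for the sizes

# s receives an array of marker sizes, scaled up so the bubbles are visible
sizes = (df1["col_D"] * 200).to_numpy()
ax = df1.plot.scatter(x="A", y="B", s=sizes, alpha=0.5)
```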

Box plots

Box plots visualize the distribution of values within each column. Recall your understanding of the box plot from seaborn!
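A basic box plot could be drawn as in this sketch (stand-in df2 with assumed columns "X", "Y", "Z").

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df2 (assumed layout: sequential index, non-negative values)
df2 = pd.DataFrame(rng.random((10, 3)), columns=["X", "Y", "Z"])

# One box per numeric column
ax = df2.plot.box()
```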

We can color the boxplot by passing color keyword. We can pass a dictionary whose keys are boxes, whiskers, medians and caps. If some keys are missing in the dictionary, default colors are used for the corresponding artists.
Let's create a dictionary color to create a boxplot with different colors!
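The dictionary could look like the sketch below. The particular color names and the sym="r+" outlier marker are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df2 (assumed layout: sequential index, non-negative values)
df2 = pd.DataFrame(rng.random((10, 3)), columns=["X", "Y", "Z"])

# One color per artist type; any key left out falls back to the default color
color = {"boxes": "DarkGreen", "whiskers": "DarkOrange",
         "medians": "DarkBlue", "caps": "Gray"}

# sym is forwarded to matplotlib and sets the marker used for outliers
ax = df2.plot.box(color=color, sym="r+")
```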

We can also pass a by argument for groupby in the box plots.
We need to create another column with some choices e.g. "A" and "B" in our dataframe for this purpose.

We can use DataFrame.boxplot to create box plots for columns "X" and "Y". Let's pass by="group" now.
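A sketch of the grouped box plot follows. The stand-in df2 and the half "A", half "B" split of the group column are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df2 (assumed layout: sequential index, non-negative values)
df2 = pd.DataFrame(rng.random((10, 3)), columns=["X", "Y", "Z"])

# A column of choices to group by: first five rows "A", last five rows "B"
df2["group"] = ["A"] * 5 + ["B"] * 5

# One subplot per requested column, each with one box per group
axes = df2.boxplot(column=["X", "Y"], by="group")
```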

So, in the plot above, we got two box plots for the passed columns “X” and “Y”. Now, I am going to drop the grouping column because I don’t need it anymore!
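Dropping the helper column could be done as sketched here. The column name "group" is an assumption that matches the by="group" call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df2 including the assumed grouping column
df2 = pd.DataFrame(rng.random((10, 3)), columns=["X", "Y", "Z"])
df2["group"] = ["A"] * 5 + ["B"] * 5

# drop returns a new dataframe without the column, so we reassign df2
df2 = df2.drop(columns="group")
```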

Hexagonal Bin Plot

When your data are too dense to plot each point individually, hexbin plots are a very useful alternative to scatter plots! (Recall your understanding of the hexbin plot from the seaborn section.)
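A hexbin plot with default settings could look like this sketch (stand-in df1; the columns "A" and "B" are assumptions).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df1 (assumed layout: date index, three random columns)
df1 = pd.DataFrame(rng.standard_normal((500, 3)),
                   index=pd.date_range("2021-01-01", periods=500, freq="D"),
                   columns=["A", "B", "C"])

# Points are counted per hexagonal bin; the count is shown as color
ax = df1.plot.hexbin(x="A", y="B")
```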

A useful keyword argument is gridsize. It controls the number of hexagons in the x-direction and defaults to 100. A larger gridsize means more, smaller bins.
Let's pass gridsize=25 and see how the plot above changes!
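The same plot with a coarser grid is sketched below (stand-in df1; the "Oranges" colormap is an assumption, picked so that fuller bins appear darker).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df1 (assumed layout: date index, three random columns)
df1 = pd.DataFrame(rng.standard_normal((500, 3)),
                   index=pd.date_range("2021-01-01", periods=500, freq="D"),
                   columns=["A", "B", "C"])

# 25 hexagons across the x-direction instead of the default 100
ax = df1.plot.hexbin(x="A", y="B", gridsize=25, cmap="Oranges")
```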

The hexbin plot above is very useful: with a sequential colormap, the more data points fall in a bin, the darker the bin is!

Kernel Density Estimation plot (KDE)

This is another useful plot, and we have already learned about it in detail in the seaborn section. Let’s create a kde plot in pandas, it's super easy!
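A KDE plot of a single column could look like the sketch below. Note that pandas relies on SciPy for the density estimate, so SciPy needs to be installed; the stand-in df1 and the column "A" are assumptions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df1 (assumed layout: date index, three random columns)
df1 = pd.DataFrame(rng.standard_normal((500, 3)),
                   index=pd.date_range("2021-01-01", periods=500, freq="D"),
                   columns=["A", "B", "C"])

# Smooth estimate of the distribution of column "A" (requires SciPy)
ax = df1["A"].plot.kde()
```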

We can use df.plot.density() as well, it draws the same plot as kde!
Let's try the complete dataframe df1 with density!
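On the whole dataframe, the call draws one curve per column, as in this sketch (stand-in df1; SciPy is required here as well).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df1 (assumed layout: date index, three random columns)
df1 = pd.DataFrame(rng.standard_normal((500, 3)),
                   index=pd.date_range("2021-01-01", periods=500, freq="D"),
                   columns=["A", "B", "C"])

# One density curve per column on shared axes (requires SciPy)
ax = df1.plot.density()
```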

Pie Plot

We can create a pie plot with DataFrame.plot.pie() or Series.plot.pie(). If your data includes any NaN, they will be automatically filled with 0. A ValueError will be raised if there are any negative values in your data.
Let's try pie plot with df2.

We need to specify a target column with the y argument. When y is specified, a pie plot of the selected column is drawn.
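A pie plot of one column could be drawn as sketched here (stand-in df2; the choice of column "X" is an assumption). Each row becomes one wedge, labeled by its index value.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df2 (assumed layout: sequential index, non-negative values)
df2 = pd.DataFrame(rng.random((10, 3)), columns=["X", "Y", "Z"])

# y selects the column whose values set the wedge sizes
ax = df2.plot.pie(y="X")
```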

The plot above does not look nice. Let's remove the legend and set the figure size!
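The tidier version is sketched below (stand-in df2; the 6 by 6 inch figure size is an assumption).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df2 (assumed layout: sequential index, non-negative values)
df2 = pd.DataFrame(rng.random((10, 3)), columns=["X", "Y", "Z"])

# legend=False hides the legend; a square figure keeps the pie round
ax = df2.plot.pie(y="X", legend=False, figsize=(6, 6))
```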

Instead of passing y, we can pass subplots=True. This generates a pie plot for each column as a subplot. A legend is drawn in each pie plot by default; specify legend=False to hide it.
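The subplots variant could look like this sketch (stand-in df2; the figure size is an assumption).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Stand-in for df2 (assumed layout: sequential index, non-negative values)
df2 = pd.DataFrame(rng.random((10, 3)), columns=["X", "Y", "Z"])

# One pie per column, drawn side by side; returns an array of axes
axes = df2.plot.pie(subplots=True, legend=False, figsize=(12, 4))
```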

That was all about data visualization using pandas. You can see how convenient pandas' plotting capabilities are during Exploratory Data Analysis (EDA). They balance ease of use with control over the figure, and many of the plot calls also accept the additional keyword arguments of the underlying matplotlib call.

<<Keep practicing to brush-up and add new skills>>

Excellent work!

Your clap and share can help us reach someone who is struggling to learn these concepts.

Good luck!

See you in the next lecture on “A22: Linear Regression (Part-1) — Simple & Multiple Linear Regression, Under & Over Fitting, No Free Lunch, Bias Variance Trade-off”.

Note: This complete course, including video lectures and jupyter notebooks, is available on the following links:

About Dr. Junaid Qazi:

Dr. Junaid Qazi is a Subject Matter Specialist, Data Science & Machine Learning Consultant and a Team Builder. He is a Professional Development Coach, Mentor, Author, and Invited Speaker. He can be reached for consulting projects and/or professional development training via LinkedIn.
