A21: Pandas Built-in Data Visualization Capabilities
This article is part of “Data Science from Scratch — Can I to I Can”, a lecture notes book series.
✅ A suggestion: open a new Jupyter notebook and type the code while reading this article. Doing is learning, and yes, please read the comments, they are very useful!
Welcome to pandas for data visualization.
After exploring matplotlib and seaborn, let’s have a quick look at pandas for visualization.
pandas is one of the most important tools for data scientists today. Beyond analysis, it can also produce acceptable visualizations. Let’s talk about the built-in data visualization capabilities of pandas, which are built on matplotlib and embedded into pandas for quick use.
Let’s take a look!
Let’s create some random datasets to work with and learn by doing: df1 with dates as the index and df2 with a sequential index. This is also good practice for your skills!
We could use seaborn's built-in datasets as well. However, to keep this lecture separate and independent, I am creating new dataframes (you can use any dataset of your own choice for practice).
Generating dataframe df1
Let’s add col_D as a column to df1
Generating dataframe df2
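The original code cells are not shown here, so the following is only a minimal sketch of how the two dataframes could be generated. The column names (A, B, C for df1 and X, Y, Z for df2), the row counts, the start date, and the random seed are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seeded so the numbers are reproducible

# df1: random normal data with a daily date index (sizes and names are assumptions)
dates = pd.date_range("2020-01-01", periods=500, freq="D")
df1 = pd.DataFrame(rng.standard_normal((500, 3)), index=dates, columns=["A", "B", "C"])

# Add col_D as a new column to df1
df1["col_D"] = rng.standard_normal(500)

# df2: uniform values in [0, 1) with the default sequential index 0, 1, 2, ...
df2 = pd.DataFrame(rng.random((20, 3)), columns=["X", "Y", "Z"])

print(df1.head())
print(df2.head())
```

Any dataframe with numeric columns will work for the plots that follow; non-negative values in df2 are convenient later for the area and pie plots.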
Now that we have the dataframes, let’s start with a simple histogram plot. We can get the histogram in two ways:
DataFrame.hist()
DataFrame.plot(kind='hist')
Let’s try both one by one first. Later on, we will talk about style sheets and discuss the range of plotting options in detail!
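Here is a small sketch of the two approaches side by side. The dataframe below is a hypothetical stand-in for df1, and the bins, alpha and figsize values are arbitrary choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical stand-in for df1: 500 random rows, two columns
df1 = pd.DataFrame(rng.standard_normal((500, 2)), columns=["A", "B"])

# Way 1: DataFrame.hist() draws one subplot per column
axes = df1.hist(bins=30, figsize=(10, 4))

# Way 2: DataFrame.plot(kind='hist') draws all columns on one set of axes
ax = df1.plot(kind="hist", bins=30, alpha=0.5, figsize=(10, 4))
```

The practical difference is the layout: the first call gives a grid of separate histograms, the second overlays the columns so they can be compared directly.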
Style Sheets
Matplotlib has style sheets that we can use to make our plots look a little nicer. These include bmh, fivethirtyeight, ggplot and more. Each one is a set of style rules that our plots follow. It's good to use them because they give all our plots the same look and a more professional feel.
Let's call the style first!
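A minimal sketch of calling a style, using matplotlib's plt.style interface. The choice of "ggplot" here simply matches the style used in the rest of this lecture.

```python
import matplotlib.pyplot as plt

# List the style sheets shipped with the installed matplotlib
print(plt.style.available)

# Apply one style to every plot created afterwards in this session
plt.style.use("ggplot")
```

Swapping in "bmh" or "fivethirtyeight" is a one-word change, so it is easy to try a few and compare.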
Let’s stick with the ggplot style and explore how to use the built-in plotting capabilities of pandas!
Plot Types
There are several built-in plot types in pandas (listed below); most of them are statistical plots by nature:
df.plot.area
df.plot.bar
df.plot.barh
df.plot.hist
df.plot.line
df.plot.scatter
df.plot.box
df.plot.hexbin
df.plot.kde
df.plot.density
df.plot.pie
We can also just call df.plot(kind='hist') or replace the kind argument with any of the key terms from the list above (e.g. 'box', 'barh', etc.).
Let's go through these plots one by one using our dataframes df1 and df2!
Area plot
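A minimal sketch of an area plot. The dataframe is a hypothetical stand-in for df2 with non-negative values, which the default stacked area plot needs (all values must share the same sign).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
# Hypothetical stand-in for df2: non-negative values, sequential index
df2 = pd.DataFrame(rng.random((10, 3)), columns=["X", "Y", "Z"])

# Stacked area plot; alpha makes the filled regions semi-transparent
ax = df2.plot.area(alpha=0.4, figsize=(10, 4))
```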
Bar & barh plots
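A short sketch of vertical, stacked and horizontal bar plots, again on a hypothetical stand-in for df2. Each row of the dataframe becomes one group of bars.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Hypothetical stand-in for df2
df2 = pd.DataFrame(rng.random((10, 3)), columns=["X", "Y", "Z"])

# Vertical bars: one group of bars per index value
ax_bar = df2.plot.bar(figsize=(10, 4))

# Stacked bars: columns are stacked on top of each other
ax_stacked = df2.plot.bar(stacked=True, figsize=(10, 4))

# Horizontal bars
ax_barh = df2.plot.barh(figsize=(6, 8))
```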
Histograms
We can get all the columns on the same plot!
A histogram can be stacked using stacked=True. This is not something we use often, but it is good to know!
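The overlaid and stacked variants can be sketched as follows. The data is a hypothetical stand-in for df1, and the bin count and transparency are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Hypothetical stand-in for df1
df1 = pd.DataFrame(rng.standard_normal((500, 3)), columns=["A", "B", "C"])

# All columns on the same plot; alpha keeps the overlapping bars visible
ax_overlay = df1.plot.hist(bins=30, alpha=0.5)

# Stacked histogram: the bars of each column are piled on top of each other
ax_stacked = df1.plot.hist(bins=30, stacked=True)
```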
We can pass other keywords supported by matplotlib's hist. For example, horizontal and cumulative histograms can be drawn with orientation='horizontal' and cumulative=True.
Let's pass orientation='horizontal'.
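A sketch of both keywords on a hypothetical stand-in for df1. The keywords are forwarded to matplotlib's hist, so other hist options can be passed the same way.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# Hypothetical stand-in for df1
df1 = pd.DataFrame(rng.standard_normal((500, 2)), columns=["A", "B"])

# Horizontal histogram: counts run along the x-axis
ax_h = df1.plot.hist(orientation="horizontal", bins=30, alpha=0.5)

# Cumulative histogram: each bar includes the counts of all bins before it
ax_c = df1.plot.hist(cumulative=True, bins=30, alpha=0.5)
```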
Line plots
Notice that we don’t need to pass anything if we want the index along the x-axis. Because our index consists of dates, pandas tries to format the x-axis nicely.
To do: try matplotlib's gcf().autofmt_xdate() and see the difference!
A quick note: df1.plot.line(x=df1.index, y='B', figsize=(12,3), lw=1) worked with older versions of pandas. In newer versions, we can simply omit x to plot against the index.
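A minimal sketch of the line plot against a date index. The dataframe is a hypothetical stand-in for df1, and the column name B follows the note above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
# Hypothetical stand-in for df1 with a daily date index
dates = pd.date_range("2020-01-01", periods=500, freq="D")
df1 = pd.DataFrame(rng.standard_normal((500, 2)), index=dates, columns=["A", "B"])

# No x argument: the index (dates) is used along the x-axis
ax = df1.plot.line(y="B", figsize=(12, 3), lw=1)

# Optional: rotate and align the date labels on the figure
ax.get_figure().autofmt_xdate()
```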
Scatter plots
A scatter plot can be drawn with the DataFrame.plot.scatter() method. It requires numeric columns for the x and y axes, which are specified with the x and y keywords.
Let's try a scatter plot with df1 here!
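A minimal sketch, using a hypothetical stand-in for df1 with numeric columns A and B:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(6)
# Hypothetical stand-in for df1
df1 = pd.DataFrame(rng.standard_normal((200, 2)), columns=["A", "B"])

# x and y are column names and both columns must be numeric
ax = df1.plot.scatter(x="A", y="B")
```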
To plot multiple column groups on a single axes, we repeat the plot method and specify the target axes with ax. It is recommended to specify the color and label keywords to distinguish each group.
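The idea can be sketched like this: the first call creates the axes, and the second call draws onto it through ax. The column pairs, colors and labels are arbitrary choices on a hypothetical stand-in for df1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
# Hypothetical stand-in for df1
df1 = pd.DataFrame(rng.standard_normal((200, 4)), columns=["A", "B", "C", "col_D"])

# First group: creates the axes
ax = df1.plot.scatter(x="A", y="B", color="red", label="A vs B")

# Second group: drawn on the same axes by passing ax
df1.plot.scatter(x="C", y="col_D", color="green", label="C vs col_D", ax=ax)
```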
The keyword c may be given as the name of a column to provide a color for each point. In this way, the color carries the information of a third column. cmap can be used to choose the colormap in this case. It's handy!
The full list of colormaps is available in the matplotlib documentation.
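A small sketch of coloring points by a third column. The column C and the "coolwarm" colormap are arbitrary choices on a hypothetical stand-in for df1.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(8)
# Hypothetical stand-in for df1
df1 = pd.DataFrame(rng.standard_normal((200, 3)), columns=["A", "B", "C"])

# c names the column whose values set the color; cmap picks the colormap
ax = df1.plot.scatter(x="A", y="B", c="C", cmap="coolwarm")
```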
We can use s to set the size of each bubble based on another column. Passing the values themselves as an array works in all versions; recent pandas versions also accept a column name:
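A minimal sketch that passes the sizes as an array. The size column C and the scaling factor of 200 are assumptions for illustration; sizes should be non-negative, so the stand-in data uses values in [0, 1).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(9)
# Hypothetical stand-in for df1; C holds non-negative values used for sizes
df1 = pd.DataFrame(rng.random((100, 3)), columns=["A", "B", "C"])

# s takes one size per point, so we pass the column's values scaled up
ax = df1.plot.scatter(x="A", y="B", s=df1["C"].to_numpy() * 200, alpha=0.5)
```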
Box plots
A box plot visualizes the distribution of values within each column. Recall your understanding of box plots from seaborn!
We can color the box plot by passing the color keyword. We can pass a dictionary whose keys are boxes, whiskers, medians and caps. If some keys are missing in the dictionary, default colors are used for the corresponding artists.
Let's create a dictionary color to create a box plot with different colors!
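A sketch of the color dictionary in use. The specific color names are arbitrary, and the dataframe is a hypothetical stand-in for df2.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(10)
# Hypothetical stand-in for df2
df2 = pd.DataFrame(rng.random((20, 3)), columns=["X", "Y", "Z"])

# One color per artist type; any key left out falls back to the default color
color = {
    "boxes": "DarkGreen",
    "whiskers": "DarkOrange",
    "medians": "DarkBlue",
    "caps": "Gray",
}
ax = df2.plot.box(color=color)
```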
We can also pass a by argument to group the data in the box plots.
For this purpose, we need to create another column in our dataframe with some choices, e.g. "A" and "B".
We can use DataFrame.boxplot to create a box plot for columns "X" and "Y". Let's pass by="group" now.
So, in the plot above, we got two box plots for the columns “X” and “Y”, each split by group. Now I am going to drop the grouping column because I don’t need it anymore.
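The grouping steps described above can be sketched as follows. The dataframe, the column name "group" and the way the "A"/"B" labels are assigned are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(11)
# Hypothetical stand-in for df2 with columns X and Y
df = pd.DataFrame(rng.random((10, 2)), columns=["X", "Y"])

# Add a grouping column with two choices, "A" and "B"
df["group"] = ["A"] * 5 + ["B"] * 5

# One subplot per column, each split into one box per group
axes = df.boxplot(column=["X", "Y"], by="group", figsize=(8, 4))

# Drop the grouping column once it is no longer needed
df = df.drop(columns="group")
```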
Hexagonal Bin Plot
When your data are too dense to plot each point individually, hexbin plots are a very useful alternative to scatter plots (recall your understanding of hexbin plots from the seaborn section).
A useful keyword argument is gridsize. It controls the number of hexagons in the x-direction and defaults to 100. A larger gridsize means more, smaller bins.
Let's pass gridsize=25 and see how the plot looks!
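A small sketch comparing the default gridsize with gridsize=25. The data (1000 random points in hypothetical columns a and b) and the "Oranges" colormap are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(12)
# Hypothetical dense data: 1000 points in two columns
df = pd.DataFrame(rng.standard_normal((1000, 2)), columns=["a", "b"])

# Default gridsize (100): many small hexagons
ax_default = df.plot.hexbin(x="a", y="b", cmap="Oranges")

# gridsize=25: fewer, larger hexagons, so each bin collects more points
ax_coarse = df.plot.hexbin(x="a", y="b", gridsize=25, cmap="Oranges")
```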
The hexbin plot above is very useful: with a sequential colormap, the more data points fall in a bin, the darker the bin is!
Kernel Density Estimation plot (KDE)
This is another useful plot, and we have already learned about it in detail in the seaborn section. Let’s try to create a KDE plot in pandas, it's super easy!
We can use density() as well!
Let's try the complete dataframe df1 with density!
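A minimal sketch of both calls on a hypothetical stand-in for df1. Note that the pandas KDE plots rely on SciPy being installed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(13)
# Hypothetical stand-in for df1
df1 = pd.DataFrame(rng.standard_normal((500, 3)), columns=["A", "B", "C"])

# KDE of a single column
ax_kde = df1["A"].plot.kde()

# density() draws the same kind of plot; here one curve per column of df1
ax_den = df1.plot.density()
```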
Pie Plot
We can create a pie plot with DataFrame.plot.pie() or Series.plot.pie(). If your data includes any NaN, they will be automatically filled with 0. A ValueError will be raised if there are any negative values in your data.
Let's try a pie plot with df2.
We need to specify a target column with the y argument. When y is specified, a pie plot of the selected column is drawn.
By default the plot does not look nice, so let's remove the legend and set the figure size!
Instead of passing y, we can pass subplots=True. This generates a pie plot for each column as a subplot. A legend is drawn in each pie plot by default; specify legend=False to hide it.
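Both variants can be sketched as follows, on a hypothetical stand-in for df2 with non-negative values. The figure sizes are arbitrary.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(14)
# Hypothetical stand-in for df2: non-negative values, sequential index
df2 = pd.DataFrame(rng.random((5, 2)), columns=["X", "Y"])

# Pie plot of one column: y picks the column, legend=False hides the legend
ax = df2.plot.pie(y="X", legend=False, figsize=(6, 6))

# One pie per column as subplots
axes = df2.plot.pie(subplots=True, legend=False, figsize=(10, 5))
```

Each row of the dataframe becomes one wedge, labelled with its index value.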
This was all about data visualization using pandas. You can see how convenient the pandas data visualization capabilities are during Exploratory Data Analysis (EDA). They balance ease of use with control over the figure. Many of the plot calls also accept additional arguments of their parent matplotlib library call.
Excellent work!
Good luck!
See you in the next lecture on “A22: Linear Regression (Part-1) — Simple & Multiple Linear Regression, Under & Over Fitting, No Free Lunch, Bias Variance Trade-off”.
Dr. Junaid Qazi is a Subject Matter Specialist, Data Science & Machine Learning Consultant and a Team Builder. He is a Professional Development Coach, Mentor, Author, and Invited Speaker. He can be reached for consulting projects and/or professional development training via LinkedIn.