Plotly Fundamentals - Basic Charts

In this part of the Plotly tutorial series we will look at common chart types and how to adjust their appearance. We will also see how a few additional features help when digging into a data set and uncovering its structure visually. As usual, we begin by importing the Plotly Express and pandas packages.

import plotly.express as px
import pandas as pd

Plotly Express ships with several sample data sets that are handy when you simply want to explore what the library can do. For this chapter we will use the iris data set, which is a very popular choice for introducing a variety of data science concepts. It contains measurements for three different species of iris flowers. We will use some of the basic charts provided by Plotly to see what we can learn from this data. It is loaded as a pandas data frame, so we can use the head() method to get a glimpse of how our data set looks.

iris = px.data.iris()
iris.head()
   sepal_length  sepal_width  petal_length  petal_width species  species_id
0           5.1          3.5           1.4          0.2  setosa           1
1           4.9          3.0           1.4          0.2  setosa           1
2           4.7          3.2           1.3          0.2  setosa           1
3           4.6          3.1           1.5          0.2  setosa           1
4           5.0          3.6           1.4          0.2  setosa           1

One of the most common ways to display the distribution of a data set is the histogram, which displays the number of data points that fall into each range of values, called a “bin”. We can achieve this easily with the histogram function by providing the data set as input together with the column for which the counts should be displayed. In our case we will use “sepal_width”, which refers to a column in our data set. You can also set the number of bins with the nbins parameter and adjust the gap between the bars via bargap for some additional styling.

fig = px.histogram(iris, x="sepal_width", nbins=40)
fig.update_layout(bargap=0.1)
fig.show()

Alright cool, this looks pretty much like a standard bell-shaped distribution. As a next step we can stack the bars and color them by a categorical attribute. To do that, we pass the column name ‘species’ to the color parameter.

fig = px.histogram(iris, x="sepal_width", color='species', nbins=40)
fig.update_layout(bargap=0.1)
fig.show()

While this is looking pretty good already, it does not tell us much about the characteristics of our data set. Just knowing the sepal_width does not seem to tell us much about the species of a particular flower. So let's check whether the other columns reveal more structure.

fig = px.histogram(iris, x="petal_width", nbins=40)
fig.update_layout(bargap=0.1)
fig.show()

Interestingly, once we display petal_width there seems to be a cluster of data separate from the rest at the lower end of the measured values. Applying the same coloring as above reveals that all those values belong to the species Setosa. This is interesting because it means that we might be able to separate Setosa flowers from the others just by knowing a flower's petal width. For the other two species this is not that easy. This shows why the iris data set is so popular: it demonstrates potential avenues for classifying data in an easy-to-understand way.

fig = px.histogram(iris, x="petal_width", color='species', nbins=40)
fig.update_layout(bargap=0.1)
fig.show()

Next, we will see what else we can learn by naively exploring the data in a visual fashion. To do that, we use the marginal parameter, which adds a small statistical summary plot above the histogram for each of the categories you provided in the color argument. We will use the ‘box’ option, but other representations such as ‘violin’ and ‘rug’ are possible.

fig = px.histogram(iris, x="petal_width", color='species', nbins=40, marginal='box', height=600)
fig.update_layout(bargap=0.1)
fig.show()

One more thing that comes in handy is to normalize the data such that the y axis is denominated in percent rather than a count of values in the bin.

fig = px.histogram(iris, x="petal_width", color='species', nbins=40, marginal='box', height=600, histnorm='percent')
fig.update_layout(bargap=0.1)
fig.show()

Another popular way to spice up your histograms is to add a smooth curve that interpolates the statistical distribution represented by the histogram bars. This is often called a distribution plot, or distplot. Unfortunately, Plotly provides this feature neither in the express nor in the graph objects package. In this case we have to resort to the more experimental figure factory package, which is a collection of some of the more fringe plot types you might want to produce. Here, the data and the group labels have to be passed as lists, which is why we do a simple reshaping exercise before passing our values to the function create_distplot.

import plotly.figure_factory as ff

data = [iris[iris['species']==sp]['petal_width'].values for sp in iris['species'].unique()]
species = list(iris['species'].unique())

fig = ff.create_distplot(data, species, bin_size = 0.1)
fig.update_layout(height = 600)
fig.show()

Alright, now you might wonder what the hell you are supposed to do if your data set has many dimensions and you don’t know what to look for. One option is certainly to just try your luck and iterate through all sorts of data slices until you find something worth exploring further. I am certainly guilty of doing that myself out of nothing but laziness. However, there is a smarter way to go about it. We can simply dump our entire data set into the express function scatter_matrix. The function gives you a large NxN canvas of subplots showing every pairwise combination of the columns in your data. The best part is that you can use the selection tools Plotly provides with every chart to cross-filter and dig for interesting angles.

fig = px.scatter_matrix(iris, height = 800)
fig.show()

In case you want to focus on a part of the data set, you can simply provide a list of column names to the dimensions argument. As before, we can further improve the plot by coloring the data points according to a set of classes via the color argument. This is often useful to see in one single plot how categorical values are distributed across the data set. Once again, this technique reveals how the Setosa species can be singled out easily from the other two. As opposed to the single slice we looked at before, we have now revealed some features that might be suited for a classification strategy for the Versicolor and Virginica species, such as a combination of sepal length and sepal width.

fig = px.scatter_matrix(iris, dimensions = ['sepal_width', 'sepal_length', 'petal_width', 'petal_length'], color = 'species', height = 800)
fig.show()

Finally, let's have a look at another nifty feature that we have already seen when working with the one-dimensional input data used by histograms. For scatter and other chart types you can specify a marginal distribution for each axis of the chart. You can mix and match different types of marginals as you see fit. They are easily included by passing values to the arguments marginal_x and marginal_y.

fig = px.scatter(iris, x = 'sepal_length', y = 'sepal_width', height = 800, marginal_x = 'box', marginal_y = 'violin')
fig.show()

If we now apply the same trick we have employed several times and color the values by species, the marginal plots are split per category as well, which is pretty cool.

fig = px.scatter(iris, x = 'sepal_length', y = 'sepal_width', color = 'species', height = 800, marginal_x = 'box', marginal_y = 'violin')
fig.show()

We made it through the second part of the Plotly Fundamentals series. While we have only worked with a couple of examples in this chapter, I recommend checking out the Plotly chart gallery, which has many more awesome examples that are simple variations of the charts shown above. We will go through some of the more specialized chart types and additional features in the upcoming chapters, so stay tuned. Peace!

fistofgeek.com
