Data Visualization

What is Data Visualization

Visualizing data in easy-to-understand graphical formats is a practice as old as mathematics itself. We visualize data to explore, to summarize, to compare, to understand trends, and, above all, to tell a story.

Why is visualization so powerful?
Visualization is powerful because it taps into the 'pre-attentive attributes' that our brain processes extremely fast: properties the human brain takes in visually almost immediately, and patterns we detect without conscious thought.

Consider the picture below: our attention is instantly directed to the highlighted elements.

image.png

Source: https://help.tableau.com/current/blueprint/en-us/bp_why_visual_analytics.htm

Things to bear in mind
When using visualization to explore data, or to tell a story from data, we generally know intuitively what makes sense. This is because we live in a visual world, and we know from experience what works and what is less effective.

Yet it is worth keeping a few principles in mind:
- Know your audience: are they mathematically savvy, or lay people?
- Know your data: the kind of data you have determines the kind of visualization to use
- Consider the delivery mechanism: will the visualization be delivered over the web, in print, in a PowerPoint, or in an email?
- Select the right visuals: in addition to the type of chart or visualization, think about special effects from lines, markers, colors
- Use common visual cues in charts: use the same color schemes, similar visual pointers so that the audience is immediately oriented to the most important aspects
- Make your point stand out: is the graphic highlighting the point you are trying to make?
- Consider the ‘Data-Ink Ratio’: the ratio of ink used to present actual data to the total amount of ink used in the graphic (a small sketch follows this list)
- Be credible, avoid games: build trust in your work for your audience
- Consider repeatability: how difficult would it be for you to do the same work a month down the line?
- Avoid 3D, doughnuts, pie charts: they confuse and obfuscate, and do not impress an educated audience
- Finally, always label the axes!
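
The data-ink ratio point is easy to illustrate in code. Below is a minimal sketch (the dataset load is repeated so the snippet is self-contained): sns.despine() and tick_params() strip ink that carries no data.

# A minimal sketch of raising the data-ink ratio: the same barplot,
# with non-data ink (spines and tick marks) pared back
import seaborn as sns
import matplotlib.pyplot as plt

df = sns.load_dataset('diamonds')
fig, ax = plt.subplots(figsize = (8,4))
sns.countplot(x='color', data=df, ax=ax)
sns.despine(ax=ax)           # drop the top and right spines
ax.tick_params(length=0)     # drop the tick marks, keep the labels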

Key chart types
Visualization is a vast topic. It includes customized graphics, dashboards that combine text and multiple visualizations in one place, and interactive drill-downs. Yet the building blocks of all of these are a small set of basic chart types, which are what we will cover here. Dashboards are simply creative combinations of these, together with text and lists.

For our purposes, it will suffice if we look at the major types of charts in our toolkit, and how and when it is appropriate to use them.

We will cover:

  • Histograms
  • Barplots
  • Boxplots
  • Scatterplots
  • Lineplots
  • Pairplots
  • Heatmaps

Usual library imports

import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statsmodels.api as sm

Load data

As before, we will use the diamonds dataset from seaborn. This is a dataset of about 54,000 diamonds with their prices and other attributes such as carat, color, and clarity.

df = sns.load_dataset('diamonds')

Histograms and KDEs

A histogram is a visual representation of the distribution of continuous numerical data. It helps you judge the ‘shape’ of the distribution. A histogram is a fundamental tool for exploring data.

To construct a histogram, the first step is to "bin" (or "bucket") the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval.

The size of the bin interval matters a great deal, as we will see below.

The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins must be adjacent and are often (but not required to be) of equal size.
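
To make the binning step concrete, here is a small sketch using np.histogram, which performs essentially the same counting that the plotting functions below do for us:

# Bin the price range into 20 equal-width intervals and count
# how many diamonds fall into each
counts, edges = np.histogram(df['price'], bins=20)
print('First three bin edges :', edges[:3])
print('First three bin counts:', counts[:3])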

Histograms versus bar-charts: Histograms are different from bar-charts which are used for categorical data. You will notice that the bars for a histogram are right next to each other to indicate continuity, whereas for a bar-chart they are separated by a gap or white space, indicating that there is no continuity or order implied by the placement of the bars.

# Plot a histogram of (log) diamond prices

ax = sns.histplot(np.log(df['price']), bins = 20);

png

# Plot a histogram of diamond prices, with a count label on each bar
x = np.arange(0, 20000, 1000)

fig, ax = plt.subplots()
sns.histplot(df.price, element="bars", bins=x, stat='count', legend=False, ax=ax)
plt.xticks(x, rotation=90)

# Label each bar with its count
for container in ax.containers:
    ax.bar_label(container, fontsize=7, color='b')

ax.set_title('Count of diamonds by price')
plt.show()

png


# Plot a kernel density estimate (KDE) of the (log) diamond prices

sns.kdeplot(data=df, x=np.log(df['price']));

png

Histograms and distribution types

Histograms are extremely useful for understanding where the data lies when viewed as a distribution. They are the truest embodiment of the saying that a picture is worth a thousand words.

Next, let us look at some common distributions - the normal distribution, left- and right-skewed distributions, the bi-modal distribution, and the uniform distribution. They are constructed below using artificial random data, but the goal is to emphasize the shapes of distributions that you could observe with real data as well.

# Plot a histogram of 100000 normal random variables, split across 40 bins

sns.histplot(np.random.normal(size = 100000),  bins = 40 ); 

png

# Plot a right-skewed histogram of 100000 random variables,
# split across 40 bins (generated using the beta distribution)

sns.histplot(np.random.beta(a=0.3, b = 1,size = 100000),  bins = 40 );

png

# Plot a uniform distribution (Beta(1,1) is uniform on [0,1])

sns.histplot(np.random.beta(a=1, b = 1,size = 100000),  bins = 40 ); 

png

# Plot a left-skewed histogram (again using the beta distribution)

sns.histplot(np.random.beta(a=1, b = .4,size = 100000),  bins = 40 ); 

png

# Plot a bi-modal histogram.
# Notice the 'trick' used - we concatenate samples from two
# normal distributions with different means

list1 = list(np.random.normal(size = 100000))
list2 = list(np.random.normal(loc = 5, scale = 2, size = 100000))
list3 = list1 + list2
sns.histplot(list3,  bins = 40 ); 

png

# Another graphic where we get several peaks in the histogram

list1 = list(np.random.normal(size = 100000))
list2 = list(np.random.normal(loc = 5, scale = 1, size = 100000))
list3 = list(np.random.normal(loc = 9, scale = 1.2, size = 100000))
list4 = list1 + list2 + list3
sns.histplot(list4,  bins = 60 ); 

png

# Finally, we bring all the above different types of distributions
# together in a single graphic


fig, axes = plt.subplots(nrows = 2, ncols = 3, figsize = (21,9))
ax1, ax2, ax3, ax4, ax5, ax6 = axes.flatten()
ax1 = sns.histplot(np.random.normal(size = 100000),  bins = 40, ax = ax1 ) 
ax1.set_title('Normal')

ax2 = sns.histplot(np.random.beta(a=0.3, b = 1,size = 100000),  bins = 40, ax = ax2 )
ax2.set_title('Skewed Right')

ax3 = sns.histplot(np.random.beta(a=1, b = .4,size = 100000),  bins = 40, ax = ax3 ) 
ax3.set_title('Skewed Left')

list1 = list(np.random.normal(size = 100000))
list2 = list(np.random.normal(loc = 5, scale = 2, size = 100000))
list3 = list1 + list2
sns.histplot(list3,  bins = 40, ax = ax4 ) 
ax4.set_title('Bimodal')

list1 = list(np.random.normal(size = 100000))
list2 = list(np.random.normal(loc = 5, scale = 1, size = 100000))
list3 = list(np.random.normal(loc = 9, scale = 1.2, size = 100000))
list4 = list1 + list2 + list3
sns.histplot(list4,  bins = 60, ax = ax5 ) 
ax5.set_title('Multimodal')

sns.histplot(np.random.beta(a=1, b = 1,size = 100000),  bins = 40, ax = ax6 ) 
ax6.set_title('Uniform');


png

Bin Size

For histograms, bin size matters. Generally, the larger the bin size, the less information you will see. See examples below of the visualization of the same data but using different bin sizes.

Look at the two plots below: they represent the same data, but with vastly different bin intervals. They tell completely different stories!

Bin boundaries should align with how people would read the data.

list1 = list(np.random.normal(size = 100000))
list2 = list(np.random.normal(loc = 5, scale = 2, size = 100000))
list3 = list1 + list2
sns.histplot(list3,  bins = 50);

png

sns.histplot(list3,  bins = 5);

png

Kernel Density Estimates

Kernel density estimation is a way to estimate the probability density function (PDF) of a random variable.

One way to think about KDE plots is that they represent histograms with very small bin sizes, where the tops of the bars have been joined together with a line.

While this simple explanation suffices for most of us, there is a fair bit of mathematics at work behind KDE plots. Consider the diagram below. Each small black vertical line on the x-axis represents a data point. The individual kernels (Gaussians in this example, but others can be specified as well) are shown drawn in dashed red lines above each point.

The solid blue curve is created by summing the individual Gaussians and forms the overall density plot.

image.png

Source: https://towardsdatascience.com/histograms-and-density-plots-in-python-f6bda88f5ac0
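
To see the summing at work, here is a minimal sketch with made-up data points and a hand-picked bandwidth (both are assumptions for illustration):

# KDE by hand: place a Gaussian kernel on each data point,
# then sum the kernels to get the density estimate
points = np.array([2.0, 3.0, 3.5, 6.0, 7.5])   # toy data
h = 0.7                                        # bandwidth, chosen by hand
grid = np.linspace(-1, 11, 400)

def gauss(x, mu, h):
    return np.exp(-0.5 * ((x - mu) / h) ** 2) / (h * np.sqrt(2 * np.pi))

kernels = np.array([gauss(grid, p, h) for p in points]) / len(points)

for k in kernels:                              # individual kernels (dashed red)
    plt.plot(grid, k, 'r--', alpha = .5)
plt.plot(grid, kernels.sum(axis=0), 'b-');     # their sum is the KDE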

Next, let us create KDEs for our random data.

fig, axes = plt.subplots(nrows = 2, ncols = 3, figsize = (21,9))
ax1, ax2, ax3, ax4, ax5, ax6 = axes.flatten()
ax1 = sns.kdeplot(np.random.normal(size = 100000),  ax = ax1 ) 
ax1.set_title('Normal')

ax2 = sns.kdeplot(np.random.beta(a=0.3, b = 1,size = 100000),  ax = ax2 )
ax2.set_title('Skewed Right')

ax3 = sns.kdeplot(np.random.beta(a=1, b = .4,size = 100000),  ax = ax3 ) 
ax3.set_title('Skewed Left')

list1 = list(np.random.normal(size = 100000))
list2 = list(np.random.normal(loc = 5, scale = 2, size = 100000))
list3 = list1 + list2
sns.kdeplot(list3,  ax = ax4 ) 
ax4.set_title('Bimodal')

list1 = list(np.random.normal(size = 100000))
list2 = list(np.random.normal(loc = 5, scale = 1, size = 100000))
list3 = list(np.random.normal(loc = 9, scale = 1.2, size = 100000))
list4 = list1 + list2 + list3
sns.kdeplot(list4,   ax = ax5 ) 
ax5.set_title('Multimodal')

sns.kdeplot(np.random.beta(a=1, b = 1,size = 100000),   ax = ax6 ) 
ax6.set_title('Uniform');


png


Barplots

Barplots are easy to understand and require little explanation. They are also called bar graphs or column charts. They are used for categorical variables and show the frequency of observations for each category.

Consider the number of diamonds in each color category. We can display the data as a table, and then show the same information in a barplot.

Data Table

df = sns.load_dataset('diamonds')
df[['color']].value_counts().sort_index()
color
D         6775
E         9797
F         9542
G        11292
H         8304
I         5422
J         2808
Name: count, dtype: int64

Barplot

plt.figure(figsize = (14,5))
sns.countplot(x='color', data=df, order = np.sort(df['color'].unique()) );

png

Most software will allow you to combine several data elements together in a barplot. The chart below shows barplots for both cut and color.

plt.figure(figsize = (14,5))
sns.countplot(x='cut', data=df, hue='color').set_title('Cut vs color');

png

Stacked barplots

sns.histplot(data=df, hue="color", x="cut", shrink=.8, multiple = "stack");

png

# multiple="fill" normalizes each bar, so the segments show proportions
sns.histplot(data=df, hue="color", x="cut", shrink=.7, multiple = "fill");

png


Boxplots

Boxplots are useful tools to examine distributions visually. But they can be difficult to interpret for non-analytical or non-mathematical users.

plt.figure(figsize = (4,12))
ax = sns.boxplot(data = df, y = 'carat')
custom_ticks = np.linspace(0, 5, 51)   # a tick every 0.1 carat
ax.set_yticks(custom_ticks);


png

Interpreting the boxplot:

The boxplot above has a lot of lines and points. What do they mean? The graphic below describes how to interpret a boxplot.

image.png

Compare the above image with the quartile calculations below. You can see that the lines correspond to the actual values of the min, max, quartiles, and fences.


Q3 = df.carat.quantile(0.75)
Q1 = df.carat.quantile(0.25)
Median = df.carat.median()
Min = df.carat.min()
Max = df.carat.max()
print('Quartile 3 is = ', Q3)
print('Quartile 1 is = ', Q1)
print('Median is = ', Median)
print('Min for the data is = ', Min)
print('Max for the data is = ', Max)
print('IQR is = ', df.carat.quantile(0.75) - df.carat.quantile(0.25))
print('Q3 + 1.5*IQR = ', Q3 + (1.5* (Q3 - Q1)))
print('Q1 - 1.5*IQR = ', Q1 - (1.5* (Q3 - Q1)))
Quartile 3 is =  1.04
Quartile 1 is =  0.4
Median is =  0.7
Min for the data is =  0.2
Max for the data is =  5.01
IQR is =  0.64
Q3 + 1.5*IQR =  2.0
Q1 - 1.5*IQR =  -0.5599999999999999
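
The points drawn individually above the top whisker are the observations beyond Q3 + 1.5*IQR. A quick check, reusing Q3 and Q1 from the cell above:

# Count the observations the boxplot shows as individual outlier points
upper_fence = Q3 + 1.5 * (Q3 - Q1)
print('Diamonds above the upper fence:', (df.carat > upper_fence).sum())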

Boxplot, with another dimension added
Below is another example of a boxplot, but with a split dimension added for clarity.

plt.figure(figsize = (14,5))

sns.boxplot(data = df, x = 'clarity', y = 'carat');

png


Scatterplots

Unlike the previous chart types which focus on one variable, scatterplots allow us to examine the relationship between two variables.

At their core, they are just plots of (x, y) data points on a coordinate system.

Often a regression line is added to a scatterplot to summarize the overall trend in the data. Outlier points can be identified visually.

If there are too many data points, scatterplots have the disadvantage of overplotting.

Consider the scatterplot below, which plots a random set of 500 data points from the diamonds dataset. We picked only 500 points to avoid overplotting. It shows us the relationship between price and carat weight.

## Scatterplot

sns.set_style(style='white')
plt.figure(figsize = (14,5))
sns.scatterplot(data = df.sample(500), x = 'carat', y = 'price', 
                alpha = .8, edgecolor = 'None');

png

## Scatterplot

sns.set_style(style='white')
plt.figure(figsize = (14,5))
sns.scatterplot(data = df.sample(500), x = 'carat', y = 'price', 
                hue = 'cut', alpha = .8, edgecolor = 'None');

png

Additional dimensions through marker attributes

By creatively using hue, size and style, you can represent additional dimensions in a scatterplot.

sns.set_style(style='white')
plt.figure(figsize = (18,8))
sns.scatterplot(data = df.sample(500), x = 'carat', y = 'price', 
                hue = 'cut', size= 'carat', style = 'color',  alpha = .8, edgecolor = 'None');

png

Scatterplot - another example
We plot miles per gallon vs a car's weight (mtcars dataset)

plt.figure(figsize = (14,5))
mtcars = sm.datasets.get_rdataset('mtcars').data
sns.scatterplot(data = mtcars, x = 'mpg', y = 'wt', 
                hue = 'cyl', alpha = .8, edgecolor = 'None');

png

Plotting a regression line

Continuing the prior example of miles per gallon to weight, we add a regression line to the scatterplot.

Note that with Seaborn you can specify the confidence level around the line (ci) as well as the 'order' of the regression; the example below sets ci=99 and order=1, i.e., a straight line. We will learn about the order of the regression in the chapter on regression.

plt.figure(figsize = (14,5))
mtcars = sm.datasets.get_rdataset('mtcars').data
sns.regplot(data = mtcars, x = 'mpg', y = 'wt', ci = 99, order = 1);

png

An example of overplotting
We talked about overplotting earlier. Overplotting happens when too many data points are drawn in a small space: they overlap each other, and it becomes visually difficult to discern anything meaningful from the graphic.

Let us try to plot all ~54,000 diamonds together in a scatterplot (remember, we plotted only 500 earlier) and see what we get.

Obviously, this graphic is not very helpful as the datapoints are too crowded.

plt.figure(figsize = (8,8))
sns.scatterplot(data = df, x = 'carat', y = 'price', hue = 'cut', alpha = .8, edgecolor = 'None'); 

png
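
One common way to tame overplotting (besides sampling, as we did earlier) is to shrink and fade the markers so that density shows up as darker regions. A sketch:

# Smaller, semi-transparent markers make the dense regions readable
plt.figure(figsize = (8,8))
sns.scatterplot(data = df, x = 'carat', y = 'price',
                s = 5, alpha = .1, edgecolor = 'None');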

An example of a scatterplot with uncorrelated data
Here is a made-up example of a scatterplot constructed from randomly generated x and y variables.

Since there is no correlation, the data appears as a cloud with no specific trend. The goal of this graphic is just to demonstrate what uncorrelated data would look like.

plt.figure(figsize = (8,8))
sns.scatterplot(x=np.random.normal(size = 100), y=np.random.normal(size = 100));

png


Anscombe's Quartet

Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it, and the effect of outliers and other influential observations on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough."

Source: Wikipedia at https://en.wikipedia.org/wiki/Anscombe%27s_quartet

Anscombe's quartet is an example that illustrates how graphing the data can be a powerful tool providing insights that mere numbers cannot.

## Source: https://seaborn.pydata.org/examples/anscombes_quartet.html

anscombe = sns.load_dataset("anscombe")
anscombe
dataset x y
0 I 10.0 8.04
1 I 8.0 6.95
2 I 13.0 7.58
3 I 9.0 8.81
4 I 11.0 8.33
5 I 14.0 9.96
6 I 6.0 7.24
7 I 4.0 4.26
8 I 12.0 10.84
9 I 7.0 4.82
10 I 5.0 5.68
11 II 10.0 9.14
12 II 8.0 8.14
13 II 13.0 8.74
14 II 9.0 8.77
15 II 11.0 9.26
16 II 14.0 8.10
17 II 6.0 6.13
18 II 4.0 3.10
19 II 12.0 9.13
20 II 7.0 7.26
21 II 5.0 4.74
22 III 10.0 7.46
23 III 8.0 6.77
24 III 13.0 12.74
25 III 9.0 7.11
26 III 11.0 7.81
27 III 14.0 8.84
28 III 6.0 6.08
29 III 4.0 5.39
30 III 12.0 8.15
31 III 7.0 6.42
32 III 5.0 5.73
33 IV 8.0 6.58
34 IV 8.0 5.76
35 IV 8.0 7.71
36 IV 8.0 8.84
37 IV 8.0 8.47
38 IV 8.0 7.04
39 IV 8.0 5.25
40 IV 19.0 12.50
41 IV 8.0 5.56
42 IV 8.0 7.91
43 IV 8.0 6.89
# Print datasets for slides

print('Dataset 1:')
print('x:', list(anscombe[anscombe['dataset']=='I'].x))
print('y:', list(anscombe[anscombe['dataset']=='I'].y),'\n')
print('Dataset 2:')
print('x:', list(anscombe[anscombe['dataset']=='II'].x))
print('y:', list(anscombe[anscombe['dataset']=='II'].y),'\n')
print('Dataset 3:')
print('x:', list(anscombe[anscombe['dataset']=='III'].x))
print('y:', list(anscombe[anscombe['dataset']=='III'].y),'\n')
print('Dataset 4:')
print('x:', list(anscombe[anscombe['dataset']=='IV'].x))
print('y:', list(anscombe[anscombe['dataset']=='IV'].y),'\n')
Dataset 1:
x: [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y: [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

Dataset 2:
x: [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y: [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]

Dataset 3:
x: [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y: [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

Dataset 4:
x: [8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0]
y: [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89]
# The x-y correlation is nearly identical (~0.816) in all four datasets
print(np.corrcoef(anscombe[anscombe['dataset']=='I'].x,anscombe[anscombe['dataset']=='I'].y))
print(np.corrcoef(anscombe[anscombe['dataset']=='II'].x,anscombe[anscombe['dataset']=='II'].y))
print(np.corrcoef(anscombe[anscombe['dataset']=='III'].x,anscombe[anscombe['dataset']=='III'].y))
print(np.corrcoef(anscombe[anscombe['dataset']=='IV'].x,anscombe[anscombe['dataset']=='IV'].y))
[[1.         0.81642052]
 [0.81642052 1.        ]]
[[1.         0.81623651]
 [0.81623651 1.        ]]
[[1.         0.81628674]
 [0.81628674 1.        ]]
[[1.         0.81652144]
 [0.81652144 1.        ]]
# We rearrange the data so as to put the four datasets in the quartet side-by-side

pd.concat([anscombe.query("dataset=='I'").reset_index(), 
           anscombe.query("dataset=='II'").reset_index(), 
           anscombe.query("dataset=='III'").reset_index(), 
           anscombe.query("dataset=='IV'").reset_index()],axis=1)

index dataset x y index dataset x y index dataset x y index dataset x y
0 0 I 10.0 8.04 11 II 10.0 9.14 22 III 10.0 7.46 33 IV 8.0 6.58
1 1 I 8.0 6.95 12 II 8.0 8.14 23 III 8.0 6.77 34 IV 8.0 5.76
2 2 I 13.0 7.58 13 II 13.0 8.74 24 III 13.0 12.74 35 IV 8.0 7.71
3 3 I 9.0 8.81 14 II 9.0 8.77 25 III 9.0 7.11 36 IV 8.0 8.84
4 4 I 11.0 8.33 15 II 11.0 9.26 26 III 11.0 7.81 37 IV 8.0 8.47
5 5 I 14.0 9.96 16 II 14.0 8.10 27 III 14.0 8.84 38 IV 8.0 7.04
6 6 I 6.0 7.24 17 II 6.0 6.13 28 III 6.0 6.08 39 IV 8.0 5.25
7 7 I 4.0 4.26 18 II 4.0 3.10 29 III 4.0 5.39 40 IV 19.0 12.50
8 8 I 12.0 10.84 19 II 12.0 9.13 30 III 12.0 8.15 41 IV 8.0 5.56
9 9 I 7.0 4.82 20 II 7.0 7.26 31 III 7.0 6.42 42 IV 8.0 7.91
10 10 I 5.0 5.68 21 II 5.0 4.74 32 III 5.0 5.73 43 IV 8.0 6.89
# Next, we calculate the descriptive stats and 
# find that these are nearly identical for the four datasets.

pd.concat([anscombe.query("dataset=='I'")[['x','y']].reset_index(drop=True), 
           anscombe.query("dataset=='II'")[['x','y']].reset_index(drop=True), 
           anscombe.query("dataset=='III'")[['x','y']].reset_index(drop=True), 
           anscombe.query("dataset=='IV'")[['x','y']].reset_index(drop=True)],axis=1).describe()
x (I) y (I) x (II) y (II) x (III) y (III) x (IV) y (IV)
count 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000 11.000000
mean 9.000000 7.500909 9.000000 7.500909 9.000000 7.500000 9.000000 7.500909
std 3.316625 2.031568 3.316625 2.031657 3.316625 2.030424 3.316625 2.030579
min 4.000000 4.260000 4.000000 3.100000 4.000000 5.390000 8.000000 5.250000
25% 6.500000 6.315000 6.500000 6.695000 6.500000 6.250000 8.000000 6.170000
50% 9.000000 7.580000 9.000000 8.140000 9.000000 7.110000 8.000000 7.040000
75% 11.500000 8.570000 11.500000 8.950000 11.500000 7.980000 8.000000 8.190000
max 14.000000 10.840000 14.000000 9.260000 14.000000 12.740000 19.000000 12.500000
# But when we plot the 4 datasets, we find a completely different picture
# that we as humans find extremely easy to interpret, but wasn't visible
# through the descriptive stats.

sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=anscombe,
           col_wrap=2, ci=None, palette="muted", height=5,
           scatter_kws={"s": 50, "alpha": 1}, line_kws={"lw":1,"alpha": .5, "color":"black"})
plt.rcParams['font.size'] = 14;

png

The datasaurus dataset

The data sets were created by Justin Matejka and George Fitzmaurice (see https://www.autodesk.com/research/publications/same-stats-different-graphs), inspired by the datasaurus set from Alberto Cairo (see http://www.thefunctionalart.com/2016/08/download-datasaurus-never-trust-summary.html).

Downloaded from https://www.openintro.org/data/index.php?data=datasaurus

# Load data
datasaurus = pd.read_csv('datasaurus.csv')
datasaurus
dataset x y
0 dino 55.384600 97.179500
1 dino 51.538500 96.025600
2 dino 46.153800 94.487200
3 dino 42.820500 91.410300
4 dino 40.769200 88.333300
... ... ... ...
1841 wide_lines 33.674442 26.090490
1842 wide_lines 75.627255 37.128752
1843 wide_lines 40.610125 89.136240
1844 wide_lines 39.114366 96.481751
1845 wide_lines 34.583829 89.588902

1846 rows × 3 columns

# Sanity check: 1,846 rows / 13 datasets = 142 points per dataset
1846/13
142.0
# Plot the data

sns.lmplot(x="x", y="y", col="dataset", hue="dataset", data=datasaurus,
           col_wrap=6, ci=None, palette="muted", height=5,
           scatter_kws={"s": 50, "alpha": 1}, line_kws={"lw":1,"alpha": .5, "color":"black"})
plt.rcParams['font.size'] = 14;

png


Line Charts

Lineplots are a basic chart type in which data points are joined by line segments from left to right. A lineplot needs two variables: both x and y have to be specified.

Generally, the variable x will need to be sorted before plotting (else you could end up with a jumbled line).

Let us use GDP data from the dataset `macrodata`.

# Let us load 'macrodata'

df = sm.datasets.macrodata.load_pandas()['data']
df
year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint
0 1959.0 1.0 2710.349 1707.4 286.898 470.045 1886.9 28.980 139.7 2.82 5.8 177.146 0.00 0.00
1 1959.0 2.0 2778.801 1733.7 310.859 481.301 1919.7 29.150 141.7 3.08 5.1 177.830 2.34 0.74
2 1959.0 3.0 2775.488 1751.8 289.226 491.260 1916.4 29.350 140.5 3.82 5.3 178.657 2.74 1.09
3 1959.0 4.0 2785.204 1753.7 299.356 484.052 1931.3 29.370 140.0 4.33 5.6 179.386 0.27 4.06
4 1960.0 1.0 2847.699 1770.5 331.722 462.199 1955.5 29.540 139.6 3.50 5.2 180.007 2.31 1.19
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
198 2008.0 3.0 13324.600 9267.7 1990.693 991.551 9838.3 216.889 1474.7 1.17 6.0 305.270 -3.16 4.33
199 2008.0 4.0 13141.920 9195.3 1857.661 1007.273 9920.4 212.174 1576.5 0.12 6.9 305.952 -8.79 8.91
200 2009.0 1.0 12925.410 9209.2 1558.494 996.287 9926.4 212.671 1592.8 0.22 8.1 306.547 0.94 -0.71
201 2009.0 2.0 12901.504 9189.0 1456.678 1023.528 10077.5 214.469 1653.6 0.18 9.2 307.226 3.37 -3.19
202 2009.0 3.0 12990.341 9256.0 1486.398 1044.088 10040.6 216.385 1673.9 0.12 9.6 308.013 3.56 -3.44

203 rows × 14 columns

# Next, we graph the data.
# drop_duplicates('year', keep='last') keeps the last (Q4) row for each year

df = sm.datasets.macrodata.load_pandas()['data']
plt.figure(figsize = (14,5))
sns.lineplot(data = df.drop_duplicates('year', keep = 'last'), x = 'year', y = 'realgdp');

png

Common lineplot mistakes

Make sure the data is correctly sorted; otherwise you get a jumbled line that means nothing.

# Here, we jumble the data first before plotting.  And we get an incoherent graphic

plt.figure(figsize = (14,5))
sns.lineplot(data = df.drop_duplicates('year', keep = 'last').sample(frac=1), 
             sort= False, x = 'year', y = 'realgdp');

png
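
The fix is simply to sort by x before plotting (seaborn's default sort=True does this for you). A sketch using the same shuffled data:

# Sorting by year restores a coherent line, even from shuffled rows
annual = df.drop_duplicates('year', keep = 'last').sample(frac=1).sort_values('year')
plt.figure(figsize = (14,5))
sns.lineplot(data = annual, sort = False, x = 'year', y = 'realgdp');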

# Let us see a sample of the jumbled/unordered data

df.drop_duplicates('year', keep = 'last').sample(frac=1).head(3)
year quarter realgdp realcons realinv realgovt realdpi cpi m1 tbilrate unemp pop infl realint
27 1965.0 4.0 3724.014 2314.3 446.493 544.121 2594.1 31.88 169.1 4.35 4.1 195.539 2.90 1.46
51 1971.0 4.0 4446.264 2897.8 524.085 516.140 3294.2 41.20 230.1 3.87 6.0 208.917 2.92 0.95
107 1985.0 4.0 6955.918 4600.9 969.434 732.026 5193.9 109.90 621.4 7.14 7.0 239.638 5.13 2.01
# Next, we look at using lineplots for categorical data, 
# which is not a good idea!
#
# let us load the planets dataset.
# It lists over a thousand exoplanets, and how each was discovered.
df = sns.load_dataset('planets')
df
method number orbital_period mass distance year
0 Radial Velocity 1 269.300000 7.10 77.40 2006
1 Radial Velocity 1 874.774000 2.21 56.95 2008
2 Radial Velocity 1 763.000000 2.60 19.84 2011
3 Radial Velocity 1 326.030000 19.40 110.62 2007
4 Radial Velocity 1 516.220000 10.50 119.47 2009
... ... ... ... ... ... ...
1030 Transit 1 3.941507 NaN 172.00 2006
1031 Transit 1 2.615864 NaN 148.00 2007
1032 Transit 1 3.191524 NaN 174.00 2007
1033 Transit 1 4.125083 NaN 293.00 2008
1034 Transit 1 4.187757 NaN 260.00 2008

1035 rows × 6 columns

# let us look at the data on which methods were used to find planets

df.method.value_counts()
method
Radial Velocity                  553
Transit                          397
Imaging                           38
Microlensing                      23
Eclipse Timing Variations          9
Pulsar Timing                      5
Transit Timing Variations          4
Orbital Brightness Modulation      3
Astrometry                         2
Pulsation Timing Variations        1
Name: count, dtype: int64
# A bad example: a lineplot drawn over categorical data
# implies an order and continuity that do not exist

planet = df.method.value_counts()
plt.figure(figsize = (6,5))
sns.lineplot(data = planet, x = np.asarray(planet.index), y = planet.values)
plt.xticks(rotation=90)
plt.title('Exoplanets - Methods of Discovery', fontsize = 14)
plt.ylabel('Number of planets')
plt.xlabel('Method used for discovering planet');

png

# Another bad example: the same lineplot after shuffling the
# category order - the apparent 'trend' changes completely

planet = planet.sample(df.method.nunique())
plt.figure(figsize = (6,5))
sns.lineplot(data = planet, x = np.asarray(planet.index), y = planet.values)
plt.xticks(rotation = 90)
plt.title('Exoplanets - Methods of Discovery', fontsize = 14)
plt.ylabel('Number of planets')
plt.xlabel('Method used for discovering planet');

png

# The correct way
planet = df.method.value_counts()
sns.barplot(x=planet.index, y=planet.values)
plt.xticks(rotation=90)
plt.title('Exoplanets - Methods of Discovery', fontsize = 14)
plt.ylabel('Number of planets')
plt.xlabel('Method used for discovering planet');

png


Heatmaps

We have all seen heatmaps: they are great at focusing our attention on the observations at the extremes, the ones most different from the rest.

Heatmaps take three variables - two discrete variables for the axes, and one variable whose value is plotted.

A heatmap provides a grid-like visual where the intersection of the two axes is colored according to the value of the variable.

Let us consider the flights dataset, which is a monthly time series by year and month of the number of air passengers. Below is a heatmap of the data, with month and year as the axes, and the number of air passengers providing the input for the heatmap color.

# pandas version check (DataFrame.pivot below uses keyword arguments)
pd.__version__
'2.0.3'
flights_long = sns.load_dataset("flights")
flights = flights_long.pivot(index = "month", columns = "year", values= "passengers")
plt.figure(figsize = (8,8))
sns.heatmap(flights, annot=True, fmt="d");

png

Heatmaps for correlations
Because correlations vary between -1 and +1, heatmaps provide a consistent way to visualize and present correlation information.

Combined with the flexibility pandas allows for creating a correlation matrix, correlation heatmaps are easy to build.

## Let us look at correlations in the diamonds dataset

df = sns.load_dataset('diamonds')
df.corr(numeric_only=True)
carat depth table price x y z
carat 1.000000 0.028224 0.181618 0.921591 0.975094 0.951722 0.953387
depth 0.028224 1.000000 -0.295779 -0.010647 -0.025289 -0.029341 0.094924
table 0.181618 -0.295779 1.000000 0.127134 0.195344 0.183760 0.150929
price 0.921591 -0.010647 0.127134 1.000000 0.884435 0.865421 0.861249
x 0.975094 -0.025289 0.195344 0.884435 1.000000 0.974701 0.970772
y 0.951722 -0.029341 0.183760 0.865421 0.974701 1.000000 0.952006
z 0.953387 0.094924 0.150929 0.861249 0.970772 0.952006 1.000000
# Next, let us create a heatmap to see where the
# high positive and high negative correlations lie

sns.heatmap(data = df.corr(numeric_only=True), annot= True, cmap='coolwarm');

png


Pairplots

Pairplots are a great way to visualize multiple variables at the same time in a single graphic where the axes are shared.

The relationship between every pair of variables is shown as a scatterplot, and the distribution of each variable appears on the diagonal.

Let us consider the diamonds dataset again. Below is a pairplot based on table, price, x and y variables.

sns.pairplot(data = df[['table', 'price', 'x', 'y', 'color']].sample(100));

png

It is possible to add additional dimensions to color the points plotted (just as we could with scatterplots).

The next graphic shows the same plot as above, but with ‘hue’ set to the diamond’s color. Note that the univariate diagonal has changed from histograms to KDEs.

sns.pairplot(data = df[['table', 'price', 'x', 'y', 'color']].sample(100), hue = 'color');

png

Lying with Graphs

Source: https://www.nationalgeographic.com/science/article/150619-data-points-five-ways-to-lie-with-charts

It is extremely easy to manipulate visualizations to present a story that doesn't exist, or is plainly wrong.

Consider the graphs below. What is wrong with them?

image.png

image.png

image.png

image.png

image.png

image.png

In the chart on the left, the percentage of Christians is the largest value, yet more green appears for Muslims because of the 3-D perspective effect.

Discuss: What story is the chart below trying to tell? What is wrong with the chart?

Source: CB Insights Newsletter dated 9/9/2021

image.png


VISUALIZATION NOTEBOOK ENDS HERE


Below are graphics used to show activation functions for a future chapter

Visualizing Activation Functions

def sigmoid(x):
    # Sigmoid: squashes inputs into the range (0, 1)
    result = 1/(1+np.exp(-x))
    df = pd.DataFrame(data=result, index = x, columns=['Sigmoid'])
    return df

def tanh(x):
    # Hyperbolic tangent: squashes inputs into the range (-1, 1)
    result = (np.exp(x) - np.exp(-x))/(np.exp(x) + np.exp(-x))
    df = pd.DataFrame(data=result, index = x, columns=['TANH'])
    return df


def relu(x):
    # ReLU: zero for negative inputs, identity for positive inputs
    result = np.maximum(0,x)
    df = pd.DataFrame(data=result, index = x, columns=['RELU'])
    return df

def leakyrelu(x):
    # Leaky ReLU: a small slope (0.01) for negative inputs instead of zero
    result = np.where(x < 0, 0.01 * x, x)
    df = pd.DataFrame(data=result, index = x, columns=['Leaky RELU'])
    return df
relu(np.arange(-12,12,.1)).plot()
plt.hlines(0, xmin = -12, xmax = 12, color='black', alpha=.2);

png

sigmoid(np.arange(-12,12,.1)).plot()
plt.hlines(0, xmin = -12, xmax = 12, color='black', alpha=.2);

png

leakyrelu(np.arange(-45,6,.1)).plot()
plt.hlines(0, xmin = -45, xmax = 6, color='black', alpha=.2);

png

tanh(np.arange(-12,12,.1)).plot()
plt.hlines(0, xmin = -12, xmax = 12, color='black', alpha=.2);

png