Exploratory Data Analysis
What is EDA?
EDA is the unstructured process of probing data we have not seen before in order to understand it better, to think about how we might use it, and to discover what insights it reveals at first glance.
Sometimes we need to analyze data with no particular objective in mind other than to find out whether it could be useful for anything at all.
Consider a situation where your manager points you to some data and asks you to analyze it. The data could be in a Google Drive, a GitHub repo, or on a thumb drive. It may have been received from a client, a customer, or a vendor. You may have a high-level pointer to what the data is; for example, you may know it is order history data, invoice data, or web log data. The ask may not be very specific, and the goal may not be clearly defined, but you would still like to examine the data to see whether there is something useful you can do with it.
In other situations, we are looking for something specific and need to find the right data to analyze. For example, we may be trying to identify the zip codes in which to market our product. We may be able to get data that provides information on income, consumption, population characteristics, etc. that could help us with our task. When we receive such data, we would like to find out whether it is fit for purpose.
Inquiries to conduct
So when you get data that you do not know much about in advance, you start with exploratory data analysis, or EDA. Possible inquiries you might like to conduct are:
- How much data do we have - number of rows in the data?
- How many columns, or fields do we have in the dataset?
- Data types - which of the columns appear to be numeric, dates or strings?
- Names of the columns, and do they tell us anything?
- A visual review of a sample of the dataset
- Completeness of the dataset, are missing values obvious? Columns that are largely empty?
- Unique values for columns that appear to be categorical, and how many observations of each category?
- For numeric columns, the range of values (calculated from min and max values)
- Distributions for the different columns, possibly graphed
- Correlations between the different columns
Exploratory Data Analysis (EDA) is generally the first activity performed to get a high-level understanding of new data. It employs a variety of graphical and summarization techniques to get a 'sense of the data'.
The purpose of EDA is to interrogate the data in an open-minded way: to understand its structure, uncover prominent themes, identify important variables, detect obvious anomalies, consider missing values, review data types, get a visual feel for the distributions, understand correlations between variables, and so on. Not all of these things can be discovered during EDA, but they are generally what we look for when performing it.
EDA is unstructured exploration; there is no defined set of activities you must perform. Generally, you probe the data and, depending upon what you discover, you ask more questions.
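Most of the inquiries listed above map directly to one-line pandas calls. Below is a minimal sketch of such a first pass; it assumes the diamonds dataset (loaded from Seaborn and used again later in this chapter), but any DataFrame would work, and each line is intended to be run in its own cell or wrapped in print().
# A first-pass sketch: each call answers one of the questions listed above
import seaborn as sns
df = sns.load_dataset('diamonds')
df.shape                      # how many rows and columns
df.dtypes                     # which columns are numeric, categorical or strings
df.columns                    # column names, and whether they tell us anything
df.head()                     # a visual review of a sample
df.isna().sum()               # obvious missing values, column by column
df.nunique()                  # unique values per column (candidate categoricals)
df.describe()                 # ranges and distributions of the numeric columns
df.corr(numeric_only=True)    # correlations between the numeric columns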
Introduction to Arrays
Arrays, or collections of numbers, are fundamental to analytics at scale. We will cover arrays exclusively through a NumPy lens, given how much NumPy dominates array-based manipulation in Python.
NumPy is the underlying library for manipulating arrays in Python, and arrays are really important for analytics. The main reason is that many analytical algorithms accept only arrays as input; deep learning networks in particular take only arrays (called tensors in the deep learning world). Beyond this practical issue, data is much easier to manipulate, transform, and perform mathematical operations on when it is expressed as an array.
NumPy underpins pandas as well as many other libraries. We may not use it directly a great deal, but there will be situations where NumPy is unavoidable.
Below is a high level overview of what arrays are, and some basic array operations.
Multi-dimensional data
Arrays have structure in the form of dimensions, and numbers sit at the intersection of these dimensions. In a spreadsheet, you see two dimensions - the rows, represented as 1, 2, 3..., and the columns, represented as A, B, C. NumPy arrays can have any number of dimensions, even though dimensions beyond the third are humanly impossible to visualize.
When a NumPy array is printed in Python, the data for each dimension is enclosed in square brackets. The fundamental unit of an array of any size is a single one-dimensional row in which numbers are separated by commas and enclosed in a set of square brackets, for example [1, 2, 3, 1]. Several of these are then arranged within additional nested square brackets to make up the complete array. To understand the idea of an array, mentally visualize a 2-dimensional array similar to a spreadsheet. Every number within the array exists at the intersection of all of its dimensions. In Python, each position along a dimension, more commonly called an axis, is numbered starting from 0. These positions are called indexes.
The number of opening square brackets [ gives the number of dimensions in the array. Only two dimensions - the rows and columns, like a 2-D matrix - can be displayed on the screen, so all other dimensions appear as repeats of rows and columns; look at the examples that follow. The last two dimensions, e.g. the 3 and 4 in a (2, 3, 4) array, represent the rows and columns, and the leading 2 means there are two sets of these rows and columns in the array!
Creating arrays with Numpy
Nearly everything NumPy produces is an array, just as nearly everything a pandas function returns is a DataFrame. The easiest way to generate a random array is np.random.randn(2,3), which gives an array with dimensions 2 x 3 - you can pick any other dimensions too. randn draws random numbers from the standard normal distribution.
# import some libraries
import pandas as pd
import os
import random
import numpy as np
import scipy
import math
import joblib
# Create a one dimensional array
np.random.randn(4)
array([ 1.1736499 , 1.54772703, -0.21693701, 0.31459622])
# Create a 2-dimensional array with random normal variables
# np.random.seed(123)
np.random.randn(2,3)
array([[-0.4905774 , -1.47052507, -1.04379812],
[-0.20386335, 0.56686123, 1.16730192]])
# Create a 3-dimensional array with random integers
x = np.random.randint(low = 1, high = 5, size = (2,3,4))
print('Shape: ', x.shape)
x
Shape: (2, 3, 4)
array([[[3, 3, 2, 1],
[2, 1, 1, 2],
[1, 1, 1, 3]],
[[1, 4, 1, 1],
[3, 2, 1, 1],
[2, 4, 2, 3]]])
NumPy axis numbers run from left to right, starting with index 0. So x.shape gives (2, 3, 4), which means the 2 is the 0th axis, the 3 rows are the 1st axis, and the 4 columns are the 2nd axis.
The shape of the above array is (2, 3, 4):
- axis = 0 refers to the 2 (the outermost dimension)
- axis = 1 refers to the 3 (the rows)
- axis = 2 refers to the 4 (the columns)
# Create a 3-dimensional array
data = np.random.randn(2, 3, 4)
print('The shape of the array is:', data.shape)
data
The shape of the array is: (2, 3, 4)
array([[[-0.94207619, 0.85819949, 0.85937811, 0.03423557],
[ 0.43471567, -0.3977568 , -0.38560239, 1.37103135],
[-0.8236544 , -0.75445943, 0.34979668, 0.46855885]],
[[ 1.12936861, -0.44238069, 0.96649123, -1.36034059],
[ 0.64099078, 1.41112827, -0.58302938, 0.0526134 ],
[ 1.6253795 , 0.47798241, 0.53996765, -0.77834533]]])
The number of opening [ brackets gives the number of dimensions in the array. Two dimensions are shown directly on screen - the rows and columns - and all others appear as repeated blocks. The last two dimensions, here 3 and 4, represent rows and columns; the first one, 2, means there are two sets of these rows and columns in the array.
np.random.randn(4, 3, 2)
array([[[-1.84207847, -0.22688959],
[ 1.40275113, 0.74415778],
[-0.15042182, -0.75451819]],
[[ 0.58001497, 0.9170983 ],
[ 0.253829 , 1.08733683],
[-0.6430149 , 2.01905416]],
[[ 0.15379162, -0.07853098],
[-0.85224692, 0.25954211],
[ 0.0392591 , 0.29043794]],
[[-0.85687338, -0.90593571],
[-1.28917985, 0.25920641],
[ 1.48624977, -0.27429377]]])
# Now let us add another dimension, this time with random integers rather than random normal values.
# The random integer function (randint) requires specifying low and high for the uniform distribution.
data = np.random.randint(low = 1, high = 100, size = (2,3,2,4))
data
array([[[[50, 35, 31, 23],
[67, 69, 34, 61]],
[[70, 83, 55, 30],
[47, 76, 54, 61]],
[[73, 14, 87, 13],
[ 8, 95, 6, 13]]],
[[[73, 33, 25, 83],
[48, 73, 44, 83]],
[[54, 24, 72, 3],
[ 1, 91, 30, 60]],
[[62, 39, 74, 77],
[14, 33, 8, 83]]]])
So there is a collection of 2-row x 4-column matrices, repeated 3 times, and that entire set repeated another 2 times. The 4 opening brackets in [[[[ mean there are 4 dimensions to the array.
type(data)
numpy.ndarray
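As a quick check of the bracket-counting rule, we can confirm the structure of the 4-dimensional data array created above; this is a small illustrative sketch, assuming data still holds the (2, 3, 2, 4) array.
# data was created above with shape (2, 3, 2, 4)
print(data.ndim)         # 4 dimensions, matching the four opening brackets
print(data.shape)        # (2, 3, 2, 4)
print(data[0].shape)     # (3, 2, 4): one of the 2 outermost blocks
print(data[0, 0].shape)  # (2, 4): one of the 3 row-and-column sets inside it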
# Converting a list to an array
list1 = list(range(12))
list1
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
array1 = np.array(list1)
array1
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
# This array1 is one dimensional, let us convert to a 3x4 array.
array1.shape = (3,4)
array1
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
# Create arrays of zeros
array1 = np.zeros((2,3)) # The dimensions must be a tuple inside the brackets
array1
array([[0., 0., 0.],
[0., 0., 0.]])
# Create arrays from a range
array1 = np.arange((12))
array1
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11])
#You can reshape the dimensions of an array
array1.reshape(3,4)
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
array1.reshape(3,2,2)
array([[[ 0, 1],
[ 2, 3]],
[[ 4, 5],
[ 6, 7]],
[[ 8, 9],
[10, 11]]])
# Create an array of 1's
array1 = np.ones((3,5))
array1
array([[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.],
[1., 1., 1., 1., 1.]])
# Creates the identity matrix
array1 = np.eye(4)
array1
array([[1., 0., 0., 0.],
[0., 1., 0., 0.],
[0., 0., 1., 0.],
[0., 0., 0., 1.]])
# Create an empty array - useful if you need a placeholder for data that will be generated later in the code.
# The values are uninitialized (whatever happened to be in memory), so do not rely on them
np.empty([2,3])
array([[6.23042070e-307, 4.67296746e-307, 1.69121096e-306],
[9.34609111e-307, 1.42413555e-306, 1.78019082e-306]])
Summarizing data along an axis
Passing the axis = n argument to a summarization function (e.g. sum) makes axis n disappear: it is summarized into the function's result, leaving only the remaining dimensions. So np.sum(array_name, axis = n), and similarly mean(), min(), median(), std(), etc., collapse all the elements along the selected axis into one value by performing that operation. See below using the sum function.
x = np.random.randint(low = 1, high = 100, size = (2,3))
x
array([[17, 58, 47],
[63, 98, 94]])
# So with axis = 0, the very first dimension, ie the 2 rows, will collapse leaving an array of shape (3,)
x.sum(axis = 0)
array([ 80, 156, 141])
# With axis = 1, the second dimension, ie the 3 columns, will collapse leaving an array of shape (2,)
x.sum(axis = 1)
array([122, 255])
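The same idea extends to more than two dimensions. Here is a small sketch with a 3-dimensional array (the variable name y is just for illustration):
# Summing along an axis removes that axis from the shape
y = np.random.randint(low = 1, high = 10, size = (2, 3, 4))
print(y.shape)                 # (2, 3, 4)
print(y.sum(axis = 0).shape)   # (3, 4): the axis of length 2 has collapsed
print(y.sum(axis = 2).shape)   # (2, 3): the axis of length 4 has collapsed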
Subsetting arrays ('slices')
Python numbers things starting with zero, which means the first item is the 0th item.
The portion of the dimension you wish to select is given in the form start:finish, where the start element is included but the finish element is excluded. So 1:3 means include positions 1 and 2 but not 3. A bare : means include everything.
array1 = np.random.randint(0, 100, (3,5))
array1
array([[46, 19, 51, 42, 76],
[80, 27, 40, 28, 81],
[34, 37, 87, 93, 97]])
array1[0:2, 0:2]
array([[46, 19],
[80, 27]])
array1[:,0:2] # ':' means include everything
array([[46, 19],
[80, 27],
[34, 37]])
array1[0:2]
array([[46, 19, 51, 42, 76],
[80, 27, 40, 28, 81]])
# Slices are references to the original array. So if you need a copy, use the below:
array1[0:2].copy()
array([[46, 19, 51, 42, 76],
[80, 27, 40, 28, 81]])
Generally, use the above 'long form' way of slicing, where you specify the indices for each dimension and use : wherever everything along a dimension is to be included. There are other short-cut methods of slicing, but we can set those aside for now.
Imagine an array a1 with dimensions (3, 5, 2, 4). This means:
- The array contains 3 arrays, each with dimensions (5, 2, 4).
- Each of these 3 arrays contains 5 further arrays of dimension (2, 4). (So there are 3*5 = 15 of these 2x4 arrays.)
- Each of these (2, 4) arrays contains 2 one-dimensional arrays of 4 columns each.
If only part of the index is specified, e.g. a1[0], the indices apply to the axes from the left of (3, 5, 2, 4). So a1[0] means: give me the first of the 3 arrays of size (5, 2, 4).
If the slice notation says a1[0,1], it means the 0th element along the first axis and the 1st element along the second axis.
Check it out using the following code:
a1 = np.random.randint(0, 100, (3,4,2,5))
a1
array([[[[59, 41, 61, 8, 39],
[73, 32, 61, 51, 6]],
[[69, 3, 25, 8, 46],
[67, 65, 13, 83, 88]],
[[79, 17, 61, 24, 86],
[97, 47, 49, 53, 55]],
[[77, 52, 43, 40, 74],
[51, 39, 97, 66, 19]]],
[[[54, 88, 81, 40, 95],
[74, 61, 27, 53, 92]],
[[ 9, 57, 21, 87, 73],
[99, 6, 77, 63, 76]],
[[73, 31, 94, 85, 65],
[95, 78, 27, 83, 44]],
[[75, 63, 71, 49, 43],
[54, 4, 93, 75, 70]]],
[[[87, 0, 13, 69, 0],
[81, 13, 88, 24, 36]],
[[21, 19, 30, 32, 55],
[40, 21, 74, 89, 68]],
[[80, 34, 75, 13, 9],
[63, 19, 73, 12, 47]],
[[27, 29, 45, 65, 43],
[83, 21, 11, 45, 6]]]])
a1[0].shape
(4, 2, 5)
a1[0]
array([[[59, 41, 61, 8, 39],
[73, 32, 61, 51, 6]],
[[69, 3, 25, 8, 46],
[67, 65, 13, 83, 88]],
[[79, 17, 61, 24, 86],
[97, 47, 49, 53, 55]],
[[77, 52, 43, 40, 74],
[51, 39, 97, 66, 19]]])
a1[0,1]
array([[69, 3, 25, 8, 46],
[67, 65, 13, 83, 88]])
More slicing: Picking selected rows or columns
a1 = np.random.randint(0, 100, (8,9))
a1
array([[44, 78, 29, 91, 82, 86, 76, 3, 90],
[69, 96, 29, 79, 25, 47, 95, 87, 85],
[79, 42, 99, 88, 14, 38, 47, 62, 41],
[39, 98, 27, 95, 65, 2, 59, 72, 16],
[46, 44, 55, 65, 32, 5, 79, 40, 65],
[38, 76, 78, 8, 76, 35, 27, 32, 51],
[58, 75, 3, 99, 23, 73, 77, 12, 39],
[66, 43, 58, 35, 33, 85, 75, 8, 10]])
# Select the first row
a1[0]
array([44, 78, 29, 91, 82, 86, 76, 3, 90])
# Select the fourth row
a1[3]
array([39, 98, 27, 95, 65, 2, 59, 72, 16])
# Select the first and the fourth row together
a1[[0,3]]
array([[44, 78, 29, 91, 82, 86, 76, 3, 90],
[39, 98, 27, 95, 65, 2, 59, 72, 16]])
# Select the first and the fourth column
a1[:,[0,3]]
array([[44, 91],
[69, 79],
[79, 88],
[39, 95],
[46, 65],
[38, 8],
[58, 99],
[66, 35]])
# Select subset of named rows and columns
a1[[0, 3]][:,[0, 1]] # Select the listed rows, then the listed columns.
# Note that a1[[0, 3],[0, 1]] does not work as expected: it pairs up the index lists
# and selects just the two points (0,0) and (3,1). Counterintuitive, but it is what it is.
array([[44, 78],
[39, 98]])
Operations on arrays
All math on arrays is element-wise, and scalars are multiplied/added with each element.
array1 + 4
array([[68, 82, 45, 28, 15],
[86, 33, 19, 37, 21],
[29, 72, 94, 70, 62]])
array1 > np.random.randint(0, 2, (3,5))
array([[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]])
array1 + 2
array([[66, 80, 43, 26, 13],
[84, 31, 17, 35, 19],
[27, 70, 92, 68, 60]])
np.sum(array1) # adds all the elements of an array
701
np.sum(array1, axis = 0) # adds all elements of the array along a particular axis
array([171, 175, 146, 123, 86])
Matrix math
NumPy has arrays as well as matrices. Matrices are 2-D; arrays can have any number of dimensions. The only real difference between a matrix (type = numpy.matrix) and an array (type = numpy.ndarray) is that array operations are element-wise, i.e. the special row-by-column matrix multiplication does not apply to the * operator on arrays. However, for a two-dimensional array you can use the @ operator to do matrix multiplication.
So that leaves matrices and arrays largely interchangeable in a practical sense, except that you can't take the inverse of an array using .I, which you can for a matrix (for an array, np.linalg.inv does the job, as shown further below).
# Create a matrix 'm' and an array 'a' that are identical
m = np.matrix(np.random.randint(0,10,(3,3)))
a = np.array(m)
m
matrix([[4, 2, 7],
[7, 0, 2],
[9, 3, 4]])
a
array([[4, 2, 7],
[7, 0, 2],
[9, 3, 4]])
Transpose with a .T
m.T
matrix([[4, 7, 9],
[2, 0, 3],
[7, 2, 4]])
a.T
array([[4, 7, 9],
[2, 0, 3],
[7, 2, 4]])
Inverse with a .I
Does not work for arrays
m.I
matrix([[-0.05825243, 0.12621359, 0.03883495],
[-0.09708738, -0.45631068, 0.39805825],
[ 0.2038835 , 0.05825243, -0.13592233]])
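For an array, the same inverse can be obtained with np.linalg.inv, which works on square 2-dimensional ndarrays. A short sketch using the array a defined above:
# Inverse of the array 'a' - equivalent to m.I for the matching matrix
np.linalg.inv(a)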
Matrix multiplication
For matrices, a plain * suffices for matrix multiplication. For arrays, use @ for matrix multiplication; @ also works for matrices, so to be safe, just use @.
The dot product is the same as row-by-column matrix multiplication, and is not element-wise.
a=np.matrix([[4, 3], [2, 1]])
b=np.mat([[1, 2], [3, 4]])
a
matrix([[4, 3],
[2, 1]])
b
matrix([[1, 2],
[3, 4]])
a*b
matrix([[13, 20],
[ 5, 8]])
a@b
matrix([[13, 20],
[ 5, 8]])
# Now check with arrays
a=np.array([[4, 3], [2, 1]])
b=np.array([[1, 2], [3, 4]])
a@b # does matrix multiplication.
array([[13, 20],
[ 5, 8]])
a
array([[4, 3],
[2, 1]])
b
array([[1, 2],
[3, 4]])
a*b # element-wise multiplication as a and b are arrays
array([[4, 6],
[6, 4]])
@ is the same as np.dot(a, b), which is just the longer, fully spelled-out function.
np.dot(a,b)
array([[13, 20],
[ 5, 8]])
Exponents with matrices and arrays: **
a = np.array([[4, 3], [2, 1]])
m = np.matrix(a)
m
matrix([[4, 3],
[2, 1]])
a**2 # Because a is an array, this will square each element of a.
array([[16, 9],
[ 4, 1]], dtype=int32)
m**2 # Because m is a matrix, this will be read as m*m, and dot product of the matrix with itself will result.
matrix([[22, 15],
[10, 7]])
which is the same as a@a
a@a
array([[22, 15],
[10, 7]])
Modulus, or size
The modulus is just sqrt(a^2 + b^2 + ... + n^2), where a, b, ..., n are the elements of the vector, matrix or array. It can be calculated using np.linalg.norm(a).
a = np.array([4,3,2,1])
np.linalg.norm(a)
5.477225575051661
# Same as calculating manually
(4**2 + 3**2 + 2**2 + 1**2) ** 0.5
5.477225575051661
b
array([[1, 2],
[3, 4]])
np.linalg.norm(b)
5.477225575051661
m
matrix([[4, 3],
[2, 1]])
np.linalg.norm(m)
5.477225575051661
m = np.matrix(np.random.randint(0,10,(3,3)))
m
matrix([[1, 4, 5],
[2, 3, 6],
[4, 6, 6]])
np.linalg.norm(m)
13.379088160259652
print(np.ravel(m))
print(type(np.ravel(m)))
print('Manual calculation for norm')
((np.ravel(m)**2).sum())**.5
[1 4 5 2 3 6 4 6 6]
<class 'numpy.ndarray'>
Manual calculation for norm
13.379088160259652
Determinant of a matrix np.linalg.det(a)
Used for calculating the inverse of a matrix, and only applies to square matrices.
np.linalg.det(m)
30.000000000000014
Converting from matrix to array and vice-versa
np.asmatrix and np.asarray allow you to convert one to the other, though above we have simply used np.matrix and np.array without any issue.
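A minimal sketch of the conversion, reusing the m and a objects created earlier (the variable names a2 and m2 are just for illustration):
# Matrix -> array, and array -> matrix
a2 = np.asarray(m)
m2 = np.asmatrix(a)
print(type(a2))   # <class 'numpy.ndarray'>
print(type(m2))   # <class 'numpy.matrix'>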
The above references: https://stackoverflow.com/questions/4151128/what-are-the-differences-between-numpy-arrays-and-matrices-which-one-should-i-u
Distances and angles between vectors
Size of a vector, angle between vectors, distance between vectors
# We set up two vectors a and b
a = np.array([1,2,3]); b = np.array([5,4,3])
print('a =',a)
print('b =',b)
a = [1 2 3]
b = [5 4 3]
# Size of the vector, computed as the square root of the sum of the squares of its elements
np.linalg.norm(a)
3.7416573867739413
# Distance between two vectors
np.linalg.norm(a - b)
4.47213595499958
# Which is the same as
print(np.sqrt(np.dot(a, a) - 2 * np.dot(a, b) + np.dot(b, b)))
(a@a + b@b - 2*a@b)**.5
4.47213595499958
4.47213595499958
# Combine the two vectors
X = np.concatenate((a,b)).reshape(2,3)
X
array([[1, 2, 3],
[5, 4, 3]])
# Euclidean distance is the default metric for this function
# from sklearn
from sklearn.metrics import pairwise_distances
pairwise_distances(X)
array([[0. , 4.47213595],
[4.47213595, 0. ]])
# Angle in radians between two vectors. To get the
# answer in degrees, multiply by 180/pi, or 180/math.pi (after import math). Also there is a function in math called
# math.radians to get radians from degrees, or math.degrees(x) to convert angle x from radians to degrees.
import math
angle_in_radians = np.arccos(np.dot(a,b) / (np.linalg.norm(a) * np.linalg.norm(b)))
angle_in_degrees = math.degrees(angle_in_radians)
print('Angle in degrees =', angle_in_degrees)
print('Angle in radians =', angle_in_radians)
Angle in degrees = 33.74461333141198
Angle in radians = 0.5889546074455115
# Same as above using math.acos instead of np.arccos
math.acos(np.dot(a,b) / (np.linalg.norm(a) * np.linalg.norm(b)))
0.5889546074455115
Sorting with argsort
argsort is like sort, but it returns the index positions that would sort the array rather than the sorted values themselves.
# We set up an array
a = np.array([20,10,30,0])
# Sorted indices
np.argsort(a)
array([3, 1, 0, 2], dtype=int64)
# Using the indices to get the sorted values
a[np.argsort(a)]
array([ 0, 10, 20, 30])
# Descending sort indices
np.argsort(a)[::-1]
array([2, 0, 1, 3], dtype=int64)
# Descending sort values
a[np.argsort(a)[::-1]]
array([30, 20, 10, 0])
Understanding DataFrames
As we discussed in the prior section, understanding and manipulating arrays of numbers is fundamental to the data science process. This is because nearly all ML and AI algorithms insist on being provided data arrays as inputs, and the NumPy library underpins almost all of data science.
As we discussed, a NumPy array is essentially a collection of numbers. This collection is organized along ‘dimensions’. So NumPy objects are n-dimensional array objects, or ndarray, a fast and efficient container for large datasets in Python.
But arrays have several limitations. One huge limitation is that they are raw containers of numbers; they don't have 'headers', or labels that describe the columns, rows, or additional dimensions. This means we need to keep track separately of what each dimension means. Another limitation is that beyond 3 dimensions, the additional dimensions are impossible to visualize in the human mind. For most practical purposes, humans like to think of data in tabular form, with just rows and columns; if there are more dimensions, one can use multiple tables.
This is where pandas steps in. Pandas uses DataFrames, a spreadsheet-like construct with rows and columns, where the rows and columns can have names or headings. Pandas DataFrames are easily converted to NumPy arrays, and algorithms will mostly accept a DataFrame as input just as they would an array.
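To make this concrete, here is a small illustrative sketch of the round trip between an array and a DataFrame (the variable names arr and df_demo are just for illustration):
import numpy as np
import pandas as pd
# Wrap a plain 2-D array in a DataFrame with row and column labels
arr = np.arange(6).reshape(2, 3)
df_demo = pd.DataFrame(arr, columns = ['a', 'b', 'c'], index = ['row1', 'row2'])
print(df_demo)
# Going back to a plain NumPy array drops the labels
print(df_demo.to_numpy())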
Exploring Tabular Data with Pandas
Tabular data is the most common type of data encountered, though ‘unstructured’ data is becoming increasingly common. Tabular data is two-dimensional, with rows and columns; the columns are defined, and we generally understand what they contain.
- Data is laid out as a 2-dimensional matrix, whether in a spreadsheet, or R/Python dataframes, or in a database table.
- Rows generally represent individual observations, while columns are the fields/variables.
- Variables can be numeric, or categorical.
- Numerical variables can be integers or floats, and are typically treated as continuous.
- Categorical variables may be nominal (eg, species, gender) or ordinal (eg, low, medium, high), and belong to a discrete set.
- Categorical variables are also called factors, and their distinct values are called levels.
- Algorithms often require categorical variables to be converted to numerical variables.
Unstructured data includes audio, video and other kinds of data that is useful for problems of perception. Unstructured data will almost invariably need to be converted into structured arrays with defined dimensions, but for the moment we will skip that.
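Returning to the point above about converting categorical variables to numbers, here is a minimal sketch of two common approaches (the data is made up for illustration):
import pandas as pd
# A hypothetical ordinal column
s = pd.Series(['low', 'high', 'medium', 'low'], dtype = 'category')
print(pd.get_dummies(s))   # one-hot encoding: one 0/1 column per category
print(s.cat.codes)         # integer codes: one number per category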
Reading data with Pandas
Pandas offers several different functions for reading different types of data:
- read_csv: Load comma-separated files
- read_table: Load tab-separated files
- read_fwf: Read data in fixed-width column format (i.e., no delimiters)
- read_clipboard: Read data from the clipboard; useful for converting tables from web pages
- read_excel: Read Excel files
- read_html: Read all tables found in the given HTML document
- read_json: Read data from a JSON (JavaScript Object Notation) file
- read_pickle: Read a pickle file
- read_sql: Read results of an SQL query
- read_sas: Read SAS files
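Usage is uniform across these functions: point them at a file (or URL) and get a DataFrame back. A short sketch follows; the file names are hypothetical placeholders.
import pandas as pd
# Hypothetical files - replace with paths to your own data
orders = pd.read_csv('orders.csv')          # comma-separated file
invoices = pd.read_excel('invoices.xlsx')   # Excel workbook
weblogs = pd.read_table('weblogs.tsv')      # tab-separated file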
Other data types in Python
- Lists are represented as []. Lists are a changeable (mutable) collection of elements, and the elements can be any Python data, eg strings, numbers, dictionaries, or even other lists.
- Dictionaries are enclosed in {}. These are 'key:value' pairs, where the 'key' is essentially a name given to a 'value'.
- Sets are also enclosed in {}, except they don't have colons separating key:value pairs. They are unordered collections of unique items.
- Tuples are collections of values enclosed in (). They are like lists, except that they are unchangeable (immutable).
# Example - creating a list
empty_list = []
list1 = ['a', 2,4, 'python']
list1
['a', 2, 4, 'python']
# Example - creating a dictionary
dict1 = {'first': ['John', 'Jane'], 'something_else': (1,2,3)}
dict1
{'first': ['John', 'Jane'], 'something_else': (1, 2, 3)}
dict1['first']
['John', 'Jane']
dict1['something_else']
(1, 2, 3)
# Checking the data type of the new variable we created
type(dict1)
dict
# Checking the data type
type(list1)
list
# Set operations
set1 = {1,2,4,5} # Sets can do intersect, union and difference
# Tuple example
tuple1 = 1, 3, 4 # or
tuple1 = (1, 3, 4)
tuple1
(1, 3, 4)
Loading built-in data sets in Python
Before we get into the details of EDA, we will first take a short detour to talk about datasets.
In order to experiment with EDA, we need some data. We can bring our own data, but for exploration and experimentation, it is often easy to load up one of the many in-built datasets accessible through Python. These datasets cover the spectrum - from really small datasets to those with many thousands of records, and include text data such as movie reviews and tweets.
We will leverage these built in datasets for the rest of the discussion as they provide a good path to creating reproducible examples. These datasets are great for experimenting, testing, doing tutorials and exercises.
The next few headings will cover these in-built datasets.
- The Statsmodels library provides access to several interesting inbuilt datasets in Python.
- The datasets available in R can also be accessed through statsmodels.
- The Seaborn library has several toy datasets available to explore.
- The Scikit Learn (sklearn) library also has in-built datasets.
- Scikit Learn also provides functions to generate random datasets with specified characteristics (e.g. the make_blobs function).
In the rest of this discussion, we will use these data sets and explore the data.
Some of these are described below, together with information on how to access and use such datasets.
# Load the regular libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
Loading data from Statsmodels
Statsmodels provides access to several datasets for use in examples, model testing, tutorials, etc. These can be accessed using sm.datasets.macrodata.load_pandas()['data'], where macrodata is just one example of a dataset. Pressing TAB after sm.datasets should bring up a pick-list of datasets to choose from.
The commands print(sm.datasets.macrodata.DESCRLONG) and print(sm.datasets.macrodata.NOTE) provide additional details on the dataset.
# Load macro economic data from Statsmodels
import statsmodels.api as sm
df = sm.datasets.macrodata.load_pandas()['data']
df
year | quarter | realgdp | realcons | realinv | realgovt | realdpi | cpi | m1 | tbilrate | unemp | pop | infl | realint | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1959.0 | 1.0 | 2710.349 | 1707.4 | 286.898 | 470.045 | 1886.9 | 28.980 | 139.7 | 2.82 | 5.8 | 177.146 | 0.00 | 0.00 |
1 | 1959.0 | 2.0 | 2778.801 | 1733.7 | 310.859 | 481.301 | 1919.7 | 29.150 | 141.7 | 3.08 | 5.1 | 177.830 | 2.34 | 0.74 |
2 | 1959.0 | 3.0 | 2775.488 | 1751.8 | 289.226 | 491.260 | 1916.4 | 29.350 | 140.5 | 3.82 | 5.3 | 178.657 | 2.74 | 1.09 |
3 | 1959.0 | 4.0 | 2785.204 | 1753.7 | 299.356 | 484.052 | 1931.3 | 29.370 | 140.0 | 4.33 | 5.6 | 179.386 | 0.27 | 4.06 |
4 | 1960.0 | 1.0 | 2847.699 | 1770.5 | 331.722 | 462.199 | 1955.5 | 29.540 | 139.6 | 3.50 | 5.2 | 180.007 | 2.31 | 1.19 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
198 | 2008.0 | 3.0 | 13324.600 | 9267.7 | 1990.693 | 991.551 | 9838.3 | 216.889 | 1474.7 | 1.17 | 6.0 | 305.270 | -3.16 | 4.33 |
199 | 2008.0 | 4.0 | 13141.920 | 9195.3 | 1857.661 | 1007.273 | 9920.4 | 212.174 | 1576.5 | 0.12 | 6.9 | 305.952 | -8.79 | 8.91 |
200 | 2009.0 | 1.0 | 12925.410 | 9209.2 | 1558.494 | 996.287 | 9926.4 | 212.671 | 1592.8 | 0.22 | 8.1 | 306.547 | 0.94 | -0.71 |
201 | 2009.0 | 2.0 | 12901.504 | 9189.0 | 1456.678 | 1023.528 | 10077.5 | 214.469 | 1653.6 | 0.18 | 9.2 | 307.226 | 3.37 | -3.19 |
202 | 2009.0 | 3.0 | 12990.341 | 9256.0 | 1486.398 | 1044.088 | 10040.6 | 216.385 | 1673.9 | 0.12 | 9.6 | 308.013 | 3.56 | -3.44 |
203 rows × 14 columns
# Print the description of the data
print(sm.datasets.macrodata.DESCRLONG)
US Macroeconomic Data for 1959Q1 - 2009Q3
# Print the data-dictionary for the different columns/fields in the data
print(sm.datasets.macrodata.NOTE)
::
Number of Observations - 203
Number of Variables - 14
Variable name definitions::
year - 1959q1 - 2009q3
quarter - 1-4
realgdp - Real gross domestic product (Bil. of chained 2005 US$,
seasonally adjusted annual rate)
realcons - Real personal consumption expenditures (Bil. of chained
2005 US$, seasonally adjusted annual rate)
realinv - Real gross private domestic investment (Bil. of chained
2005 US$, seasonally adjusted annual rate)
realgovt - Real federal consumption expenditures & gross investment
(Bil. of chained 2005 US$, seasonally adjusted annual rate)
realdpi - Real private disposable income (Bil. of chained 2005
US$, seasonally adjusted annual rate)
cpi - End of the quarter consumer price index for all urban
consumers: all items (1982-84 = 100, seasonally adjusted).
m1 - End of the quarter M1 nominal money stock (Seasonally
adjusted)
tbilrate - Quarterly monthly average of the monthly 3-month
treasury bill: secondary market rate
unemp - Seasonally adjusted unemployment rate (%)
pop - End of the quarter total population: all ages incl. armed
forces over seas
infl - Inflation rate (ln(cpi_{t}/cpi_{t-1}) * 400)
realint - Real interest rate (tbilrate - infl)
Importing R datasets using Statsmodels
Datasets available in R can also be imported using the command sm.datasets.get_rdataset('mtcars').data, where mtcars can be replaced by the appropriate dataset name.
# Import the mtcars dataset which contains attributes for 32 models of cars
mtcars = sm.datasets.get_rdataset('mtcars').data
mtcars.to_excel('mtcars.xlsx')
mtcars.describe()
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.000000 | 32.0000 |
mean | 20.090625 | 6.187500 | 230.721875 | 146.687500 | 3.596563 | 3.217250 | 17.848750 | 0.437500 | 0.406250 | 3.687500 | 2.8125 |
std | 6.026948 | 1.785922 | 123.938694 | 68.562868 | 0.534679 | 0.978457 | 1.786943 | 0.504016 | 0.498991 | 0.737804 | 1.6152 |
min | 10.400000 | 4.000000 | 71.100000 | 52.000000 | 2.760000 | 1.513000 | 14.500000 | 0.000000 | 0.000000 | 3.000000 | 1.0000 |
25% | 15.425000 | 4.000000 | 120.825000 | 96.500000 | 3.080000 | 2.581250 | 16.892500 | 0.000000 | 0.000000 | 3.000000 | 2.0000 |
50% | 19.200000 | 6.000000 | 196.300000 | 123.000000 | 3.695000 | 3.325000 | 17.710000 | 0.000000 | 0.000000 | 4.000000 | 2.0000 |
75% | 22.800000 | 8.000000 | 326.000000 | 180.000000 | 3.920000 | 3.610000 | 18.900000 | 1.000000 | 1.000000 | 4.000000 | 4.0000 |
max | 33.900000 | 8.000000 | 472.000000 | 335.000000 | 4.930000 | 5.424000 | 22.900000 | 1.000000 | 1.000000 | 5.000000 | 8.0000 |
# Load the famous Iris dataset
iris = sm.datasets.get_rdataset('iris').data
iris
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
Datasets in Seaborn
Several datasets are accessible through the Seaborn library.
# Get the names of all the datasets that are available through Seaborn
import seaborn as sns
sns.get_dataset_names()
['anagrams',
'anscombe',
'attention',
'brain_networks',
'car_crashes',
'diamonds',
'dots',
'dowjones',
'exercise',
'flights',
'fmri',
'geyser',
'glue',
'healthexp',
'iris',
'mpg',
'penguins',
'planets',
'seaice',
'taxis',
'tips',
'titanic']
# Load the diamonds dataset
diamonds = sns.load_dataset('diamonds')
diamonds.head(20)
carat | cut | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
5 | 0.24 | Very Good | J | VVS2 | 62.8 | 57.0 | 336 | 3.94 | 3.96 | 2.48 |
6 | 0.24 | Very Good | I | VVS1 | 62.3 | 57.0 | 336 | 3.95 | 3.98 | 2.47 |
7 | 0.26 | Very Good | H | SI1 | 61.9 | 55.0 | 337 | 4.07 | 4.11 | 2.53 |
8 | 0.22 | Fair | E | VS2 | 65.1 | 61.0 | 337 | 3.87 | 3.78 | 2.49 |
9 | 0.23 | Very Good | H | VS1 | 59.4 | 61.0 | 338 | 4.00 | 4.05 | 2.39 |
10 | 0.30 | Good | J | SI1 | 64.0 | 55.0 | 339 | 4.25 | 4.28 | 2.73 |
11 | 0.23 | Ideal | J | VS1 | 62.8 | 56.0 | 340 | 3.93 | 3.90 | 2.46 |
12 | 0.22 | Premium | F | SI1 | 60.4 | 61.0 | 342 | 3.88 | 3.84 | 2.33 |
13 | 0.31 | Ideal | J | SI2 | 62.2 | 54.0 | 344 | 4.35 | 4.37 | 2.71 |
14 | 0.20 | Premium | E | SI2 | 60.2 | 62.0 | 345 | 3.79 | 3.75 | 2.27 |
15 | 0.32 | Premium | E | I1 | 60.9 | 58.0 | 345 | 4.38 | 4.42 | 2.68 |
16 | 0.30 | Ideal | I | SI2 | 62.0 | 54.0 | 348 | 4.31 | 4.34 | 2.68 |
17 | 0.30 | Good | J | SI1 | 63.4 | 54.0 | 351 | 4.23 | 4.29 | 2.70 |
18 | 0.30 | Good | J | SI1 | 63.8 | 56.0 | 351 | 4.23 | 4.26 | 2.71 |
19 | 0.30 | Very Good | J | SI1 | 62.7 | 59.0 | 351 | 4.21 | 4.27 | 2.66 |
# Load the mpg dataset from Seaborn. This is similar to the mtcars dataset,
# but has a higher count of observations.
sns.load_dataset('mpg')
mpg | cylinders | displacement | horsepower | weight | acceleration | model_year | origin | name | |
---|---|---|---|---|---|---|---|---|---|
0 | 18.0 | 8 | 307.0 | 130.0 | 3504 | 12.0 | 70 | usa | chevrolet chevelle malibu |
1 | 15.0 | 8 | 350.0 | 165.0 | 3693 | 11.5 | 70 | usa | buick skylark 320 |
2 | 18.0 | 8 | 318.0 | 150.0 | 3436 | 11.0 | 70 | usa | plymouth satellite |
3 | 16.0 | 8 | 304.0 | 150.0 | 3433 | 12.0 | 70 | usa | amc rebel sst |
4 | 17.0 | 8 | 302.0 | 140.0 | 3449 | 10.5 | 70 | usa | ford torino |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
393 | 27.0 | 4 | 140.0 | 86.0 | 2790 | 15.6 | 82 | usa | ford mustang gl |
394 | 44.0 | 4 | 97.0 | 52.0 | 2130 | 24.6 | 82 | europe | vw pickup |
395 | 32.0 | 4 | 135.0 | 84.0 | 2295 | 11.6 | 82 | usa | dodge rampage |
396 | 28.0 | 4 | 120.0 | 79.0 | 2625 | 18.6 | 82 | usa | ford ranger |
397 | 31.0 | 4 | 119.0 | 82.0 | 2720 | 19.4 | 82 | usa | chevy s-10 |
398 rows × 9 columns
# Look at how many cars from each country are in the mpg dataset
sns.load_dataset('mpg').origin.value_counts()
usa 249
japan 79
europe 70
Name: origin, dtype: int64
# Build a histogram of the model year
sns.load_dataset('mpg').model_year.astype('category').hist();
# Create a random dataframe with random data
n = 25
df = pd.DataFrame(
{'state': list(np.random.choice(["New York", "Florida", "California"], size=(n))),
'gender': list(np.random.choice(["Male", "Female"], size=(n), p=[.4, .6])),
'education': list(np.random.choice(["High School", "Undergrad", "Grad"], size=(n))),
'housing': list(np.random.choice(["Rent", "Own"], size=(n))),
'height': list(np.random.randint(140,200,n)),
'weight': list(np.random.randint(100,150,n)),
'income': list(np.random.randint(50,250,n)),
'computers': list(np.random.randint(0,6,n))
})
df
state | gender | education | housing | height | weight | income | computers | |
---|---|---|---|---|---|---|---|---|
0 | California | Female | High School | Own | 190 | 119 | 111 | 0 |
1 | California | Female | High School | Own | 140 | 126 | 232 | 2 |
2 | New York | Female | High School | Rent | 169 | 123 | 111 | 1 |
3 | California | Female | High School | Own | 152 | 147 | 123 | 1 |
4 | New York | Female | Undergrad | Own | 197 | 111 | 206 | 4 |
5 | New York | Male | Grad | Own | 187 | 144 | 87 | 4 |
6 | California | Female | High School | Own | 189 | 115 | 75 | 5 |
7 | New York | Female | Undergrad | Own | 197 | 117 | 195 | 0 |
8 | Florida | Female | Grad | Own | 146 | 127 | 244 | 5 |
9 | New York | Female | Undergrad | Rent | 194 | 106 | 138 | 3 |
10 | New York | Female | Undergrad | Rent | 181 | 101 | 206 | 2 |
11 | California | Female | Undergrad | Rent | 156 | 121 | 243 | 3 |
12 | Florida | Male | Grad | Own | 184 | 143 | 129 | 0 |
13 | New York | Male | Grad | Own | 168 | 106 | 176 | 3 |
14 | New York | Female | Undergrad | Own | 141 | 112 | 225 | 4 |
15 | New York | Female | Undergrad | Rent | 171 | 105 | 66 | 5 |
16 | Florida | Female | Grad | Rent | 155 | 126 | 233 | 5 |
17 | California | Female | Undergrad | Rent | 193 | 106 | 162 | 4 |
18 | New York | Male | High School | Rent | 179 | 107 | 187 | 5 |
19 | California | Female | Undergrad | Own | 186 | 125 | 79 | 1 |
20 | California | Female | Grad | Own | 157 | 102 | 183 | 4 |
21 | Florida | Male | Undergrad | Rent | 174 | 109 | 94 | 5 |
22 | New York | Female | Grad | Own | 162 | 107 | 140 | 1 |
23 | New York | Female | Grad | Rent | 198 | 142 | 193 | 4 |
24 | Florida | Male | High School | Rent | 174 | 115 | 55 | 1 |
# Load the 'Old Faithful' eruption data
sns.load_dataset('geyser')
duration | waiting | kind | |
---|---|---|---|
0 | 3.600 | 79 | long |
1 | 1.800 | 54 | short |
2 | 3.333 | 74 | long |
3 | 2.283 | 62 | short |
4 | 4.533 | 85 | long |
... | ... | ... | ... |
267 | 4.117 | 81 | long |
268 | 2.150 | 46 | short |
269 | 4.417 | 90 | long |
270 | 1.817 | 46 | short |
271 | 4.467 | 74 | long |
272 rows × 3 columns
Datasets in sklearn
Scikit Learn also has several built-in datasets that can be used to experiment with functions and algorithms. Some are listed below:
- load_boston(*[, return_X_y]): Load and return the Boston house-prices dataset (regression). (Note that load_boston has been removed from recent scikit-learn versions.)
- load_iris(*[, return_X_y, as_frame]): Load and return the iris dataset (classification).
- load_diabetes(*[, return_X_y, as_frame]): Load and return the diabetes dataset (regression).
- load_digits(*[, n_class, return_X_y, as_frame]): Load and return the digits dataset (classification).
- load_linnerud(*[, return_X_y, as_frame]): Load and return the physical exercise Linnerud dataset.
- load_wine(*[, return_X_y, as_frame]): Load and return the wine dataset (classification).
- load_breast_cancer(*[, return_X_y, as_frame]): Load and return the breast cancer Wisconsin dataset (classification).
Let us import the wine dataset next, and the California housing dataset after that.
from sklearn import datasets
X = datasets.load_wine()['data']
y = datasets.load_wine()['target']
features = datasets.load_wine()['feature_names']
DESCR = datasets.load_wine()['DESCR']
classes = datasets.load_wine()['target_names']
wine_df = pd.DataFrame(X, columns = features)
wine_df.insert(0,'WineType', y)
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)
df = wine_df[(wine_df['WineType'] != 2)]
# Let us look at the DESCR for the dataframe we just loaded
print(DESCR)
.. _wine_dataset:
Wine recognition dataset
------------------------
**Data Set Characteristics:**
:Number of Instances: 178
:Number of Attributes: 13 numeric, predictive attributes and the class
:Attribute Information:
- Alcohol
- Malic acid
- Ash
- Alcalinity of ash
- Magnesium
- Total phenols
- Flavanoids
- Nonflavanoid phenols
- Proanthocyanins
- Color intensity
- Hue
- OD280/OD315 of diluted wines
- Proline
- class:
- class_0
- class_1
- class_2
:Summary Statistics:
============================= ==== ===== ======= =====
Min Max Mean SD
============================= ==== ===== ======= =====
Alcohol: 11.0 14.8 13.0 0.8
Malic Acid: 0.74 5.80 2.34 1.12
Ash: 1.36 3.23 2.36 0.27
Alcalinity of Ash: 10.6 30.0 19.5 3.3
Magnesium: 70.0 162.0 99.7 14.3
Total Phenols: 0.98 3.88 2.29 0.63
Flavanoids: 0.34 5.08 2.03 1.00
Nonflavanoid Phenols: 0.13 0.66 0.36 0.12
Proanthocyanins: 0.41 3.58 1.59 0.57
Colour Intensity: 1.3 13.0 5.1 2.3
Hue: 0.48 1.71 0.96 0.23
OD280/OD315 of diluted wines: 1.27 4.00 2.61 0.71
Proline: 278 1680 746 315
============================= ==== ===== ======= =====
:Missing Attribute Values: None
:Class Distribution: class_0 (59), class_1 (71), class_2 (48)
:Creator: R.A. Fisher
:Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
:Date: July, 1988
This is a copy of UCI ML Wine recognition datasets.
https://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data
The data is the results of a chemical analysis of wines grown in the same
region in Italy by three different cultivators. There are thirteen different
measurements taken for different constituents found in the three types of
wine.
Original Owners:
Forina, M. et al, PARVUS -
An Extendible Package for Data Exploration, Classification and Correlation.
Institute of Pharmaceutical and Food Analysis and Technologies,
Via Brigata Salerno, 16147 Genoa, Italy.
Citation:
Lichman, M. (2013). UCI Machine Learning Repository
[https://archive.ics.uci.edu/ml]. Irvine, CA: University of California,
School of Information and Computer Science.
.. topic:: References
(1) S. Aeberhard, D. Coomans and O. de Vel,
Comparison of Classifiers in High Dimensional Settings,
Tech. Rep. no. 92-02, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Technometrics).
The data was used with many others for comparing various
classifiers. The classes are separable, though only RDA
has achieved 100% correct classification.
(RDA : 100%, QDA 99.4%, LDA 98.9%, 1NN 96.1% (z-transformed data))
(All results using the leave-one-out technique)
(2) S. Aeberhard, D. Coomans and O. de Vel,
"THE CLASSIFICATION PERFORMANCE OF RDA"
Tech. Rep. no. 92-01, (1992), Dept. of Computer Science and Dept. of
Mathematics and Statistics, James Cook University of North Queensland.
(Also submitted to Journal of Chemometrics).
# California housing dataset. medv is the median value of the homes
from sklearn import datasets
X = datasets.fetch_california_housing()['data']
y = datasets.fetch_california_housing()['target']
features = datasets.fetch_california_housing()['feature_names']
DESCR = datasets.fetch_california_housing()['DESCR']
cali_df = pd.DataFrame(X, columns = features)
cali_df.insert(0,'medv', y)
cali_df
medv | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | |
---|---|---|---|---|---|---|---|---|---|
0 | 4.526 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 |
1 | 3.585 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 |
2 | 3.521 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 |
3 | 3.413 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 |
4 | 3.422 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
20635 | 0.781 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 |
20636 | 0.771 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 |
20637 | 0.923 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 |
20638 | 0.847 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 |
20639 | 0.894 | 2.3886 | 16.0 | 5.254717 | 1.162264 | 1387.0 | 2.616981 | 39.37 | -121.24 |
20640 rows × 9 columns
# Again, we can look at what the various columns mean
print(DESCR)
.. _california_housing_dataset:
California Housing dataset
--------------------------
**Data Set Characteristics:**
:Number of Instances: 20640
:Number of Attributes: 8 numeric, predictive attributes and the target
:Attribute Information:
- MedInc median income in block group
- HouseAge median house age in block group
- AveRooms average number of rooms per household
- AveBedrms average number of bedrooms per household
- Population block group population
- AveOccup average number of household members
- Latitude block group latitude
- Longitude block group longitude
:Missing Attribute Values: None
This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html
The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).
This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).
An household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surpinsingly large values for block groups with few households
and many empty houses, such as vacation resorts.
It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.
.. topic:: References
- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
Statistics and Probability Letters, 33 (1997) 291-297
Create Artificial Data using sklearn
In addition to the built-in datasets, it is possible to create artificial data of arbitrary size to test or explain different algorithms for solving classification (both binary and multi-class) as well as regression problems.
One example using the make_blobs function is provided below, but a great deal more detail is available at https://scikit-learn.org/stable/datasets/sample_generators.html#sample-generators
make_blobs and make_classification can create multiclass datasets, and make_regression can be used for creating regression datasets with specified characteristics. Refer to the sklearn documentation link above to learn more.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs
X, y, centers = make_blobs(n_samples=1000, centers=3, n_features=2,
random_state=0, return_centers=True, center_box=(0,20),
cluster_std = 1.1)
df = pd.DataFrame(dict(x1=X[:,0], x2=X[:,1], label=y))
df = round(df,ndigits=2)
df
x1 | x2 | label | |
---|---|---|---|
0 | 9.26 | 12.64 | 2 |
1 | 12.02 | 14.14 | 0 |
2 | 8.50 | 13.12 | 2 |
3 | 8.93 | 12.87 | 2 |
4 | 7.37 | 11.82 | 2 |
... | ... | ... | ... |
995 | 11.94 | 10.92 | 1 |
996 | 9.40 | 12.17 | 2 |
997 | 10.25 | 10.45 | 1 |
998 | 7.37 | 12.01 | 2 |
999 | 11.01 | 11.17 | 1 |
1000 rows × 3 columns
plt.figure(figsize=(6,6))
sns.scatterplot(data = df, x = 'x1', y = 'x2', hue = 'label',
alpha = .8, palette="deep",edgecolor = 'None');
Exploratory Data Analysis using Python
After all of this lengthy introduction, we are finally ready to get started with actually performing some EDA.
As mentioned earlier, EDA is unstructured exploration; there is no set list of activities you must perform. Generally, you probe the data and, depending upon what you discover, you ask more questions.
Things we will do:
- Look at how to read different types of data
- Understand how to access in-built datasets in Python
- Calculate the summary statistics covered in the prior class
- Perform basic graphing using Pandas to explore the data
- Understand group-by and pivoting functions (the split-apply-combine process)
- Look at pandas-profiling, a library that can perform many data exploration tasks
Pandas is a library we will be using often, and is something we will use to explore data and perform EDA. We will also use NumPy and SciPy.
# Load the regular libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
A note on managing working directories
A very basic problem one runs into when trying to load data files is the file path - specifically, when the file is not located in Python's current working directory.
Generally, reading a CSV file is simple: pd.read_csv pointed at the filename does the trick. If the file exists but pandas returns an error, it may be because the file is not located in your working directory. In that case, enter the complete path to the file.
Alternatively, you can move the file to your working directory. To check and change your working directory, use the following code:
import os
# To check current working directory:
os.getcwd()
'C:\\Users\\user\\Google Drive\\jupyter'
Or, you could type pwd in a cell. Be aware that pwd should be on the first line of the cell!
pwd
'C:\\Users\\user\\Google Drive\\jupyter'
# To change working directory
os.chdir(r'C:\Users\user\Google Drive\jupyter')
EDA on the diamonds dataset
Questions we might like answered
Below is a repeat of what was said in the introduction to this chapter, just to avoid having to go back to check what we are trying to do. When performing EDA, we want to explore data in an unstructured way, and try to get a 'feel' for the data. The kinds of questions we may want to answer are:
- How much data do we have - number of rows in the data?
- How many columns, or fields do we have in the dataset?
- Data types - which of the columns appear to be numeric, dates or strings?
- Names of the columns, and do they tell us anything?
- A visual review of a sample of the dataset
- Completeness of the dataset, are missing values obvious? Columns that are largely empty?
- Unique values for columns that appear to be categorical, and how many observations of each category?
- For numeric columns, the range of values (calculated from min and max values)
- Distributions for the different columns, possibly graphed
- Correlations between the different columns
Load data
We will start our exploration with the diamonds dataset.
The ‘diamonds’ dataset has 50k+ records, each representing a single diamond. The weight and other attributes are available, and so is the price.
The dataset allows us to experiment with a variety of prediction techniques and algorithms. Below are the columns in the dataset, and their description.
Column | Description |
---|---|
price | price in US dollars ($326--$18,823) |
carat | weight of the diamond (0.2--5.01) |
cut | quality of the cut (Fair, Good, Very Good, Premium, Ideal) |
color | diamond colour, from J (worst) to D (best) |
clarity | a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)) |
x | length in mm (0--10.74) |
y | width in mm (0--58.9) |
z | depth in mm (0--31.8) |
depth | total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79) |
table | width of top of diamond relative to widest point (43--95) |
# Load data from seaborn
df = sns.load_dataset('diamonds')
df
carat | cut | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
53939 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
53940 rows × 10 columns
Descriptive stats
The pandas describe() function provides a variety of summary statistics. Review the table below, and notice that the categorical variables were ignored: by default describe() reports only the numeric columns, since these statistics do not make sense for categorical variables. (Passing include='all' brings the categorical columns in as well, as shown after the table.)
# Let us look at some descriptive statistics for the numerical variables
df.describe()
carat | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|
count | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 | 53940.000000 |
mean | 0.797940 | 61.749405 | 57.457184 | 3932.799722 | 5.731157 | 5.734526 | 3.538734 |
std | 0.474011 | 1.432621 | 2.234491 | 3989.439738 | 1.121761 | 1.142135 | 0.705699 |
min | 0.200000 | 43.000000 | 43.000000 | 326.000000 | 0.000000 | 0.000000 | 0.000000 |
25% | 0.400000 | 61.000000 | 56.000000 | 950.000000 | 4.710000 | 4.720000 | 2.910000 |
50% | 0.700000 | 61.800000 | 57.000000 | 2401.000000 | 5.700000 | 5.710000 | 3.530000 |
75% | 1.040000 | 62.500000 | 59.000000 | 5324.250000 | 6.540000 | 6.540000 | 4.040000 |
max | 5.010000 | 79.000000 | 95.000000 | 18823.000000 | 10.740000 | 58.900000 | 31.800000 |
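As noted above, describe() skips the categorical columns by default. Passing include='all' includes them, reporting counts, unique values and the most frequent category; a quick sketch, assuming the same diamonds DataFrame df:
# Include categorical columns in the summary as well
df.describe(include='all')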
df.info() gives you information on the dataset.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 53940 non-null float64
1 cut 53940 non-null category
2 color 53940 non-null category
3 clarity 53940 non-null category
4 depth 53940 non-null float64
5 table 53940 non-null float64
6 price 53940 non-null int64
7 x 53940 non-null float64
8 y 53940 non-null float64
9 z 53940 non-null float64
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB
Similarly, df.shape gives you a tuple with the counts of rows and columns.
Trivia:
- Note there is no () after df.shape, as it is a property. Properties are the 'attributes' of an object, which can be set using methods.
- Methods are like functions, but are built in and apply to an object. They are part of the class definition for the object.
df.shape
(53940, 10)
df.columns gives you the names of the columns.
df.columns
Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
'z'],
dtype='object')
Exploring individual columns
Pandas provide a large number of functions that allow us to explore several statistics relating to individual variables.
Measures | Function (from Pandas, unless otherwise stated) |
---|---|
Central Tendency | |
Mean | mean() |
Geometric Mean | gmean() (from scipy.stats) |
Median | median() |
Mode | mode() |
Measures of Variability | |
Range | max() - min() |
Variance | var() |
Standard Deviation | std() |
Coefficient of Variation | std() / mean() |
Measures of Association | |
Covariance | cov() |
Correlation | corr() |
Analyzing Distributions | |
Percentiles | quantile() |
Quartiles | quantile() |
Z-Scores | zscore (from scipy.stats) |
We examine many of these in action below.
Functions for descriptive stats
# Mean
df.mean(numeric_only=True)
carat 0.797940
depth 61.749405
table 57.457184
price 3932.799722
x 5.731157
y 5.734526
z 3.538734
dtype: float64
# Median
df.median(numeric_only=True)
carat 0.70
depth 61.80
table 57.00
price 2401.00
x 5.70
y 5.71
z 3.53
dtype: float64
# Mode
df.mode()
carat | cut | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.3 | Ideal | G | SI1 | 62.0 | 56.0 | 605 | 4.37 | 4.34 | 2.7 |
# Min, also max works as well
df.min(numeric_only=True)
carat 0.2
depth 43.0
table 43.0
price 326.0
x 0.0
y 0.0
z 0.0
dtype: float64
# Variance
df.var(numeric_only=True)
carat 2.246867e-01
depth 2.052404e+00
table 4.992948e+00
price 1.591563e+07
x 1.258347e+00
y 1.304472e+00
z 4.980109e-01
dtype: float64
# Standard Deviation
df.std(numeric_only=True)
carat 0.474011
depth 1.432621
table 2.234491
price 3989.439738
x 1.121761
y 1.142135
z 0.705699
dtype: float64
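The remaining measures in the table above - the geometric mean, coefficient of variation, quantiles and z-scores - can be computed just as directly. A brief sketch using the price column, assuming the same diamonds DataFrame df (scipy.stats supplies gmean and zscore):
from scipy import stats
# Geometric mean of price
print(stats.gmean(df['price']))
# Coefficient of variation = std / mean
print(df['price'].std() / df['price'].mean())
# Quartiles via quantile()
print(df['price'].quantile([0.25, 0.5, 0.75]))
# First few z-scores for price
print(stats.zscore(df['price'])[:5])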
Some quick histograms
Histograms allow us to look at the distribution of the data. The df.colname.hist() function creates quick histograms (or column charts in the case of categorical variables).
Visualization using Matplotlib is covered in a different chapter.
# A quick histogram
df.carat.hist();
df.depth.hist();
df.cut.hist();
# All together
df.hist(figsize=(16,10));
Calculate range
# Let us calculate the range manually
df.depth.max() - df.depth.min()
36.0
Covariance and correlations
# Let us do the covariance matrix, which is a one-liner with pandas
# (in recent pandas versions you may need numeric_only=True here, and in corr() below, to skip the categorical columns)
df.cov()
carat | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|
carat | 0.224687 | 0.019167 | 0.192365 | 1.742765e+03 | 0.518484 | 0.515248 | 0.318917 |
depth | 0.019167 | 2.052404 | -0.946840 | -6.085371e+01 | -0.040641 | -0.048009 | 0.095968 |
table | 0.192365 | -0.946840 | 4.992948 | 1.133318e+03 | 0.489643 | 0.468972 | 0.237996 |
price | 1742.765364 | -60.853712 | 1133.318064 | 1.591563e+07 | 3958.021491 | 3943.270810 | 2424.712613 |
x | 0.518484 | -0.040641 | 0.489643 | 3.958021e+03 | 1.258347 | 1.248789 | 0.768487 |
y | 0.515248 | -0.048009 | 0.468972 | 3.943271e+03 | 1.248789 | 1.304472 | 0.767320 |
z | 0.318917 | 0.095968 | 0.237996 | 2.424713e+03 | 0.768487 | 0.767320 | 0.498011 |
# Now the correlation matrix - another one-liner
df.corr()
carat | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|
carat | 1.000000 | 0.028224 | 0.181618 | 0.921591 | 0.975094 | 0.951722 | 0.953387 |
depth | 0.028224 | 1.000000 | -0.295779 | -0.010647 | -0.025289 | -0.029341 | 0.094924 |
table | 0.181618 | -0.295779 | 1.000000 | 0.127134 | 0.195344 | 0.183760 | 0.150929 |
price | 0.921591 | -0.010647 | 0.127134 | 1.000000 | 0.884435 | 0.865421 | 0.861249 |
x | 0.975094 | -0.025289 | 0.195344 | 0.884435 | 1.000000 | 0.974701 | 0.970772 |
y | 0.951722 | -0.029341 | 0.183760 | 0.865421 | 0.974701 | 1.000000 | 0.952006 |
z | 0.953387 | 0.094924 | 0.150929 | 0.861249 | 0.970772 | 0.952006 | 1.000000 |
# We can also calculate the correlations individually between given variables
df[['carat', 'depth']].corr()
carat | depth | |
---|---|---|
carat | 1.000000 | 0.028224 |
depth | 0.028224 | 1.000000 |
# We can create a heatmap of correlations
plt.figure(figsize = (8,8))
sns.heatmap(df.corr(), annot=True);
plt.show()
# We can calculate phi-k correlations as well
import phik
X = df.phik_matrix()
X
interval columns not set, guessing: ['carat', 'depth', 'table', 'price', 'x', 'y', 'z']
carat | cut | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|
carat | 1.000000 | 0.270726 | 0.261376 | 0.320729 | 0.093835 | 0.127877 | 0.860178 | 0.885596 | 0.685737 | 0.821934 |
cut | 0.270726 | 1.000000 | 0.057308 | 0.229186 | 0.604758 | 0.441720 | 0.220674 | 0.237591 | 0.131938 | 0.115199 |
color | 0.261376 | 0.057308 | 1.000000 | 0.146758 | 0.040634 | 0.039959 | 0.183244 | 0.238246 | 0.191040 | 0.140158 |
clarity | 0.320729 | 0.229186 | 0.146758 | 1.000000 | 0.154796 | 0.148489 | 0.295205 | 0.435204 | 0.419662 | 0.425129 |
depth | 0.093835 | 0.604758 | 0.040634 | 0.154796 | 1.000000 | 0.362929 | 0.064652 | 0.124055 | 0.073533 | 0.097474 |
table | 0.127877 | 0.441720 | 0.039959 | 0.148489 | 0.362929 | 1.000000 | 0.115604 | 0.187285 | 0.190942 | 0.121229 |
price | 0.860178 | 0.220674 | 0.183244 | 0.295205 | 0.064652 | 0.115604 | 1.000000 | 0.755270 | 0.714089 | 0.656248 |
x | 0.885596 | 0.237591 | 0.238246 | 0.435204 | 0.124055 | 0.187285 | 0.755270 | 1.000000 | 0.822881 | 0.882911 |
y | 0.685737 | 0.131938 | 0.191040 | 0.419662 | 0.073533 | 0.190942 | 0.714089 | 0.822881 | 1.000000 | 0.816241 |
z | 0.821934 | 0.115199 | 0.140158 | 0.425129 | 0.097474 | 0.121229 | 0.656248 | 0.882911 | 0.816241 | 1.000000 |
sns.heatmap(X, annot=True);
Detailed Phi-k correlation report
from phik import report
phik.report.correlation_report(df)
Quantiles to analyze the distribution
# Calculating quantiles
# Here we calculate the 30th quantile
df.quantile(0.30)
carat 0.42
depth 61.20
table 56.00
price 1087.00
x 4.82
y 4.83
z 2.98
Name: 0.3, dtype: float64
# Calculating multiple quantiles
df.quantile([.1,.3,.5,.75])
carat | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|
0.10 | 0.31 | 60.0 | 55.0 | 646.00 | 4.36 | 4.36 | 2.69 |
0.30 | 0.42 | 61.2 | 56.0 | 1087.00 | 4.82 | 4.83 | 2.98 |
0.50 | 0.70 | 61.8 | 57.0 | 2401.00 | 5.70 | 5.71 | 3.53 |
0.75 | 1.04 | 62.5 | 59.0 | 5324.25 | 6.54 | 6.54 | 4.04 |
Z-scores
# Z-scores for two of the columns (x - mean(x))/std(x)
from scipy.stats import zscore
zscores = zscore(df[['carat', 'depth']])
# Verify z-scores have mean of 0 and standard deviation of 1:
print('Z-scores: \n', zscores, '\n')
print('Mean is: ', zscores.mean(axis = 0), '\n')
print('Std Deviation is: ', zscores.std(axis = 0), '\n')
Z-scores:
carat depth
0 -1.198168 -0.174092
1 -1.240361 -1.360738
2 -1.198168 -3.385019
3 -1.071587 0.454133
4 -1.029394 1.082358
... ... ...
53935 -0.164427 -0.662711
53936 -0.164427 0.942753
53937 -0.206621 0.733344
53938 0.130927 -0.523105
53939 -0.101137 0.314528
[53940 rows x 2 columns]
Mean is: carat 2.889982e-14
depth -3.658830e-15
dtype: float64
Std Deviation is: carat 1.000009
depth 1.000009
dtype: float64
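As a cross-check, the same z-scores can be computed by hand. A small sketch, assuming df is unchanged; note that scipy's zscore uses the population standard deviation (ddof=0) by default, while pandas' std() defaults to ddof=1, which is also why the standard deviation of the z-scores above prints as 1.000009 rather than exactly 1.
# Manual z-scores for carat: (x - mean(x)) / std(x), matching scipy's ddof=0 default
manual_z = (df.carat - df.carat.mean()) / df.carat.std(ddof=0)
manual_z.head()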
Dataframe information
# Look at some dataframe information
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 53940 non-null float64
1 cut 53940 non-null category
2 color 53940 non-null category
3 clarity 53940 non-null category
4 depth 53940 non-null float64
5 table 53940 non-null float64
6 price 53940 non-null int64
7 x 53940 non-null float64
8 y 53940 non-null float64
9 z 53940 non-null float64
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB
Names of columns
# Column names
df.columns
Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
'z'],
dtype='object')
Other useful functions
- Sort: df.sort_values(['price', 'table'], ascending = [False, True]).head()
- Unique values: df.cut.unique()
- Count of unique values: df.cut.nunique()
- Value Counts: df.cut.value_counts()
- Take a sample from a dataframe: diamonds.sample(4) (or n=4)
- Rename columns: df.rename(columns = {'price':'dollars'}, inplace = True)
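A quick sketch of a few of these in action, assuming df still holds the diamonds data:
# Number of distinct categories in cut
df.cut.nunique()
# Frequency of each cut category
df.cut.value_counts()
# Four random rows from the dataframe
df.sample(n=4)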
Split-Apply-Combine
The phrase Split-Apply-Combine was popularized by Hadley Wickham, author of the dplyr package in R. His original paper on the topic can be downloaded at https://www.jstatsoft.org/article/download/v040i01/468
Conceptually, it involves:
- Splitting the data into sub-groups based on some filtering criteria
- Applying a function to each sub-group and obtaining a result
- Combining the results into one single dataframe.
Split-Apply-Combine does not represent three separate steps in data analysis, but a way of thinking about problems: break them up into manageable pieces, operate on each piece independently, and put the pieces back together.
In Python, the Split-Apply-Combine operations are implemented using different functions such as pivot, pivot_table, crosstab, groupby and possibly others.
Ref: http://www.jstatsoft.org/v40/i01/
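As a minimal sketch of the idea (assuming df still holds the diamonds data): group the rows by cut (split), take the mean price within each group (apply), and let pandas return the results as a single series (combine).
# Split by cut, apply the mean to price in each group, combine into one series
df.groupby('cut')['price'].mean()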
Stack
Even though stack
and unstack
do not pivot data, they reshape data in a fundamental way that deserves a mention alongside the standard split-apply-combine techniques.
What stack does is to completely flatten out a dataframe by bringing all columns down against the index. The index becomes a multi-level index, and all the columns show up against every single row.
The result is a pandas series, with as many rows as the rows times columns in the original dataset.
You can then move the index into the columns of a dataframe by doing reset_index().
Let us first consider a simpler dataframe with just a few entries.
Example 1
df = pd.DataFrame([[9, 10], [14, 30]],
index=['cat', 'dog'],
columns=['weight-lbs', 'height-in'])
df
weight-lbs | height-in | |
---|---|---|
cat | 9 | 10 |
dog | 14 | 30 |
df.stack()
cat weight-lbs 9
height-in 10
dog weight-lbs 14
height-in 30
dtype: int64
# Convert this to a dataframe
pd.DataFrame(df.stack()).reset_index().rename({'level_0': 'animal', 'level_1':'measure', 0: 'value'}, axis=1)
animal | measure | value | |
---|---|---|---|
0 | cat | weight-lbs | 9 |
1 | cat | height-in | 10 |
2 | dog | weight-lbs | 14 |
3 | dog | height-in | 30 |
type(df.stack())
pandas.core.series.Series
df.stack().index
MultiIndex([('cat', 'weight-lbs'),
('cat', 'height-in'),
('dog', 'weight-lbs'),
('dog', 'height-in')],
)
Example 2:
Now we look at a larger dataframe.
import statsmodels.api as sm
iris = sm.datasets.get_rdataset('iris').data
# Let us look at the original data before we stack it
iris
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
... | ... | ... | ... | ... | ... |
145 | 6.7 | 3.0 | 5.2 | 2.3 | virginica |
146 | 6.3 | 2.5 | 5.0 | 1.9 | virginica |
147 | 6.5 | 3.0 | 5.2 | 2.0 | virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | virginica |
150 rows × 5 columns
iris.stack()
0 Sepal.Length 5.1
Sepal.Width 3.5
Petal.Length 1.4
Petal.Width 0.2
Species setosa
...
149 Sepal.Length 5.9
Sepal.Width 3.0
Petal.Length 5.1
Petal.Width 1.8
Species virginica
Length: 750, dtype: object
We had 150 rows and 5 columns in our original dataset, and we would therefore expect 150 * 5 = 750 items in our stacked series, which we can verify below.
iris.shape[0] * iris.shape[1]
750
Example 3:
We stack the mtcars dataset.
mtcars = sm.datasets.get_rdataset('mtcars').data
mtcars
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
mtcars.stack()
Mazda RX4 mpg 21.0
cyl 6.0
disp 160.0
hp 110.0
drat 3.9
...
Volvo 142E qsec 18.6
vs 1.0
am 1.0
gear 4.0
carb 2.0
Length: 352, dtype: float64
Unstack
Unstack is the same as stacking the transpose of a dataframe.
So you flip the rows and columns of the dataframe, and then do a stack.
mtcars.transpose()
Mazda RX4 | Mazda RX4 Wag | Datsun 710 | Hornet 4 Drive | Hornet Sportabout | Valiant | Duster 360 | Merc 240D | Merc 230 | Merc 280 | ... | AMC Javelin | Camaro Z28 | Pontiac Firebird | Fiat X1-9 | Porsche 914-2 | Lotus Europa | Ford Pantera L | Ferrari Dino | Maserati Bora | Volvo 142E | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mpg | 21.00 | 21.000 | 22.80 | 21.400 | 18.70 | 18.10 | 14.30 | 24.40 | 22.80 | 19.20 | ... | 15.200 | 13.30 | 19.200 | 27.300 | 26.00 | 30.400 | 15.80 | 19.70 | 15.00 | 21.40 |
cyl | 6.00 | 6.000 | 4.00 | 6.000 | 8.00 | 6.00 | 8.00 | 4.00 | 4.00 | 6.00 | ... | 8.000 | 8.00 | 8.000 | 4.000 | 4.00 | 4.000 | 8.00 | 6.00 | 8.00 | 4.00 |
disp | 160.00 | 160.000 | 108.00 | 258.000 | 360.00 | 225.00 | 360.00 | 146.70 | 140.80 | 167.60 | ... | 304.000 | 350.00 | 400.000 | 79.000 | 120.30 | 95.100 | 351.00 | 145.00 | 301.00 | 121.00 |
hp | 110.00 | 110.000 | 93.00 | 110.000 | 175.00 | 105.00 | 245.00 | 62.00 | 95.00 | 123.00 | ... | 150.000 | 245.00 | 175.000 | 66.000 | 91.00 | 113.000 | 264.00 | 175.00 | 335.00 | 109.00 |
drat | 3.90 | 3.900 | 3.85 | 3.080 | 3.15 | 2.76 | 3.21 | 3.69 | 3.92 | 3.92 | ... | 3.150 | 3.73 | 3.080 | 4.080 | 4.43 | 3.770 | 4.22 | 3.62 | 3.54 | 4.11 |
wt | 2.62 | 2.875 | 2.32 | 3.215 | 3.44 | 3.46 | 3.57 | 3.19 | 3.15 | 3.44 | ... | 3.435 | 3.84 | 3.845 | 1.935 | 2.14 | 1.513 | 3.17 | 2.77 | 3.57 | 2.78 |
qsec | 16.46 | 17.020 | 18.61 | 19.440 | 17.02 | 20.22 | 15.84 | 20.00 | 22.90 | 18.30 | ... | 17.300 | 15.41 | 17.050 | 18.900 | 16.70 | 16.900 | 14.50 | 15.50 | 14.60 | 18.60 |
vs | 0.00 | 0.000 | 1.00 | 1.000 | 0.00 | 1.00 | 0.00 | 1.00 | 1.00 | 1.00 | ... | 0.000 | 0.00 | 0.000 | 1.000 | 0.00 | 1.000 | 0.00 | 0.00 | 0.00 | 1.00 |
am | 1.00 | 1.000 | 1.00 | 0.000 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 0.000 | 0.00 | 0.000 | 1.000 | 1.00 | 1.000 | 1.00 | 1.00 | 1.00 | 1.00 |
gear | 4.00 | 4.000 | 4.00 | 3.000 | 3.00 | 3.00 | 3.00 | 4.00 | 4.00 | 4.00 | ... | 3.000 | 3.00 | 3.000 | 4.000 | 5.00 | 5.000 | 5.00 | 5.00 | 5.00 | 4.00 |
carb | 4.00 | 4.000 | 1.00 | 1.000 | 2.00 | 1.00 | 4.00 | 2.00 | 2.00 | 4.00 | ... | 2.000 | 4.00 | 2.000 | 1.000 | 2.00 | 2.000 | 4.00 | 6.00 | 8.00 | 2.00 |
11 rows × 32 columns
mtcars.unstack()
mpg Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
Hornet 4 Drive 21.4
Hornet Sportabout 18.7
...
carb Lotus Europa 2.0
Ford Pantera L 4.0
Ferrari Dino 6.0
Maserati Bora 8.0
Volvo 142E 2.0
Length: 352, dtype: float64
mtcars.transpose().stack()
mpg Mazda RX4 21.0
Mazda RX4 Wag 21.0
Datsun 710 22.8
Hornet 4 Drive 21.4
Hornet Sportabout 18.7
...
carb Lotus Europa 2.0
Ford Pantera L 4.0
Ferrari Dino 6.0
Maserati Bora 8.0
Volvo 142E 2.0
Length: 352, dtype: float64
# Check the row count
mtcars.stack().shape
(352,)
# Expected row count in stack
mtcars.shape[0] * mtcars.shape[1]
352
Pivot table
A powerful implementation of the split-apply-combine idea is the pivot table. Pivot tables allow reshaping the data into useful summaries. They are widely used by Excel users, and you will find them in reports, presentations and analyses of all types. Pandas offers a great deal of flexibility for creating pivot tables using the pivot_table function.
The pivot_table function is essentially a copy of the Excel functionality.
- index - On the left is the index, and you can specify multiple columns there. Each unique value in the first index column will have a separate line. Under each of these lines, there will be a line for each value of the second column in the index, and so on.
- columns - On the top are the columns, again in the order in which they are specified in the parameters to the function. The first column specified is on top, and underneath will be all unique values of that column. This is followed by the next column in the list, and so on.
- values - Inside the table itself are values derived from the columns named in the values parameter. The default aggregation for values is the mean, but you can change it to other functions using aggfunc.
- aggfunc - You can specify any function from any library that returns a single value.
CAUTION
It is really easy to get pivot tables wrong and get something incomprehensible. To create a sensible pivot table, it makes sense to:
- have categorical columns in both index and columns. If you use numerical variables in either, the length of your columns/rows will explode unless the number of unique values is limited.
- have columns in the values parameter that lend themselves to the aggregation function specified. So if you specify a categorical column for values, and ask pandas to show the mean, you will be setting yourself up for disappointment. If you are using a categorical column for values, be sure to use an appropriate aggregation function, e.g. count.
mtcars.head()
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
# Some transformations to help understand pivots better
mtcars.cyl = mtcars.cyl.replace({4: 'Four', 6: 'Six', 8: 'Eight'} )
mtcars.am = mtcars.am.replace({1: 'Automatic', 0: 'Manual'} )
mtcars = mtcars.head(8)
mtcars
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | Six | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | Automatic | 4 | 4 |
Mazda RX4 Wag | 21.0 | Six | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | Automatic | 4 | 4 |
Datsun 710 | 22.8 | Four | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | Automatic | 4 | 1 |
Hornet 4 Drive | 21.4 | Six | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | Manual | 3 | 1 |
Hornet Sportabout | 18.7 | Eight | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | Manual | 3 | 2 |
Valiant | 18.1 | Six | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | Manual | 3 | 1 |
Duster 360 | 14.3 | Eight | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | Manual | 3 | 4 |
Merc 240D | 24.4 | Four | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | Manual | 4 | 2 |
mtcars.pivot_table(index = ['gear','cyl'],
values = ['wt'])
wt | ||
---|---|---|
gear | cyl | |
3 | Eight | 3.5050 |
Six | 3.3375 | |
4 | Four | 2.7550 |
Six | 2.7475 |
mtcars.pivot_table(index = ['am', 'gear'],
columns = ['cyl'],
values = ['wt'])
wt | ||||
---|---|---|---|---|
cyl | Eight | Four | Six | |
am | gear | |||
Automatic | 4 | NaN | 2.32 | 2.7475 |
Manual | 3 | 3.505 | NaN | 3.3375 |
4 | NaN | 3.19 | NaN |
mtcars.pivot_table(index = ['am', 'gear'],
columns = ['cyl'],
values = ['wt'],
aggfunc = ['mean', 'count', 'median', 'sum'])
mean | count | median | sum | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
wt | wt | wt | wt | ||||||||||
cyl | Eight | Four | Six | Eight | Four | Six | Eight | Four | Six | Eight | Four | Six | |
am | gear | ||||||||||||
Automatic | 4 | NaN | 2.32 | 2.7475 | NaN | 1.0 | 2.0 | NaN | 2.32 | 2.7475 | NaN | 2.32 | 5.495 |
Manual | 3 | 3.505 | NaN | 3.3375 | 2.0 | NaN | 2.0 | 3.505 | NaN | 3.3375 | 7.01 | NaN | 6.675 |
4 | NaN | 3.19 | NaN | NaN | 1.0 | NaN | NaN | 3.19 | NaN | NaN | 3.19 | NaN |
diamonds = sns.load_dataset('diamonds')
diamonds.pivot_table(index = ['clarity', 'cut'],
columns = ['color'],
values = ['depth', 'price', 'x'],
aggfunc = {'depth': np.mean,
'price': [min, max, np.median],
'x': np.median}
)
depth | price | x | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | max | ... | min | median | ||||||||||||||||||
color | D | E | F | G | H | I | J | D | E | F | ... | H | I | J | D | E | F | G | H | I | J | |
clarity | cut | |||||||||||||||||||||
IF | Ideal | 61.496429 | 61.716456 | 61.614925 | 61.663951 | 61.557522 | 61.751579 | 61.956000 | 17590.0 | 18700.0 | 18435.0 | ... | 468.0 | 587.0 | 489.0 | 5.315 | 4.430 | 4.400 | 4.510 | 4.610 | 4.570 | 4.710 |
Premium | 61.070000 | 60.859259 | 61.112903 | 60.904598 | 61.290000 | 61.078261 | 61.458333 | 18279.0 | 17663.0 | 18102.0 | ... | 739.0 | 631.0 | 533.0 | 6.100 | 4.640 | 4.390 | 4.640 | 4.440 | 4.830 | 6.570 | |
Very Good | 61.513043 | 61.160465 | 61.123881 | 61.470886 | 61.858621 | 61.278947 | 61.387500 | 18542.0 | 12895.0 | 18552.0 | ... | 369.0 | 673.0 | 529.0 | 6.170 | 4.770 | 4.730 | 4.800 | 4.710 | 5.490 | 4.715 | |
Good | 60.877778 | 61.811111 | 60.620000 | 61.509091 | 61.975000 | 62.150000 | 62.466667 | 17499.0 | 6804.0 | 9867.0 | ... | 1440.0 | 631.0 | 827.0 | 6.260 | 4.280 | 5.120 | 5.025 | 6.080 | 4.755 | 5.025 | |
Fair | 60.766667 | NaN | 58.925000 | 61.300000 | NaN | NaN | NaN | 2211.0 | NaN | 3205.0 | ... | NaN | NaN | NaN | 4.680 | NaN | 5.285 | 4.905 | NaN | NaN | NaN | |
VVS1 | Ideal | 61.710417 | 61.608358 | 61.649545 | 61.667508 | 61.720552 | 61.794972 | 61.844828 | 16253.0 | 16256.0 | 18682.0 | ... | 449.0 | 414.0 | 461.0 | 4.730 | 4.490 | 4.670 | 4.760 | 4.765 | 4.880 | 4.890 |
Premium | 61.182500 | 61.219048 | 61.121250 | 61.060234 | 61.353571 | 61.627381 | 61.754167 | 17496.0 | 14952.0 | 14196.0 | ... | 432.0 | 414.0 | 775.0 | 4.775 | 4.510 | 4.825 | 4.740 | 4.430 | 4.800 | 7.145 | |
Very Good | 61.675000 | 61.504118 | 61.545977 | 61.586316 | 61.980000 | 62.165217 | 61.684211 | 17932.0 | 15878.0 | 18777.0 | ... | 434.0 | 336.0 | 544.0 | 4.685 | 4.240 | 4.505 | 4.620 | 4.690 | 4.910 | 5.700 | |
Good | 61.653846 | 61.525581 | 62.291429 | 61.987805 | 62.477419 | 62.990909 | 63.500000 | 8239.0 | 10696.0 | 11182.0 | ... | 401.0 | 552.0 | 4633.0 | 4.680 | 4.450 | 4.470 | 4.860 | 4.430 | 5.295 | 6.290 | |
Fair | 61.666667 | 59.600000 | 59.100000 | 60.066667 | 56.500000 | 63.500000 | 67.600000 | 10752.0 | 8529.0 | 12648.0 | ... | 4115.0 | 4194.0 | 1691.0 | 4.920 | 5.340 | 4.850 | 5.670 | 6.380 | 5.980 | 5.560 | |
VVS2 | Ideal | 61.584859 | 61.681460 | 61.646923 | 61.692377 | 61.753633 | 61.883708 | 61.759259 | 16130.0 | 18188.0 | 18614.0 | ... | 442.0 | 412.0 | 413.0 | 4.770 | 4.710 | 5.120 | 5.180 | 4.670 | 5.110 | 5.760 |
Premium | 61.024468 | 61.076860 | 61.277397 | 61.297091 | 61.496610 | 61.446341 | 61.435294 | 17216.0 | 17667.0 | 17203.0 | ... | 486.0 | 526.0 | 778.0 | 4.860 | 4.680 | 5.165 | 5.150 | 4.570 | 4.750 | 7.115 | |
Very Good | 61.328369 | 61.497315 | 61.541767 | 61.821523 | 61.895862 | 61.957746 | 62.410345 | 17545.0 | 17689.0 | 17317.0 | ... | 378.0 | 427.0 | 336.0 | 4.570 | 4.280 | 4.830 | 5.100 | 4.670 | 5.600 | 6.860 | |
Good | 62.284000 | 62.192308 | 61.824000 | 62.625333 | 62.562222 | 62.500000 | 61.661538 | 8943.0 | 17449.0 | 14654.0 | ... | 440.0 | 579.0 | 375.0 | 4.740 | 4.895 | 5.300 | 5.060 | 5.220 | 5.625 | 6.340 | |
Fair | 61.677778 | 60.623077 | 62.610000 | 64.376471 | 63.600000 | 63.400000 | 66.000000 | 10562.0 | 7918.0 | 16364.0 | ... | 922.0 | 1401.0 | 2998.0 | 4.950 | 5.270 | 5.200 | 5.430 | 6.030 | 5.780 | 6.290 | |
VS1 | Ideal | 61.620228 | 61.638449 | 61.660065 | 61.696642 | 61.789293 | 61.813971 | 61.835323 | 17659.0 | 18729.0 | 18780.0 | ... | 423.0 | 358.0 | 340.0 | 4.880 | 4.770 | 5.280 | 5.310 | 5.230 | 5.750 | 6.150 |
Premium | 61.132824 | 61.119863 | 61.197241 | 61.419965 | 61.398512 | 61.297285 | 61.565359 | 17936.0 | 17552.0 | 18598.0 | ... | 382.0 | 355.0 | 394.0 | 5.630 | 5.135 | 5.730 | 5.270 | 5.695 | 6.500 | 6.880 | |
Very Good | 61.553143 | 61.593174 | 61.495222 | 61.701620 | 62.004669 | 61.947805 | 62.024167 | 16750.0 | 16988.0 | 17685.0 | ... | 338.0 | 397.0 | 394.0 | 5.120 | 5.200 | 5.560 | 5.295 | 5.750 | 6.120 | 6.175 | |
Good | 61.597674 | 61.602247 | 61.317424 | 62.446711 | 62.277922 | 62.369903 | 62.528846 | 17111.0 | 17400.0 | 17330.0 | ... | 435.0 | 457.0 | 394.0 | 5.610 | 5.670 | 5.330 | 5.915 | 5.670 | 6.060 | 5.705 | |
Fair | 63.160000 | 61.371429 | 62.430303 | 63.353333 | 63.309375 | 62.796000 | 63.675000 | 7083.0 | 15584.0 | 17995.0 | ... | 1134.0 | 735.0 | 949.0 | 5.560 | 5.435 | 5.940 | 5.620 | 6.100 | 6.000 | 6.210 | |
VS2 | Ideal | 61.688478 | 61.717077 | 61.726394 | 61.726813 | 61.804317 | 61.778082 | 61.734914 | 18318.0 | 17825.0 | 18421.0 | ... | 367.0 | 371.0 | 384.0 | 4.815 | 5.090 | 5.170 | 5.655 | 5.690 | 5.920 | 6.480 |
Premium | 61.146313 | 61.259459 | 61.303231 | 61.287933 | 61.324624 | 61.296825 | 61.381683 | 16921.0 | 18342.0 | 18791.0 | ... | 471.0 | 334.0 | 368.0 | 5.120 | 5.170 | 5.390 | 5.860 | 6.495 | 6.920 | 6.900 | |
Very Good | 61.968285 | 61.782903 | 61.807082 | 61.901670 | 61.913564 | 61.715693 | 61.868478 | 17153.0 | 18557.0 | 18430.0 | ... | 376.0 | 379.0 | 357.0 | 5.160 | 5.300 | 5.640 | 5.870 | 6.125 | 6.285 | 6.560 | |
Good | 62.758654 | 61.877500 | 62.487500 | 62.365104 | 62.675362 | 62.107273 | 62.346667 | 17760.0 | 15385.0 | 17597.0 | ... | 470.0 | 435.0 | 368.0 | 5.560 | 5.685 | 5.685 | 6.015 | 6.025 | 6.310 | 6.525 | |
Fair | 62.684000 | 64.476190 | 63.577358 | 63.880000 | 63.960976 | 62.384375 | 63.973913 | 15152.0 | 12829.0 | 13853.0 | ... | 704.0 | 855.0 | 416.0 | 6.040 | 5.430 | 5.820 | 6.080 | 6.090 | 6.045 | 6.080 | |
SI1 | Ideal | 61.736179 | 61.713708 | 61.669079 | 61.717424 | 61.763041 | 61.791468 | 61.849794 | 16575.0 | 18193.0 | 18306.0 | ... | 357.0 | 382.0 | 367.0 | 5.160 | 5.375 | 5.730 | 5.720 | 6.420 | 6.470 | 6.640 |
Premium | 61.254317 | 61.229153 | 61.346875 | 61.340106 | 61.332824 | 61.318256 | 61.306699 | 17776.0 | 16957.0 | 18735.0 | ... | 421.0 | 394.0 | 363.0 | 5.300 | 5.680 | 5.940 | 6.160 | 6.580 | 6.780 | 6.880 | |
Very Good | 61.822470 | 61.947764 | 61.942039 | 61.963502 | 61.990676 | 62.075978 | 61.873626 | 16286.0 | 18731.0 | 18759.0 | ... | 337.0 | 382.0 | 351.0 | 5.650 | 5.660 | 5.780 | 5.695 | 6.350 | 6.425 | 6.460 | |
Good | 62.755696 | 62.754085 | 62.499267 | 62.896618 | 62.585957 | 62.825455 | 62.496591 | 18468.0 | 18027.0 | 18376.0 | ... | 402.0 | 377.0 | 339.0 | 5.590 | 5.600 | 5.750 | 6.070 | 6.180 | 6.230 | 6.430 | |
Fair | 64.634483 | 63.226154 | 63.230120 | 64.513043 | 64.488000 | 63.883333 | 63.010714 | 16386.0 | 15330.0 | 16280.0 | ... | 659.0 | 1697.0 | 497.0 | 6.080 | 6.060 | 6.060 | 6.070 | 6.260 | 6.365 | 6.535 | |
SI2 | Ideal | 61.673876 | 61.680171 | 61.708830 | 61.732510 | 61.627111 | 61.751095 | 61.883636 | 18693.0 | 18128.0 | 18578.0 | ... | 362.0 | 348.0 | 344.0 | 5.730 | 6.100 | 6.130 | 6.330 | 6.600 | 6.920 | 6.835 |
Premium | 61.099287 | 61.095376 | 61.174761 | 61.183943 | 61.219194 | 61.305128 | 61.280745 | 18575.0 | 18477.0 | 18784.0 | ... | 368.0 | 500.0 | 405.0 | 6.320 | 6.340 | 6.420 | 6.550 | 6.810 | 7.000 | 7.360 | |
Very Good | 61.743631 | 61.764719 | 61.782216 | 62.011009 | 62.006997 | 61.935500 | 61.835938 | 18526.0 | 18128.0 | 18692.0 | ... | 393.0 | 383.0 | 430.0 | 6.290 | 6.180 | 6.270 | 6.330 | 6.620 | 6.790 | 6.795 | |
Good | 62.063229 | 61.986634 | 62.250746 | 62.544172 | 62.391139 | 62.365432 | 62.388679 | 17094.0 | 18236.0 | 18686.0 | ... | 368.0 | 351.0 | 335.0 | 6.110 | 6.040 | 6.320 | 6.320 | 6.410 | 6.910 | 6.750 | |
Fair | 64.703571 | 63.448718 | 63.834831 | 64.573750 | 64.931868 | 65.564444 | 64.511111 | 16086.0 | 15540.0 | 17405.0 | ... | 1059.0 | 1625.0 | 1362.0 | 6.130 | 6.280 | 6.260 | 6.325 | 6.650 | 7.050 | 6.560 | |
I1 | Ideal | 61.453846 | 61.850000 | 61.588095 | 61.400000 | 61.657895 | 61.729412 | 63.500000 | 13156.0 | 9072.0 | 10685.0 | ... | 3080.0 | 2239.0 | 2370.0 | 6.730 | 6.510 | 6.560 | 6.765 | 6.925 | 6.820 | 7.665 |
Premium | 61.900000 | 60.806667 | 61.150000 | 61.113043 | 61.247826 | 61.291667 | 61.300000 | 6300.0 | 10453.0 | 9967.0 | ... | 452.0 | 1107.0 | 945.0 | 6.645 | 6.525 | 6.465 | 6.645 | 6.620 | 7.100 | 6.900 | |
Very Good | 62.200000 | 61.481818 | 61.561538 | 61.943750 | 61.816667 | 62.100000 | 61.737500 | 3816.0 | 10340.0 | 9789.0 | ... | 2850.0 | 1235.0 | 2048.0 | 6.180 | 6.500 | 6.740 | 6.430 | 7.290 | 7.050 | 7.060 | |
Good | 61.350000 | 61.660870 | 62.889474 | 62.568421 | 61.757143 | 61.622222 | 61.650000 | 6088.0 | 11548.0 | 6686.0 | ... | 1134.0 | 1111.0 | 1945.0 | 6.720 | 7.010 | 6.090 | 6.640 | 6.400 | 6.930 | 6.950 | |
Fair | 65.600000 | 65.644444 | 65.657143 | 65.333962 | 65.759615 | 65.729412 | 66.460870 | 15964.0 | 3692.0 | 7294.0 | ... | 1058.0 | 1014.0 | 1066.0 | 7.325 | 6.180 | 5.640 | 6.170 | 6.930 | 6.660 | 7.430 |
40 rows × 35 columns
# Let us create a dataframe with random variables
np.random.seed(1)
n = 2500
df = pd.DataFrame(
{'state': list(np.random.choice(["New York", "Florida", "California"], size=(n))),
'gender': list(np.random.choice(["Male", "Female"], size=(n), p=[.4, .6])),
'education': list(np.random.choice(["High School", "Undergrad", "Grad"], size=(n))),
'housing': list(np.random.choice(["Rent", "Own"], size=(n))),
'height': list(np.random.randint(140,200,n)),
'weight': list(np.random.randint(100,150,n)),
'income': list(np.random.randint(50,250,n)),
'computers': list(np.random.randint(0,6,n))
})
df
state | gender | education | housing | height | weight | income | computers | |
---|---|---|---|---|---|---|---|---|
0 | Florida | Female | High School | Own | 159 | 102 | 96 | 2 |
1 | New York | Male | High School | Rent | 164 | 125 | 76 | 5 |
2 | New York | Male | Undergrad | Own | 165 | 144 | 113 | 4 |
3 | Florida | Female | Undergrad | Rent | 188 | 128 | 136 | 5 |
4 | Florida | Female | Grad | Rent | 183 | 117 | 170 | 5 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
2495 | New York | Male | High School | Rent | 168 | 122 | 132 | 3 |
2496 | New York | Male | Undergrad | Rent | 156 | 139 | 161 | 5 |
2497 | California | Male | Undergrad | Own | 144 | 113 | 242 | 4 |
2498 | Florida | Female | Undergrad | Own | 186 | 136 | 172 | 1 |
2499 | New York | Female | Undergrad | Rent | 167 | 109 | 161 | 5 |
2500 rows × 8 columns
df.pivot_table(index = ['gender'],
columns = ['education'],
values = ['income'],
aggfunc = ['mean'])
mean | |||
---|---|---|---|
income | |||
education | Grad | High School | Undergrad |
gender | |||
Female | 152.045161 | 147.530364 | 150.780622 |
Male | 150.782609 | 151.890625 | 151.585227 |
df.pivot_table(index = ['state'],
columns = ['education', 'housing'],
values = ['gender', 'computers'],
aggfunc = {'gender': [len], 'computers': [np.median, 'mean']})
computers | gender | |||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
mean | median | len | ||||||||||||||||
education | Grad | High School | Undergrad | Grad | High School | Undergrad | Grad | High School | Undergrad | |||||||||
housing | Own | Rent | Own | Rent | Own | Rent | Own | Rent | Own | Rent | Own | Rent | Own | Rent | Own | Rent | Own | Rent |
state | ||||||||||||||||||
California | 2.659574 | 2.527778 | 2.550000 | 2.446667 | 2.335484 | 2.634615 | 3.0 | 3.0 | 3.0 | 2.0 | 2.0 | 3.0 | 141 | 144 | 120 | 150 | 155 | 156 |
Florida | 2.344538 | 2.647059 | 2.414815 | 2.365672 | 2.674242 | 2.246377 | 2.0 | 3.0 | 2.0 | 2.0 | 3.0 | 2.0 | 119 | 136 | 135 | 134 | 132 | 138 |
New York | 2.524194 | 2.317073 | 2.291667 | 2.564885 | 2.520270 | 2.505882 | 3.0 | 2.0 | 2.0 | 3.0 | 3.0 | 2.0 | 124 | 123 | 144 | 131 | 148 | 170 |
Pivot
Pivot is a simpler version of pivot_table. It cannot apply any aggregation function; it just shows the values of the 'values' columns at the intersection of the 'index' and the 'columns'.
There are three parameters for pivot:
1. index - which columns in the dataframe should be the index. This is optional. If not specified, it uses the index of the dataframe.
2. columns - which dataframe columns should appear on the top as columns in the result. For each entry in the column parameter, it will create a separate column for each unique value of that column. So if 'carb' can be 1, 2 or 4, it will show 1, 2 and 4 on the top.
3. values - which column's values to show at the intersection of index and columns. If there is more than one value at an intersection (even if the multiple values are identical), pivot will throw an error. (For example, in mtcars_small, if you put cyl 4, 6, 8 on the left as the index, am 0, 1 on the top as columns, and mpg as values, some intersections have two cars.)
Pivot can be better than pivot_table as it brings in the value at the intersection of index and columns as-is, which is sometimes exactly what you need, without having to sum, average, or count anything.
mtcars = sm.datasets.get_rdataset('mtcars').data
mtcars.head()
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
mtcars = mtcars.reset_index().rename(columns={'index': 'car'})
mtcars
car | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
2 | Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
5 | Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
6 | Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
7 | Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
8 | Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
9 | Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
10 | Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
11 | Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
12 | Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
13 | Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
14 | Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
15 | Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
16 | Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
17 | Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
18 | Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
19 | Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
20 | Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
21 | Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
22 | AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
23 | Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
24 | Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
25 | Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
26 | Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
27 | Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
28 | Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
29 | Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
30 | Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
31 | Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
mtcars_small = mtcars.iloc[1:8, [0, 1, 2, 4, 8 , 9]]
mtcars_small
car | mpg | cyl | hp | vs | am | |
---|---|---|---|---|---|---|
1 | Mazda RX4 Wag | 21.0 | 6 | 110 | 0 | 1 |
2 | Datsun 710 | 22.8 | 4 | 93 | 1 | 1 |
3 | Hornet 4 Drive | 21.4 | 6 | 110 | 1 | 0 |
4 | Hornet Sportabout | 18.7 | 8 | 175 | 0 | 0 |
5 | Valiant | 18.1 | 6 | 105 | 1 | 0 |
6 | Duster 360 | 14.3 | 8 | 245 | 0 | 0 |
7 | Merc 240D | 24.4 | 4 | 62 | 1 | 0 |
mtcars_small.pivot(index = 'car', columns = 'cyl', values = 'mpg')
cyl | 4 | 6 | 8 |
---|---|---|---|
car | |||
Datsun 710 | 22.8 | NaN | NaN |
Duster 360 | NaN | NaN | 14.3 |
Hornet 4 Drive | NaN | 21.4 | NaN |
Hornet Sportabout | NaN | NaN | 18.7 |
Mazda RX4 Wag | NaN | 21.0 | NaN |
Merc 240D | 24.4 | NaN | NaN |
Valiant | NaN | 18.1 | NaN |
mtcars_small.pivot(index = 'car', columns = 'cyl')
mpg | hp | vs | am | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
cyl | 4 | 6 | 8 | 4 | 6 | 8 | 4 | 6 | 8 | 4 | 6 | 8 |
car | ||||||||||||
Datsun 710 | 22.8 | NaN | NaN | 93.0 | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN |
Duster 360 | NaN | NaN | 14.3 | NaN | NaN | 245.0 | NaN | NaN | 0.0 | NaN | NaN | 0.0 |
Hornet 4 Drive | NaN | 21.4 | NaN | NaN | 110.0 | NaN | NaN | 1.0 | NaN | NaN | 0.0 | NaN |
Hornet Sportabout | NaN | NaN | 18.7 | NaN | NaN | 175.0 | NaN | NaN | 0.0 | NaN | NaN | 0.0 |
Mazda RX4 Wag | NaN | 21.0 | NaN | NaN | 110.0 | NaN | NaN | 0.0 | NaN | NaN | 1.0 | NaN |
Merc 240D | 24.4 | NaN | NaN | 62.0 | NaN | NaN | 1.0 | NaN | NaN | 0.0 | NaN | NaN |
Valiant | NaN | 18.1 | NaN | NaN | 105.0 | NaN | NaN | 1.0 | NaN | NaN | 0.0 | NaN |
mtcars_small.pivot(index = 'car', columns = ['am'], values=['mpg', 'vs'])
mpg | vs | |||
---|---|---|---|---|
am | 0 | 1 | 0 | 1 |
car | ||||
Datsun 710 | NaN | 22.8 | NaN | 1.0 |
Duster 360 | 14.3 | NaN | 0.0 | NaN |
Hornet 4 Drive | 21.4 | NaN | 1.0 | NaN |
Hornet Sportabout | 18.7 | NaN | 0.0 | NaN |
Mazda RX4 Wag | NaN | 21.0 | NaN | 0.0 |
Merc 240D | 24.4 | NaN | 1.0 | NaN |
Valiant | 18.1 | NaN | 1.0 | NaN |
Sometimes you may wish to use the index of a dataframe directly, as opposed to moving it into its own column first.
df = pd.DataFrame([[0, 1, 2], [2, 3, 5], [6,7,8],],
index=['cat', 'dog', 'cow'],
columns=['weight', 'height', 'age'])
df
weight | height | age | |
---|---|---|---|
cat | 0 | 1 | 2 |
dog | 2 | 3 | 5 |
cow | 6 | 7 | 8 |
df.pivot(index = [ 'weight'], columns = ['height'])
age | |||
---|---|---|---|
height | 1 | 3 | 7 |
weight | |||
0 | 2.0 | NaN | NaN |
2 | NaN | 5.0 | NaN |
6 | NaN | NaN | 8.0 |
# Now also use the native index of the dataframe
df.pivot(index = [df.index, 'weight'], columns = ['height'])
age | ||||
---|---|---|---|---|
height | 1 | 3 | 7 | |
weight | ||||
cat | 0 | 2.0 | NaN | NaN |
cow | 6 | NaN | NaN | 8.0 |
dog | 2 | NaN | 5.0 | NaN |
Now the same thing fails if there are duplicates
df = pd.DataFrame([['cat', 0, 1, 2], ['dog', 2, 3, 5], ['cow', 6,7,8], ['pig', 6,7,8],],
columns=['animal', 'weight', 'height', 'age'])
df
animal | weight | height | age | |
---|---|---|---|---|
0 | cat | 0 | 1 | 2 |
1 | dog | 2 | 3 | 5 |
2 | cow | 6 | 7 | 8 |
3 | pig | 6 | 7 | 8 |
The below will fail as there are duplicates.
df.pivot(index = [ 'weight'], columns = ['height'])
# We consider only the first 3 rows of this new dataframe.
# Look how in the values we have a categorical variable.
df.iloc[:3].pivot(index = 'weight', columns = 'height', values = 'animal')
height | 1 | 3 | 7 |
---|---|---|---|
weight | |||
0 | cat | NaN | NaN |
2 | NaN | dog | NaN |
6 | NaN | NaN | cow |
Crosstab
Crosstab computes a frequency table given an index and columns of categorical variables (as dataframe columns, series, or numpy arrays). It is also possible to specify a values column and an aggfunc, which makes it behave like a pivot_table.
You can pass normalize = True, 'index', or 'columns', and the counts will be normalized by the grand total, by row totals, or by column totals respectively.
df = sns.load_dataset('diamonds')
df
carat | cut | color | clarity | depth | table | price | x | y | z | |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
53935 | 0.72 | Ideal | D | SI1 | 60.8 | 57.0 | 2757 | 5.75 | 5.76 | 3.50 |
53936 | 0.72 | Good | D | SI1 | 63.1 | 55.0 | 2757 | 5.69 | 5.75 | 3.61 |
53937 | 0.70 | Very Good | D | SI1 | 62.8 | 60.0 | 2757 | 5.66 | 5.68 | 3.56 |
53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 |
53939 | 0.75 | Ideal | D | SI2 | 62.2 | 55.0 | 2757 | 5.83 | 5.87 | 3.64 |
53940 rows × 10 columns
# Basic
pd.crosstab(df.cut, df.color)
color | D | E | F | G | H | I | J |
---|---|---|---|---|---|---|---|
cut | |||||||
Ideal | 2834 | 3903 | 3826 | 4884 | 3115 | 2093 | 896 |
Premium | 1603 | 2337 | 2331 | 2924 | 2360 | 1428 | 808 |
Very Good | 1513 | 2400 | 2164 | 2299 | 1824 | 1204 | 678 |
Good | 662 | 933 | 909 | 871 | 702 | 522 | 307 |
Fair | 163 | 224 | 312 | 314 | 303 | 175 | 119 |
# With margins
pd.crosstab(df.cut, df.color, margins = True)
color | D | E | F | G | H | I | J | All |
---|---|---|---|---|---|---|---|---|
cut | ||||||||
Ideal | 2834 | 3903 | 3826 | 4884 | 3115 | 2093 | 896 | 21551 |
Premium | 1603 | 2337 | 2331 | 2924 | 2360 | 1428 | 808 | 13791 |
Very Good | 1513 | 2400 | 2164 | 2299 | 1824 | 1204 | 678 | 12082 |
Good | 662 | 933 | 909 | 871 | 702 | 522 | 307 | 4906 |
Fair | 163 | 224 | 312 | 314 | 303 | 175 | 119 | 1610 |
All | 6775 | 9797 | 9542 | 11292 | 8304 | 5422 | 2808 | 53940 |
# With margins and normalized
pd.crosstab(df.cut, df.color, margins = True, normalize = True)
color | D | E | F | G | H | I | J | All |
---|---|---|---|---|---|---|---|---|
cut | ||||||||
Ideal | 0.052540 | 0.072358 | 0.070931 | 0.090545 | 0.057749 | 0.038802 | 0.016611 | 0.399537 |
Premium | 0.029718 | 0.043326 | 0.043215 | 0.054208 | 0.043752 | 0.026474 | 0.014980 | 0.255673 |
Very Good | 0.028050 | 0.044494 | 0.040119 | 0.042621 | 0.033815 | 0.022321 | 0.012570 | 0.223990 |
Good | 0.012273 | 0.017297 | 0.016852 | 0.016148 | 0.013014 | 0.009677 | 0.005692 | 0.090953 |
Fair | 0.003022 | 0.004153 | 0.005784 | 0.005821 | 0.005617 | 0.003244 | 0.002206 | 0.029848 |
All | 0.125603 | 0.181628 | 0.176900 | 0.209344 | 0.153949 | 0.100519 | 0.052058 | 1.000000 |
# Normalized by index. Rows total to 1. See how the total column 'All' has
# disappeared from rows. But it has remained for the columns
pd.crosstab(df.cut, df.color, margins = True, normalize = 'index')
color | D | E | F | G | H | I | J |
---|---|---|---|---|---|---|---|
cut | |||||||
Ideal | 0.131502 | 0.181105 | 0.177532 | 0.226625 | 0.144541 | 0.097118 | 0.041576 |
Premium | 0.116235 | 0.169458 | 0.169023 | 0.212022 | 0.171126 | 0.103546 | 0.058589 |
Very Good | 0.125228 | 0.198643 | 0.179109 | 0.190283 | 0.150968 | 0.099652 | 0.056117 |
Good | 0.134937 | 0.190175 | 0.185283 | 0.177538 | 0.143090 | 0.106400 | 0.062576 |
Fair | 0.101242 | 0.139130 | 0.193789 | 0.195031 | 0.188199 | 0.108696 | 0.073913 |
All | 0.125603 | 0.181628 | 0.176900 | 0.209344 | 0.153949 | 0.100519 | 0.052058 |
# Normalized by columns
pd.crosstab(df.cut, df.color, margins = True, normalize = 'columns')
color | D | E | F | G | H | I | J | All |
---|---|---|---|---|---|---|---|---|
cut | ||||||||
Ideal | 0.418303 | 0.398387 | 0.400964 | 0.432519 | 0.375120 | 0.386020 | 0.319088 | 0.399537 |
Premium | 0.236605 | 0.238542 | 0.244288 | 0.258944 | 0.284200 | 0.263371 | 0.287749 | 0.255673 |
Very Good | 0.223321 | 0.244973 | 0.226787 | 0.203595 | 0.219653 | 0.222058 | 0.241453 | 0.223990 |
Good | 0.097712 | 0.095233 | 0.095263 | 0.077134 | 0.084538 | 0.096274 | 0.109330 | 0.090953 |
Fair | 0.024059 | 0.022864 | 0.032698 | 0.027807 | 0.036488 | 0.032276 | 0.042379 | 0.029848 |
# You can also pass multiple series for both the index and columns
pd.crosstab([df.cut, df.color], [df.clarity])
clarity | IF | VVS1 | VVS2 | VS1 | VS2 | SI1 | SI2 | I1 | |
---|---|---|---|---|---|---|---|---|---|
cut | color | ||||||||
Ideal | D | 28 | 144 | 284 | 351 | 920 | 738 | 356 | 13 |
E | 79 | 335 | 507 | 593 | 1136 | 766 | 469 | 18 | |
F | 268 | 440 | 520 | 616 | 879 | 608 | 453 | 42 | |
G | 491 | 594 | 774 | 953 | 910 | 660 | 486 | 16 | |
H | 226 | 326 | 289 | 467 | 556 | 763 | 450 | 38 | |
I | 95 | 179 | 178 | 408 | 438 | 504 | 274 | 17 | |
J | 25 | 29 | 54 | 201 | 232 | 243 | 110 | 2 | |
Premium | D | 10 | 40 | 94 | 131 | 339 | 556 | 421 | 12 |
E | 27 | 105 | 121 | 292 | 629 | 614 | 519 | 30 | |
F | 31 | 80 | 146 | 290 | 619 | 608 | 523 | 34 | |
G | 87 | 171 | 275 | 566 | 721 | 566 | 492 | 46 | |
H | 40 | 112 | 118 | 336 | 532 | 655 | 521 | 46 | |
I | 23 | 84 | 82 | 221 | 315 | 367 | 312 | 24 | |
J | 12 | 24 | 34 | 153 | 202 | 209 | 161 | 13 | |
Very Good | D | 23 | 52 | 141 | 175 | 309 | 494 | 314 | 5 |
E | 43 | 170 | 298 | 293 | 503 | 626 | 445 | 22 | |
F | 67 | 174 | 249 | 293 | 466 | 559 | 343 | 13 | |
G | 79 | 190 | 302 | 432 | 479 | 474 | 327 | 16 | |
H | 29 | 115 | 145 | 257 | 376 | 547 | 343 | 12 | |
I | 19 | 69 | 71 | 205 | 274 | 358 | 200 | 8 | |
J | 8 | 19 | 29 | 120 | 184 | 182 | 128 | 8 | |
Good | D | 9 | 13 | 25 | 43 | 104 | 237 | 223 | 8 |
E | 9 | 43 | 52 | 89 | 160 | 355 | 202 | 23 | |
F | 15 | 35 | 50 | 132 | 184 | 273 | 201 | 19 | |
G | 22 | 41 | 75 | 152 | 192 | 207 | 163 | 19 | |
H | 4 | 31 | 45 | 77 | 138 | 235 | 158 | 14 | |
I | 6 | 22 | 26 | 103 | 110 | 165 | 81 | 9 | |
J | 6 | 1 | 13 | 52 | 90 | 88 | 53 | 4 | |
Fair | D | 3 | 3 | 9 | 5 | 25 | 58 | 56 | 4 |
E | 0 | 3 | 13 | 14 | 42 | 65 | 78 | 9 | |
F | 4 | 5 | 10 | 33 | 53 | 83 | 89 | 35 | |
G | 2 | 3 | 17 | 45 | 45 | 69 | 80 | 53 | |
H | 0 | 1 | 11 | 32 | 41 | 75 | 91 | 52 | |
I | 0 | 1 | 8 | 25 | 32 | 30 | 45 | 34 | |
J | 0 | 1 | 1 | 16 | 23 | 28 | 27 | 23 |
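As mentioned above, crosstab can also aggregate a values column the way pivot_table does. A minimal sketch: the mean price for each cut/color combination.
# Crosstab with an aggregation: mean price at each cut/color intersection
pd.crosstab(df.cut, df.color, values=df.price, aggfunc='mean')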
Melt
Melt is similar to stack, but unlike stack it returns a dataframe, not a series with a multi-level index. A big advantage is that, unlike stack, you can freeze some of the columns and stack the rest.
In melt, you specify id_vars (index variables) - these are the columns that stay untouched, and then the value_vars, that get stacked. If value_vars are not specified, all columns other than id_vars get stacked.
The opposite of melt is pivot. Pivot applies no aggfunc; it just lists the values at the intersection of the categorical variables it picks up from a melted dataset.
# Let us create a dataframe with random variables
np.random.seed(1)
n = 10
df = pd.DataFrame(
{'state': list(np.random.choice(["New York", "Florida", "California"], size=(n))),
'gender': list(np.random.choice(["Male", "Female"], size=(n), p=[.4, .6])),
'education': list(np.random.choice(["High School", "Undergrad", "Grad"], size=(n))),
'housing': list(np.random.choice(["Rent", "Own"], size=(n))),
'height': list(np.random.randint(140,200,n)),
'weight': list(np.random.randint(100,150,n)),
'income': list(np.random.randint(50,250,n)),
'computers': list(np.random.randint(0,6,n))
})
df
state | gender | education | housing | height | weight | income | computers | |
---|---|---|---|---|---|---|---|---|
0 | Florida | Female | Undergrad | Own | 197 | 104 | 65 | 1 |
1 | New York | Female | Grad | Rent | 148 | 124 | 114 | 3 |
2 | New York | Female | High School | Rent | 170 | 149 | 246 | 5 |
3 | Florida | Male | Grad | Own | 147 | 143 | 75 | 4 |
4 | Florida | Female | Undergrad | Rent | 143 | 112 | 161 | 3 |
5 | New York | Female | Undergrad | Rent | 146 | 126 | 185 | 5 |
6 | New York | Male | Undergrad | Own | 161 | 116 | 76 | 1 |
7 | Florida | Female | Undergrad | Own | 189 | 145 | 203 | 3 |
8 | New York | Female | Grad | Own | 197 | 141 | 154 | 0 |
9 | Florida | Female | Undergrad | Rent | 143 | 118 | 72 | 0 |
# Just to demonstrate, melt-ing the first five rows of the df
df.head().melt(id_vars = ['state', 'gender'], value_vars = ['computers', 'income'])
state | gender | variable | value | |
---|---|---|---|---|
0 | Florida | Female | computers | 1 |
1 | New York | Female | computers | 3 |
2 | New York | Female | computers | 5 |
3 | Florida | Male | computers | 4 |
4 | Florida | Female | computers | 3 |
5 | Florida | Female | income | 65 |
6 | New York | Female | income | 114 |
7 | New York | Female | income | 246 |
8 | Florida | Male | income | 75 |
9 | Florida | Female | income | 161 |
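To see how pivot undoes a melt, here is a small sketch using a made-up two-row dataframe (the names and measures are purely illustrative):
# A tiny wide dataframe
tiny = pd.DataFrame({'name': ['Ann', 'Bob'],
                     'height': [160, 175],
                     'weight': [55, 80]})
# Wide to long: one row per (name, measure) pair
long_form = tiny.melt(id_vars='name', var_name='measure', value_name='value')
# Long back to wide: pivot lists the values as-is, no aggregation
long_form.pivot(index='name', columns='measure', values='value')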
Groupby
Groupby returns a groupby object, to which aggregation functions can then be applied.
- Groupby does the 'split' part in the split-apply-combine framework.
- You do the 'apply' using an aggregation function against the groupby object.
- 'Combine' doesn't need to be done separately as it is done automatically after the aggregation function is applied.
# Simple example
df.groupby(['state', 'gender']).agg({"height": "mean", "weight": "sum", "housing": "count", "education": "count"})
height | weight | housing | education | ||
---|---|---|---|---|---|
state | gender | ||||
Florida | Female | 168.00 | 479 | 4 | 4 |
Male | 147.00 | 143 | 1 | 1 | |
New York | Female | 165.25 | 540 | 4 | 4 |
Male | 161.00 | 116 | 1 | 1 |
# Aggregation is done only for the columns for which an aggregation function is specified
df.groupby(['state', 'gender']).agg({"height": "mean", "weight": "sum", "housing": "count"})
height | weight | housing | ||
---|---|---|---|---|
state | gender | |||
Florida | Female | 168.00 | 479 | 4 |
Male | 147.00 | 143 | 1 | |
New York | Female | 165.25 | 540 | 4 |
Male | 161.00 | 116 | 1 |
df.groupby(['state', 'gender']).head(1).agg({"height": "mean", "weight": "sum", "housing": "count", "education": "count"})
height 163.25
weight 487.00
housing 4.00
education 4.00
dtype: float64
group = df.groupby(['state', 'gender'])
# How to look at groups in a groupby:
list(group)
[(('Florida', 'Female'),
state gender education housing height weight income computers
0 Florida Female Undergrad Own 197 104 65 1
4 Florida Female Undergrad Rent 143 112 161 3
7 Florida Female Undergrad Own 189 145 203 3
9 Florida Female Undergrad Rent 143 118 72 0),
(('Florida', 'Male'),
state gender education housing height weight income computers
3 Florida Male Grad Own 147 143 75 4),
(('New York', 'Female'),
state gender education housing height weight income computers
1 New York Female Grad Rent 148 124 114 3
2 New York Female High School Rent 170 149 246 5
5 New York Female Undergrad Rent 146 126 185 5
8 New York Female Grad Own 197 141 154 0),
(('New York', 'Male'),
state gender education housing height weight income computers
6 New York Male Undergrad Own 161 116 76 1)]
list(group)[0][1]
state | gender | education | housing | height | weight | income | computers | |
---|---|---|---|---|---|---|---|---|
0 | Florida | Female | Undergrad | Own | 197 | 104 | 65 | 1 |
4 | Florida | Female | Undergrad | Rent | 143 | 112 | 161 | 3 |
7 | Florida | Female | Undergrad | Own | 189 | 145 | 203 | 3 |
9 | Florida | Female | Undergrad | Rent | 143 | 118 | 72 | 0 |
type(group)
pandas.core.groupby.generic.DataFrameGroupBy
# Look at groups in a groupby - more elegant version:
for group_name, combined in group:
print(group_name)
print(combined)
print('\n')
('Florida', 'Female')
state gender education housing height weight income computers
0 Florida Female Undergrad Own 197 104 65 1
4 Florida Female Undergrad Rent 143 112 161 3
7 Florida Female Undergrad Own 189 145 203 3
9 Florida Female Undergrad Rent 143 118 72 0
('Florida', 'Male')
state gender education housing height weight income computers
3 Florida Male Grad Own 147 143 75 4
('New York', 'Female')
state gender education housing height weight income computers
1 New York Female Grad Rent 148 124 114 3
2 New York Female High School Rent 170 149 246 5
5 New York Female Undergrad Rent 146 126 185 5
8 New York Female Grad Own 197 141 154 0
('New York', 'Male')
state gender education housing height weight income computers
6 New York Male Undergrad Own 161 116 76 1
# How to look at a specific group - the group categorical values have to be entered as a tuple
group.get_group(('New York', 'Male'))
state | gender | education | housing | height | weight | income | computers | |
---|---|---|---|---|---|---|---|---|
6 | New York | Male | Undergrad | Own | 161 | 116 | 76 | 1 |
# get the first row of each group
group.first()
education | housing | height | weight | income | computers | ||
---|---|---|---|---|---|---|---|
state | gender | ||||||
Florida | Female | Undergrad | Own | 197 | 104 | 65 | 1 |
Male | Grad | Own | 147 | 143 | 75 | 4 | |
New York | Female | Grad | Rent | 148 | 124 | 114 | 3 |
Male | Undergrad | Own | 161 | 116 | 76 | 1 |
# Get the first record of each group.
# For this to be useful, sort the original df by the right columns before groupby.
group.head(1)
state | gender | education | housing | height | weight | income | computers | |
---|---|---|---|---|---|---|---|---|
0 | Florida | Female | Undergrad | Own | 197 | 104 | 65 | 1 |
1 | New York | Female | Grad | Rent | 148 | 124 | 114 | 3 |
3 | Florida | Male | Grad | Own | 147 | 143 | 75 | 4 |
6 | New York | Male | Undergrad | Own | 161 | 116 | 76 | 1 |
# Summary stats for all groups
group.describe()
height | weight | ... | income | computers | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | ||
state | gender | |||||||||||||||||||||
Florida | Female | 4.0 | 168.00 | 29.051678 | 143.0 | 143.0 | 166.0 | 191.00 | 197.0 | 4.0 | 119.75 | ... | 171.50 | 203.0 | 4.0 | 1.75 | 1.500000 | 0.0 | 0.75 | 2.0 | 3.0 | 3.0 |
Male | 1.0 | 147.00 | NaN | 147.0 | 147.0 | 147.0 | 147.00 | 147.0 | 1.0 | 143.00 | ... | 75.00 | 75.0 | 1.0 | 4.00 | NaN | 4.0 | 4.00 | 4.0 | 4.0 | 4.0 | |
New York | Female | 4.0 | 165.25 | 23.796008 | 146.0 | 147.5 | 159.0 | 176.75 | 197.0 | 4.0 | 135.00 | ... | 200.25 | 246.0 | 4.0 | 3.25 | 2.362908 | 0.0 | 2.25 | 4.0 | 5.0 | 5.0 |
Male | 1.0 | 161.00 | NaN | 161.0 | 161.0 | 161.0 | 161.00 | 161.0 | 1.0 | 116.00 | ... | 76.00 | 76.0 | 1.0 | 1.00 | NaN | 1.0 | 1.00 | 1.0 | 1.0 | 1.0 |
4 rows × 32 columns
# Or, if you prefer this
group.describe().reset_index()
state | gender | height | ... | income | computers | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |||
0 | Florida | Female | 4.0 | 168.00 | 29.051678 | 143.0 | 143.0 | 166.0 | 191.00 | 197.0 | ... | 171.50 | 203.0 | 4.0 | 1.75 | 1.500000 | 0.0 | 0.75 | 2.0 | 3.0 | 3.0 |
1 | Florida | Male | 1.0 | 147.00 | NaN | 147.0 | 147.0 | 147.0 | 147.00 | 147.0 | ... | 75.00 | 75.0 | 1.0 | 4.00 | NaN | 4.0 | 4.00 | 4.0 | 4.0 | 4.0 |
2 | New York | Female | 4.0 | 165.25 | 23.796008 | 146.0 | 147.5 | 159.0 | 176.75 | 197.0 | ... | 200.25 | 246.0 | 4.0 | 3.25 | 2.362908 | 0.0 | 2.25 | 4.0 | 5.0 | 5.0 |
3 | New York | Male | 1.0 | 161.00 | NaN | 161.0 | 161.0 | 161.0 | 161.00 | 161.0 | ... | 76.00 | 76.0 | 1.0 | 1.00 | NaN | 1.0 | 1.00 | 1.0 | 1.0 | 1.0 |
4 rows × 34 columns
# Get the count of rows in each group.
# You can pd.DataFrame it, and reset_index() to clean up
group.size()
state gender
Florida Female 4
Male 1
New York Female 4
Male 1
dtype: int64
# Getting min and max values in each group using groupby
mtcars = sm.datasets.get_rdataset('mtcars').data
mtcars
mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Merc 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Merc 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
Merc 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | 1 | 1 | 4 | 1 |
Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
Camaro Z28 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | 1 | 1 | 5 | 2 |
Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
Volvo 142E | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
mtcars.groupby(['cyl']).agg('mean')
mpg | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|
cyl | ||||||||||
4 | 26.663636 | 105.136364 | 82.636364 | 4.070909 | 2.285727 | 19.137273 | 0.909091 | 0.727273 | 4.090909 | 1.545455 |
6 | 19.742857 | 183.314286 | 122.285714 | 3.585714 | 3.117143 | 17.977143 | 0.571429 | 0.428571 | 3.857143 | 3.428571 |
8 | 15.100000 | 353.100000 | 209.214286 | 3.229286 | 3.999214 | 16.772143 | 0.000000 | 0.142857 | 3.285714 | 3.500000 |
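The comment above promised min and max values; before moving on to idxmin() and idxmax(), here is a quick sketch of computing the values themselves with agg(). Restricting to mpg and hp is just to keep the output readable:
# Minimum and maximum of selected columns within each cylinder group
mtcars.groupby('cyl')[['mpg', 'hp']].agg(['min', 'max'])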
# See which rows have the min values in each column of a groupby. The index of the row is returned
# Which in this case is happily the car name, not an integer
mtcars.groupby(['cyl']).idxmin()
mpg | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|
cyl | ||||||||||
4 | Volvo 142E | Toyota Corolla | Honda Civic | Merc 240D | Lotus Europa | Porsche 914-2 | Porsche 914-2 | Merc 240D | Toyota Corona | Datsun 710 |
6 | Merc 280C | Ferrari Dino | Valiant | Valiant | Mazda RX4 | Ferrari Dino | Mazda RX4 | Hornet 4 Drive | Hornet 4 Drive | Hornet 4 Drive |
8 | Cadillac Fleetwood | Merc 450SE | Dodge Challenger | Dodge Challenger | Ford Pantera L | Ford Pantera L | Hornet Sportabout | Hornet Sportabout | Hornet Sportabout | Hornet Sportabout |
mtcars.groupby(['cyl']).idxmax()
mpg | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|
cyl | ||||||||||
4 | Toyota Corolla | Merc 240D | Lotus Europa | Honda Civic | Merc 240D | Merc 230 | Datsun 710 | Datsun 710 | Porsche 914-2 | Merc 240D |
6 | Hornet 4 Drive | Hornet 4 Drive | Ferrari Dino | Merc 280 | Valiant | Valiant | Hornet 4 Drive | Mazda RX4 | Ferrari Dino | Ferrari Dino |
8 | Pontiac Firebird | Cadillac Fleetwood | Maserati Bora | Ford Pantera L | Lincoln Continental | Merc 450SLC | Hornet Sportabout | Ford Pantera L | Ford Pantera L | Maserati Bora |
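The row labels returned by idxmin() and idxmax() can be fed back into .loc to pull out the full rows. A minimal sketch that picks the most fuel-efficient car in each cylinder group; the variable name best_mpg is just illustrative:
# Use the row labels from idxmax to retrieve the complete rows
best_mpg = mtcars.groupby('cyl')['mpg'].idxmax()
mtcars.loc[best_mpg]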
Renaming columns with groupby
# Continuing the earlier diamonds examples, we rename the aggregated columns created by groupby
diamonds.groupby('cut').agg({"price": "sum",
"clarity": "count"}).rename(columns = {"price": "total_price", "clarity": "diamond_count"})
total_price | diamond_count | |
---|---|---|
cut | ||
Ideal | 74513487 | 21551 |
Premium | 63221498 | 13791 |
Very Good | 48107623 | 12082 |
Good | 19275009 | 4906 |
Fair | 7017600 | 1610 |
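An alternative that avoids the separate rename() call is named aggregation, available in pandas 0.25 and later; a sketch under that assumption:
# Named aggregation: the keyword names become the output column names
diamonds.groupby('cut').agg(total_price=('price', 'sum'),
                            diamond_count=('clarity', 'count'))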
Pandas Profiling
Profiling our toy dataframe
import ydata_profiling
profile = ydata_profiling.ProfileReport(df, title = 'My EDA', minimal=True)
profile.to_file("output.html")
Now check out output.html in your folder. You can right-click it and open it in a browser.
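If you are working inside a Jupyter notebook, the report can also be displayed inline instead of being exported to a file. A brief sketch, assuming the profile object created above and a reasonably recent version of ydata_profiling:
# Render the profile report inline in a Jupyter notebook
profile.to_notebook_iframe()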
Pandas Profiling on the Diamonds Dataset
# Import libraries and the diamonds dataset
import pandas as pd
import numpy as np
import seaborn as sns
import os
import ydata_profiling
import phik
import matplotlib.pyplot as plt
df = sns.load_dataset('diamonds')
profile = ydata_profiling.ProfileReport(df, title = 'My EDA', minimal=True)
profile.to_file("output.html")
With this, we end our discussion on EDA. We have seen how we can analyze data, compute statistics, examine distributions and identify key themes. Since this is a problem that many analysts have to solve every day, there are several libraries devoted to EDA that automate much of the work. We looked at one, ydata_profiling (the successor to pandas_profiling). If you search, you will find several more, and may even find one that works best for your use case.
If you have been able to follow thus far, you are all set to explore any numerical data in tabular form.