Data Preparation

Data Manipulation, Transformation and Cleaning

Reshaping Data

A significant challenge with analytics is that the required data is rarely collected with a view to performing analytics. It is mostly intended to support transaction processing or to manage operations. Data is often ‘dirty’: it can be missing, duplicated, in the wrong format, spread across different data sources or files, have character-set issues, require additional mappings, and so on.

Data analysts spend a great deal of time cleaning and preparing data for analysis. In traditional business intelligence contexts, data cleaning and transformations are referred to as ETL processes (Extract-Transform-Load).

Data scientists often set up defined ‘data pipelines’ – a series of data processing steps that ingest and bring data to a desired format and shape.
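As a rough illustration (the step and column names below are invented for the example), a small pipeline can be expressed in pandas by chaining named cleaning steps with .pipe():

import pandas as pd
import numpy as np

def drop_unused_columns(df):
    # Drop columns we do not need downstream (hypothetical column name)
    return df.drop(columns=['unused_col'])

def fill_missing_amounts(df):
    # Replace missing amounts with zero (hypothetical column name)
    return df.fillna({'amount': 0})

raw = pd.DataFrame({'amount': [120.0, np.nan, 75.5],
                    'unused_col': ['a', 'b', 'c']})

clean = raw.pipe(drop_unused_columns).pipe(fill_missing_amounts)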

During the rest of this chapter, we will see how we can select and filter data, understand data types, add and delete columns in tabular data, replace values, deal with missing values, and more. We will do all of this primarily using pandas, and also a few other libraries.

So what is it that we do when we reshape and clean data? While of course this would almost always depend upon the data and what shape we are trying to get it to, there are several common actions that we have to perform that we should know about.

  • Select columns: reduce the number of data fields we have to deal with by dropping unneeded columns and retaining only those we need.

  • Select rows: filter observations based on some criteria so as to retain only the observations relevant to us.

  • Change data types: Dates or numbers may be formatted as strings, or categories may appear as numbers. We may need to change the data types to suit our needs.

  • Add columns: We may need to insert new calculated columns, or bring in data from other data sources as additional features or columns to our dataset.

  • Reorder or sort columns and rows: We may need to rearrange the columns and rows in our data to support understanding and presentation.

  • Rename fields: to remove spaces or special characters, or to make names more human-readable.

  • Remove duplicates: We may need to remove duplicate observations.

  • Replace values: often we need to change one value in the data to another, for example replacing United Kingdom with UK.

  • Bin numerical data: We may need to convert numerical data to categories by grouping values into bins. For example, we may label homes with 4 to 6 bedrooms as 'Large', converting a numerical column to a binned category.

  • Extract unique values: to understand the data better.

  • Combine: with other data sources so as to enrich the information we already have.

  • Missing values: We might like to fill in missing values for a more complete dataset, or remove data with missing values from our dataset.

  • Summarize: using groupby or pivot functions to summarize or 'elongate' the data so as to make it more suitable for subsequent analysis (also called melting and casting).

In the end, we want to have the capability to twist and shape data in the way we need it for our analysis. Python provides us tools and libraries that give us incredible flexibility in being able to do so.

Missing values deserve a special mention. We will look at length at missing values, where we don't have all the data for every observation. What are we to do in such a case? Should we ignore the observations that are missing any data, and risk losing the data that we do have, or try to compensate for the missing data using some smart thinking, so as to keep the available information?

In the rest of this discussion, we will cover some of these techniques. This is of course not a complete list of everything a data analyst is able to do to reshape and reformat data, for such a list would be impossible. Data manipulation is also closely related to feature engineering, which is discussed in a subsequent chapter.

Before we get started, we will create a random dataframe to play with.

The Setup

Usual library imports

import seaborn as sns
import pandas as pd
import numpy as np
import statsmodels.api as sm
import os

Create a DataFrame

We start with creating a dataframe, which is akin to a spreadsheet.

There are many ways to create a dataframe. When you read a datafile using pandas (for example, using pd.read_csv), a dataframe is automatically created.

But sometimes you may not have a datafile, and you may need to create a dataframe using code. The examples below show several different ways of doing so. The basic construct is pd.DataFrame(data, index, columns). data can be a list, a dictionary, an array, etc. index refers to the names you want to give to the rows; if you don't specify an index, pandas will just number the rows starting from 0. columns refers to the column names you want; if you leave it blank, pandas will again just use numbers starting from zero.

Some examples of creating a dataframe appear below. You can modify them to your use case as required.

# Here we create a dataframe from a dictionary.  First, we define a dictionary.
# Then we supply the dictionary as data to pd.DataFrame.

data = {'state': ['New York', 'Florida', 'Arizona'],
        'year': ['1999', '2000', '2001'],
        'pop': [100, 121, 132]}
#Check the data type

type(data)
dict
## Convert it to a dataframe

mydf = pd.DataFrame(data)
mydf
state year pop
0 New York 1999 100
1 Florida 2000 121
2 Arizona 2001 132
## or another way...

df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
df
A B C D
0 95 71 21 95
1 55 6 35 53
2 29 98 82 11
3 46 16 94 84
4 10 74 53 51
... ... ... ... ...
95 9 40 23 80
96 54 11 33 38
97 51 68 29 40
98 61 64 3 51
99 17 10 91 25

100 rows × 4 columns

Multiple ways to create the same dataframe

Imagine we have the population of the five boroughs of New York in 2020 as follows:

Borough Population
BRONX 1446788
BROOKLYN 2648452
MANHATTAN 1638281
QUEENS 2330295
STATEN ISLAND 487155

We want this information as a dataframe so we can join it with other information we might have on the boroughs.

Try to create a dataframe from the above data. Several ways to do so are listed below.

# See here that we are specifying the index
pop = pd.DataFrame(data = [1446788, 2648452, 1638281, 2330295, 487155], 
             index = ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND'], 
             columns = ['population'])

pop
population
BRONX 1446788
BROOKLYN 2648452
MANHATTAN 1638281
QUEENS 2330295
STATEN ISLAND 487155
# Same thing, but here we keep the borough name as a column
pop = pd.DataFrame(data = {'population': [1446788, 2648452, 1638281, 2330295, 487155], 
             'BOROUGH': ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND']},) 

pop
population BOROUGH
0 1446788 BRONX
1 2648452 BROOKLYN
2 1638281 MANHATTAN
3 2330295 QUEENS
4 487155 STATEN ISLAND
# Yet another way of creating the same dataframe

pop = pd.DataFrame({'population': {'BRONX': 1446788,
  'BROOKLYN': 2648452,
  'MANHATTAN': 1638281,
  'QUEENS': 2330295,
  'STATEN ISLAND': 487155}})
pop
population
BRONX 1446788
BROOKLYN 2648452
MANHATTAN 1638281
QUEENS 2330295
STATEN ISLAND 487155
# Same thing as earlier, but with borough name in the index
pop = pd.DataFrame({'population': [1446788, 2648452, 1638281, 2330295, 487155]}, 
                   index = ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND'],)
pop
population
BRONX 1446788
BROOKLYN 2648452
MANHATTAN 1638281
QUEENS 2330295
STATEN ISLAND 487155
## Get the current working directory

os.getcwd()
'C:\\Users\\user\\Google Drive\\jupyter'

Read the diamonds and mtcars dataframes

The ‘diamonds’ dataset has 50,000+ records, each representing a single diamond. The weight and other attributes are available, as is the price.

The dataset allows us to experiment with a variety of prediction techniques and algorithms. Below are the columns in the dataset, and their description.

Col Description
price price in US dollars ($326--$18,823)
carat weight of the diamond (0.2--5.01)
cut quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color diamond colour, from J (worst) to D (best)
clarity a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x length in mm (0--10.74)
y width in mm (0--58.9)
z depth in mm (0--31.8)
depth total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43--79)
table width of top of diamond relative to widest point (43--95)
diamonds = sns.load_dataset("diamonds")

We also load the mtcars dataset

Description
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Format
A data frame with 32 observations on 11 (numeric) variables.

[, 1]   mpg     Miles/(US) gallon
[, 2]   cyl     Number of cylinders
[, 3]   disp    Displacement (cu.in.)
[, 4]   hp      Gross horsepower
[, 5]   drat    Rear axle ratio
[, 6]   wt      Weight (1000 lbs)
[, 7]   qsec    1/4 mile time
[, 8]   vs      Engine (0 = V-shaped, 1 = straight)
[, 9]   am      Transmission (0 = automatic, 1 = manual)
[,10]   gear    Number of forward gears
[,11]   carb    Number of carburetors

Source: https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html

# Load the dataset

mtcars = sm.datasets.get_rdataset('mtcars').data
mtcars.head()
mpg cyl disp hp drat wt qsec vs am gear carb
rownames
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

Selecting columns

Datasets may sometimes have hundreds of columns, or features. Many features may be redundant, or unrelated to our analytical needs. Many columns may have too many null values to be of practical use.

Several ways to select columns with Pandas:
- Select a single column with df['column_name']
- Select multiple columns with df[['col1', 'col2', 'col3']]
- Select columns whose names contain a given string with df[[col for col in df.columns if 'string' in col]]
- Select columns based on position with df.iloc[:, 2:3], etc.
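For instance, a quick sketch of string-based column selection on the diamonds dataset (df.filter is an equivalent alternative worth knowing):

## Select all columns whose name contains the letter 'c'
diamonds[[col for col in diamonds.columns if 'c' in col]]

## The same selection using filter, which matches a substring of the column name
diamonds.filter(like='c')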

diamonds[['carat', 'price', 'color']]
carat price color
0 0.23 326 E
1 0.21 326 E
2 0.23 327 E
3 0.29 334 I
4 0.31 335 J
... ... ... ...
53935 0.72 2757 D
53936 0.72 2757 D
53937 0.70 2757 D
53938 0.86 2757 H
53939 0.75 2757 D

53940 rows × 3 columns

mtcars[['mpg', 'cyl', 'disp', 'hp', 'wt']]
mpg cyl disp hp wt
rownames
Mazda RX4 21.0 6 160.0 110 2.620
Mazda RX4 Wag 21.0 6 160.0 110 2.875
Datsun 710 22.8 4 108.0 93 2.320
Hornet 4 Drive 21.4 6 258.0 110 3.215
Hornet Sportabout 18.7 8 360.0 175 3.440
Valiant 18.1 6 225.0 105 3.460
Duster 360 14.3 8 360.0 245 3.570
Merc 240D 24.4 4 146.7 62 3.190
Merc 230 22.8 4 140.8 95 3.150
Merc 280 19.2 6 167.6 123 3.440
Merc 280C 17.8 6 167.6 123 3.440
Merc 450SE 16.4 8 275.8 180 4.070
Merc 450SL 17.3 8 275.8 180 3.730
Merc 450SLC 15.2 8 275.8 180 3.780
Cadillac Fleetwood 10.4 8 472.0 205 5.250
Lincoln Continental 10.4 8 460.0 215 5.424
Chrysler Imperial 14.7 8 440.0 230 5.345
Fiat 128 32.4 4 78.7 66 2.200
Honda Civic 30.4 4 75.7 52 1.615
Toyota Corolla 33.9 4 71.1 65 1.835
Toyota Corona 21.5 4 120.1 97 2.465
Dodge Challenger 15.5 8 318.0 150 3.520
AMC Javelin 15.2 8 304.0 150 3.435
Camaro Z28 13.3 8 350.0 245 3.840
Pontiac Firebird 19.2 8 400.0 175 3.845
Fiat X1-9 27.3 4 79.0 66 1.935
Porsche 914-2 26.0 4 120.3 91 2.140
Lotus Europa 30.4 4 95.1 113 1.513
Ford Pantera L 15.8 8 351.0 264 3.170
Ferrari Dino 19.7 6 145.0 175 2.770
Maserati Bora 15.0 8 301.0 335 3.570
Volvo 142E 21.4 4 121.0 109 2.780

Select rows (queries)

Row selection is generally more complex, as we need to apply conditions to only select certain rows. Multiple conditions can be applied simultaneously.

Two approaches in Pandas:
- The more reliable but verbose method: df[(df.Col1 == 1) & (df.col2 == 6)]. This method allows greater flexibility, particularly when doing string searches inside rows.
- Use .query: df.query('conditions separated by & or |'). This method works for most common situations, and the query is easier to construct.
- Use != for not-equal-to.

diamonds.query('carat > 3 & cut == "Premium"')
carat cut color clarity depth table price x y z
19339 3.01 Premium I I1 62.7 58.0 8040 9.10 8.97 5.67
21862 3.01 Premium F I1 62.2 56.0 9925 9.24 9.13 5.73
22428 3.05 Premium E I1 60.9 58.0 10453 9.26 9.25 5.66
24131 3.24 Premium H I1 62.1 58.0 12300 9.44 9.40 5.85
25460 3.01 Premium G SI2 59.8 58.0 14220 9.44 9.37 5.62
25998 4.01 Premium I I1 61.0 61.0 15223 10.14 10.10 6.17
25999 4.01 Premium J I1 62.5 62.0 15223 10.02 9.94 6.24
26534 3.67 Premium I I1 62.4 56.0 16193 9.86 9.81 6.13
27514 3.01 Premium I SI2 60.2 59.0 18242 9.36 9.31 5.62
27638 3.04 Premium I SI2 59.3 60.0 18559 9.51 9.46 5.62
27679 3.51 Premium J VS2 62.5 59.0 18701 9.66 9.63 6.03
27684 3.01 Premium J SI2 60.7 59.0 18710 9.35 9.22 5.64
27685 3.01 Premium J SI2 59.7 58.0 18710 9.41 9.32 5.59
## You can combine the column selection and the row filter:

diamonds[['carat', 'cut']].query('carat > 3').head()
carat cut
19339 3.01 Premium
21758 3.11 Fair
21862 3.01 Premium
22428 3.05 Premium
22540 3.02 Fair
## Perform some queries on the data.
## The following query applies multiple conditions 
## simultaneously to give us the observations 
## we are interested in.


diamonds.query('cut == "Good" \
               and color =="E" and clarity =="VVS2" \
               and price > 10000')

carat cut color clarity depth table price x y z
24630 1.30 Good E VVS2 62.8 59.0 12967 6.95 6.99 4.38
25828 1.41 Good E VVS2 59.9 61.0 14853 7.21 7.37 4.37
27177 1.50 Good E VVS2 64.3 58.0 17449 7.20 7.13 4.61
## Same query using the more verbose method

diamonds[(diamonds["cut"] == "Good") 
         & (diamonds["color"] == "E") & (diamonds["clarity"] == "VVS2") 
         & (diamonds["price"] > 10000)]
carat cut color clarity depth table price x y z
24630 1.30 Good E VVS2 62.8 59.0 12967 6.95 6.99 4.38
25828 1.41 Good E VVS2 59.9 61.0 14853 7.21 7.37 4.37
27177 1.50 Good E VVS2 64.3 58.0 17449 7.20 7.13 4.61
## Query using string searches

diamonds[(diamonds["cut"] == "Good") & (diamonds["clarity"].str.startswith("V"))]
carat cut color clarity depth table price x y z
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
35 0.23 Good F VS1 58.2 59.0 402 4.06 4.08 2.37
36 0.23 Good E VS1 64.1 59.0 402 3.83 3.85 2.46
42 0.26 Good D VS2 65.2 56.0 403 3.99 4.02 2.61
43 0.26 Good D VS1 58.4 63.0 403 4.19 4.24 2.46
... ... ... ... ... ... ... ... ... ... ...
53840 0.71 Good H VVS2 60.4 63.0 2738 5.69 5.74 3.45
53886 0.70 Good D VS2 58.0 62.0 2749 5.78 5.87 3.38
53895 0.70 Good F VS1 57.8 61.0 2751 5.83 5.79 3.36
53913 0.80 Good G VS2 64.2 58.0 2753 5.84 5.81 3.74
53914 0.84 Good I VS1 63.7 59.0 2753 5.94 5.90 3.77

2098 rows × 10 columns

## Another example

diamonds.query('cut == "Good" \
and color == "E" and clarity == "VVS2" \
and price > 10000')
carat cut color clarity depth table price x y z
24630 1.30 Good E VVS2 62.8 59.0 12967 6.95 6.99 4.38
25828 1.41 Good E VVS2 59.9 61.0 14853 7.21 7.37 4.37
27177 1.50 Good E VVS2 64.3 58.0 17449 7.20 7.13 4.61
## Another example

diamonds[(diamonds['cut'] == 'Good') 
       & (diamonds['color'] == 'E')
         & (diamonds['clarity'] == 'VVS2')
           & (diamonds['price'] >10000)]
carat cut color clarity depth table price x y z
24630 1.30 Good E VVS2 62.8 59.0 12967 6.95 6.99 4.38
25828 1.41 Good E VVS2 59.9 61.0 14853 7.21 7.37 4.37
27177 1.50 Good E VVS2 64.3 58.0 17449 7.20 7.13 4.61

Subsetting with loc and iloc

There are two main ways of subsetting a data frame:
  • loc is label based, ie based on the row index and column names
  • iloc is row and column number (position) based

Separate the row and column selections with a comma. If there is no comma, the entire entry is assumed to apply to the rows.

diamonds.iloc[:3] ## only for iloc, the range excludes the right hand number.  Here the row with index 3 is excluded.
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
diamonds.iloc[1,2]
'E'
diamonds.loc[1,'cut']
'Premium'
diamonds.loc[1:3, ['cut','depth']] ## Note here you need the square brackets, in the next one you don't
cut depth
1 Premium 59.8
2 Good 56.9
3 Premium 62.4
diamonds.loc[1:3, 'cut':'depth']
cut color clarity depth
1 Premium E SI1 59.8
2 Good E VS1 56.9
3 Premium I VS2 62.4
diamonds.iloc[1:3,2:4] #See the rows and columns that were excluded
color clarity
1 E SI1
2 E VS1

Understanding Data Types

A ‘data type’ is an internal representation of how Python treats and manipulates data. Python and Pandas can be quite forgiving about data types, but incorrect data types can give you incorrect or unpredictable results, or outright errors. Following are the data types used in Pandas:

Pandas data type Notes
bool True/False values
category Levels, or factors, ie, a determinate list of categorical values
datetime64 Date & time representations
float64 Floating point numbers (ie numbers with decimals)
int64 Integers
object Text, or mixed numeric and non-numeric values
timedelta[ns] Difference between two datetimes

Consider the mtcars dataset.

Examining the data types of different columns, we see that cyl (number of cylinders) is an integer. But this feature has only three discrete values, and can be considered a category.

We can convert this column to a category.

Dates often require similar consideration: they are frequently read in as strings and need to be converted to datetime64.
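As a small illustrative sketch (the values below are made up), dates stored as strings can be converted with pd.to_datetime, after which the .dt accessor becomes available:

## Convert a string column to datetime64
dates = pd.DataFrame({'order_date': ['2021-01-15', '2021-02-03', '2021-03-28']})
dates['order_date'] = pd.to_datetime(dates['order_date'])
dates['order_year'] = dates['order_date'].dt.year
dates.info()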

mtcars.info()
<class 'pandas.core.frame.DataFrame'>
Index: 32 entries, Mazda RX4 to Volvo 142E
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   mpg     32 non-null     float64
 1   cyl     32 non-null     int64  
 2   disp    32 non-null     float64
 3   hp      32 non-null     int64  
 4   drat    32 non-null     float64
 5   wt      32 non-null     float64
 6   qsec    32 non-null     float64
 7   vs      32 non-null     int64  
 8   am      32 non-null     int64  
 9   gear    32 non-null     int64  
 10  carb    32 non-null     int64  
dtypes: float64(5), int64(6)
memory usage: 3.0+ KB

Change column to categorical

x = diamonds[['carat', 'cut']].query('carat > 3')
x.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32 entries, 19339 to 27685
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   carat   32 non-null     float64 
 1   cut     32 non-null     category
dtypes: category(1), float64(1)
memory usage: 756.0 bytes
diamonds['cut'] = diamonds['cut'].astype('category')
diamonds.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype   
---  ------   --------------  -----   
 0   carat    53940 non-null  float64 
 1   cut      53940 non-null  category
 2   color    53940 non-null  category
 3   clarity  53940 non-null  category
 4   depth    53940 non-null  float64 
 5   table    53940 non-null  float64 
 6   price    53940 non-null  int64   
 7   x        53940 non-null  float64 
 8   y        53940 non-null  float64 
 9   z        53940 non-null  float64 
dtypes: category(3), float64(6), int64(1)
memory usage: 3.0 MB
diamonds['cut'].cat.categories
Index(['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'], dtype='object')

How to list categories

list(enumerate(diamonds['cut'].cat.categories))
[(0, 'Ideal'), (1, 'Premium'), (2, 'Very Good'), (3, 'Good'), (4, 'Fair')]
 dict(enumerate(diamonds['cut'].cat.categories))
{0: 'Ideal', 1: 'Premium', 2: 'Very Good', 3: 'Good', 4: 'Fair'}
pd.DataFrame(enumerate(diamonds['cut'].cat.categories))
0 1
0 0 Ideal
1 1 Premium
2 2 Very Good
3 3 Good
4 4 Fair
mtcars.info()
<class 'pandas.core.frame.DataFrame'>
Index: 32 entries, Mazda RX4 to Volvo 142E
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   mpg     32 non-null     float64
 1   cyl     32 non-null     int64  
 2   disp    32 non-null     float64
 3   hp      32 non-null     int64  
 4   drat    32 non-null     float64
 5   wt      32 non-null     float64
 6   qsec    32 non-null     float64
 7   vs      32 non-null     int64  
 8   am      32 non-null     int64  
 9   gear    32 non-null     int64  
 10  carb    32 non-null     int64  
dtypes: float64(5), int64(6)
memory usage: 3.0+ KB
mtcars['cyl'] = mtcars['cyl'].astype('category')
mtcars.info()
<class 'pandas.core.frame.DataFrame'>
Index: 32 entries, Mazda RX4 to Volvo 142E
Data columns (total 11 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   mpg     32 non-null     float64 
 1   cyl     32 non-null     category
 2   disp    32 non-null     float64 
 3   hp      32 non-null     int64   
 4   drat    32 non-null     float64 
 5   wt      32 non-null     float64 
 6   qsec    32 non-null     float64 
 7   vs      32 non-null     int64   
 8   am      32 non-null     int64   
 9   gear    32 non-null     int64   
 10  carb    32 non-null     int64   
dtypes: category(1), float64(5), int64(5)
memory usage: 2.9+ KB

Get column names

You can list columns using df.columns

# List the column names in the diamonds dataset

diamonds.columns
Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
       'z'],
      dtype='object')
## Or, wrap it in list() for cleaner output

list(diamonds.columns)
['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z']
diamonds.columns.tolist()
## same thing as above
['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y', 'z']

Add Columns

Sometimes, you may need to add a column to the data using a calculation.

Example: Add a column norm_carat converting the field carat in the diamonds dataset to a standardized variable.

## Add a column equal to the z-score for the carat variable
diamonds['norm_carat'] = (diamonds['carat'] - diamonds['carat'].mean() )/diamonds['carat'].std()
diamonds.head()
carat cut color clarity depth table price x y z norm_carat
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 -1.198157
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 -1.240350
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 -1.198157
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 -1.071577
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 -1.029384

Add column with insert

Use df.insert(col_location, col_title, contents) to insert at a particular location.

Below we insert a random number at position 3 in the diamonds dataframe.

## mydf.insert(col_location, col_title, contents) does the trick

diamonds.insert(3, "random_value", np.random.randint(0, 100, diamonds.shape[0]))
diamonds.head()
carat cut color random_value clarity depth table price x y z norm_carat
0 0.23 Ideal E 62 SI2 61.5 55.0 326 3.95 3.98 2.43 -1.198157
1 0.21 Premium E 36 SI1 59.8 61.0 326 3.89 3.84 2.31 -1.240350
2 0.23 Good E 5 VS1 56.9 65.0 327 4.05 4.07 2.31 -1.198157
3 0.29 Premium I 79 VS2 62.4 58.0 334 4.20 4.23 2.63 -1.071577
4 0.31 Good J 98 SI2 63.3 58.0 335 4.34 4.35 2.75 -1.029384

Add column with assign

## assign() does not add the column to the original dataframe unless you assign the result back to it

diamonds.assign(total = diamonds.x + diamonds.y + diamonds.z).head()

carat cut color random_value clarity depth table price x y z norm_carat total
0 0.23 Ideal E 62 SI2 61.5 55.0 326 3.95 3.98 2.43 -1.198157 10.36
1 0.21 Premium E 36 SI1 59.8 61.0 326 3.89 3.84 2.31 -1.240350 10.04
2 0.23 Good E 5 VS1 56.9 65.0 327 4.05 4.07 2.31 -1.198157 10.43
3 0.29 Premium I 79 VS2 62.4 58.0 334 4.20 4.23 2.63 -1.071577 11.06
4 0.31 Good J 98 SI2 63.3 58.0 335 4.34 4.35 2.75 -1.029384 11.44
## Notice the 'total' column from above is not there.

diamonds.head(2)

carat cut color random_value clarity depth table price x y z norm_carat
0 0.23 Ideal E 62 SI2 61.5 55.0 326 3.95 3.98 2.43 -1.198157
1 0.21 Premium E 36 SI1 59.8 61.0 326 3.89 3.84 2.31 -1.240350

Delete Columns

Sometimes, you may need to remove certain columns to remove unwanted information, or to reduce the size of the dataset.

del df['ColName'] does the trick for a single column.

Similarly, df.drop(['ColName'], axis = 1, inplace = True) can be used to delete multiple columns. It can also be used to delete rows.

# Example: delete the 'norm_carat' column we inserted earlier

del diamonds['norm_carat']
## Or we can use the drop command

diamonds.drop('random_value', axis=1, inplace=True)
# Let us now look at the list of columns that are left

diamonds.columns
Index(['carat', 'cut', 'color', 'clarity', 'depth', 'table', 'price', 'x', 'y',
       'z'],
      dtype='object')

Reordering Columns

Sometimes you may need to rearrange the order in which columns appear to make them easier to read. In Pandas, that can be done by assigning the new column order to the same dataframe.

Example: Make price the first column in the diamonds dataset.

diamonds = diamonds[['price', 'carat', 'cut', 'color', 'clarity', 'depth', 'table',  'x', 'y','z']]
diamonds
price carat cut color clarity depth table x y z
0 326 0.23 Ideal E SI2 61.5 55.0 3.95 3.98 2.43
1 326 0.21 Premium E SI1 59.8 61.0 3.89 3.84 2.31
2 327 0.23 Good E VS1 56.9 65.0 4.05 4.07 2.31
3 334 0.29 Premium I VS2 62.4 58.0 4.20 4.23 2.63
4 335 0.31 Good J SI2 63.3 58.0 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 2757 0.72 Ideal D SI1 60.8 57.0 5.75 5.76 3.50
53936 2757 0.72 Good D SI1 63.1 55.0 5.69 5.75 3.61
53937 2757 0.70 Very Good D SI1 62.8 60.0 5.66 5.68 3.56
53938 2757 0.86 Premium H SI2 61.0 58.0 6.15 6.12 3.74
53939 2757 0.75 Ideal D SI2 62.2 55.0 5.83 5.87 3.64

53940 rows × 10 columns

Sorting Values

Sorting rows by their values is a common task.

Example: Sort the diamonds dataset by price and carat weight so that the former is sorted in ascending order, and the latter in descending order.

Format is df.sort_values(['a', 'b'], ascending=[True, False])

diamonds.sort_values(['carat', 'price', 'depth'], ascending = [False, False, False]).head(8)
price carat cut color clarity depth table x y z
27415 18018 5.01 Fair J I1 65.5 59.0 10.74 10.54 6.98
27630 18531 4.50 Fair J I1 65.8 58.0 10.23 10.16 6.72
27130 17329 4.13 Fair H I1 64.8 61.0 10.00 9.85 6.43
25999 15223 4.01 Premium J I1 62.5 62.0 10.02 9.94 6.24
25998 15223 4.01 Premium I I1 61.0 61.0 10.14 10.10 6.17
26444 15984 4.00 Very Good I I1 63.3 58.0 10.01 9.94 6.31
26534 16193 3.67 Premium I I1 62.4 56.0 9.86 9.81 6.13
23644 11668 3.65 Fair H I1 67.1 53.0 9.53 9.48 6.38
# Another example of sorting, but with the change updating the dataframe through 'inplace=True'

diamonds.sort_values(['price', 'carat'], ascending=[True, False], inplace = True)
# Let us look at what the sorted dataset looks like

diamonds
price carat cut color clarity depth table x y z
0 326 0.23 Ideal E SI2 61.5 55.0 3.95 3.98 2.43
1 326 0.21 Premium E SI1 59.8 61.0 3.89 3.84 2.31
2 327 0.23 Good E VS1 56.9 65.0 4.05 4.07 2.31
3 334 0.29 Premium I VS2 62.4 58.0 4.20 4.23 2.63
4 335 0.31 Good J SI2 63.3 58.0 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
27745 18803 2.00 Very Good H SI1 62.8 57.0 7.95 8.00 5.01
27746 18804 2.07 Ideal G SI2 62.5 55.0 8.20 8.13 5.11
27747 18806 1.51 Ideal G IF 61.7 55.0 7.37 7.41 4.56
27748 18818 2.00 Very Good G SI1 63.5 56.0 7.90 7.97 5.04
27749 18823 2.29 Premium I VS2 60.8 60.0 8.50 8.47 5.16

53940 rows × 10 columns

Renaming Columns

Renaming columns is often needed to remove spaces in column names, or make everything lowercase, or to provide more descriptive or concise column headings.

Can be done by passing a dictionary of old_name: new_name to the rename function.

Example: Rename price to dollars, and carat to weight in the diamonds dataset.

df.rename(columns={'oldName1': 'newName1', 'oldName2': 'newName2'}, inplace=True)
diamonds.rename(columns = {'price': 'dollars', 'carat': 'weight'})
dollars weight cut color clarity depth table x y z
0 326 0.23 Ideal E SI2 61.5 55.0 3.95 3.98 2.43
1 326 0.21 Premium E SI1 59.8 61.0 3.89 3.84 2.31
2 327 0.23 Good E VS1 56.9 65.0 4.05 4.07 2.31
3 334 0.29 Premium I VS2 62.4 58.0 4.20 4.23 2.63
4 335 0.31 Good J SI2 63.3 58.0 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
27745 18803 2.00 Very Good H SI1 62.8 57.0 7.95 8.00 5.01
27746 18804 2.07 Ideal G SI2 62.5 55.0 8.20 8.13 5.11
27747 18806 1.51 Ideal G IF 61.7 55.0 7.37 7.41 4.56
27748 18818 2.00 Very Good G SI1 63.5 56.0 7.90 7.97 5.04
27749 18823 2.29 Premium I VS2 60.8 60.0 8.50 8.47 5.16

53940 rows × 10 columns

diamonds.rename(columns={'price': 'dollars'}, inplace=False)
dollars carat cut color clarity depth table x y z
0 326 0.23 Ideal E SI2 61.5 55.0 3.95 3.98 2.43
1 326 0.21 Premium E SI1 59.8 61.0 3.89 3.84 2.31
2 327 0.23 Good E VS1 56.9 65.0 4.05 4.07 2.31
3 334 0.29 Premium I VS2 62.4 58.0 4.20 4.23 2.63
4 335 0.31 Good J SI2 63.3 58.0 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
27745 18803 2.00 Very Good H SI1 62.8 57.0 7.95 8.00 5.01
27746 18804 2.07 Ideal G SI2 62.5 55.0 8.20 8.13 5.11
27747 18806 1.51 Ideal G IF 61.7 55.0 7.37 7.41 4.56
27748 18818 2.00 Very Good G SI1 63.5 56.0 7.90 7.97 5.04
27749 18823 2.29 Premium I VS2 60.8 60.0 8.50 8.47 5.16

53940 rows × 10 columns

Removing Duplicates

Duplicate rows may often be found in a DataFrame.

The drop_duplicates() method returns a DataFrame with duplicated rows removed.

By default, all columns are considered when detecting duplicates, but you can limit the detection to a subset of columns. When duplicates are found, the first occurrence is retained by default; this behavior can be changed by specifying keep = 'last' (illustrated after the examples below).

Let us consider an example. We create a toy dataframe as below:

df = pd.DataFrame({'state': ['NY', 'NJ', 'NJ', 'NJ', 'CT', 'CT', 'CT'],
                  'variable1': [2, 3, 2, 2, 4, 6, 6],
                  'variable2': [10, 22, 22, 24, 11, 24, 24]})
df
state variable1 variable2
0 NY 2 10
1 NJ 3 22
2 NJ 2 22
3 NJ 2 24
4 CT 4 11
5 CT 6 24
6 CT 6 24

Note the results of the two drop_duplicates calls below:


## Dropping duplicates considering all columns

df.drop_duplicates()
state variable1 variable2
0 NY 2 10
1 NJ 3 22
2 NJ 2 22
3 NJ 2 24
4 CT 4 11
5 CT 6 24
## Dropping duplicates considering only the first two columns.

## (Note which of the two NJ records was retained!)

df.drop_duplicates(['state', 'variable1'])
state variable1 variable2
0 NY 2 10
1 NJ 3 22
2 NJ 2 22
4 CT 4 11
5 CT 6 24
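As a small follow-on sketch, passing keep='last' retains the last occurrence within each set of duplicates instead of the first:

## Same subset of columns, but keep the last duplicate in each group.
## For the (NJ, 2) duplicates, row 3 (variable2 = 24) is retained instead of row 2,
## and for the (CT, 6) duplicates, row 6 is retained instead of row 5.
df.drop_duplicates(['state', 'variable1'], keep='last')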

Replacing Values

We often need to replace values in data. For example, we may know that 0 means the data is not available, and may wish to replace zeros with NaN.

Let us replace state names in our previous toy dataset with their full names.

We do this using df.replace({'NJ': 'New Jersey', 'NY': 'New York', 'CT': 'Connecticut'}).

Note that this affects the entire dataframe, so be careful to select the right column name if you want the replacement to occur in only one column!

df.replace({'NJ': 'New Jersey', 'NY': 'New York', 'CT': 'Connecticut'})
state variable1 variable2
0 New York 2 10
1 New Jersey 3 22
2 New Jersey 2 22
3 New Jersey 2 24
4 Connecticut 4 11
5 Connecticut 6 24
6 Connecticut 6 24
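To restrict the replacement to a single column, one option is to pass a nested dictionary keyed by the column name, as in this small sketch:

## Only values in the 'state' column are considered for replacement
df.replace({'state': {'NJ': 'New Jersey', 'NY': 'New York', 'CT': 'Connecticut'}})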

Binning using Pandas cut Function

We often need to convert continuous variables into categories.

For example, reconsider our toy dataset. Consider variable1. Say we want anything 2 or lower to be labeled as Small, 2 to 4 as Medium, and 4 to 6 as Large.

We can do this as follows: pd.cut(df.variable1, [0, 2, 4, 6], labels = ["Small", "Medium", "Large"])


We use the pd.cut function to create the bins, and assign them labels as below.

If no labels are specified, mathematical interval notation is used for the bins, ie [(0, 2] < (2, 4] < (4, 6]], where a parenthesis '(' means the endpoint is excluded, and a bracket '[' or ']' means it is included.
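A quick sketch of what the unlabeled bins look like on our toy data:

## Without labels, pd.cut returns the interval each observation falls into
pd.cut(df.variable1, [0, 2, 4, 6])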

df['NewColumn'] = pd.cut(df.variable1, [0, 2, 4, 6], labels = ["Small", "Medium", "Large"])
df
state variable1 variable2 NewColumn
0 NY 2 10 Small
1 NJ 3 22 Medium
2 NJ 2 22 Small
3 NJ 2 24 Small
4 CT 4 11 Medium
5 CT 6 24 Large
6 CT 6 24 Large

Binning into Quantiles using qcut

Sometimes we may desire an equal number of observations in our bins.

In such cases, we can use quantiles as bins, but then the intervals may not be of equal width (though the count of observations in each bin will be similar).

We can also specify arbitrary quantiles as bins.

In Pandas, use pd.qcut(df.variable1, number_of_quantiles) to achieve this.

pd.qcut(df.variable1, 2)
0    (1.999, 3.0]
1    (1.999, 3.0]
2    (1.999, 3.0]
3    (1.999, 3.0]
4      (3.0, 6.0]
5      (3.0, 6.0]
6      (3.0, 6.0]
Name: variable1, dtype: category
Categories (2, interval[float64, right]): [(1.999, 3.0] < (3.0, 6.0]]
pd.qcut(diamonds.price, 4)
0          (325.999, 950.0]
1          (325.999, 950.0]
2          (325.999, 950.0]
3          (325.999, 950.0]
4          (325.999, 950.0]
                ...        
27745    (5324.25, 18823.0]
27746    (5324.25, 18823.0]
27747    (5324.25, 18823.0]
27748    (5324.25, 18823.0]
27749    (5324.25, 18823.0]
Name: price, Length: 53940, dtype: category
Categories (4, interval[float64, right]): [(325.999, 950.0] < (950.0, 2401.0] < (2401.0, 5324.25] < (5324.25, 18823.0]]
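To use arbitrary quantiles rather than equal-sized ones, we can pass the quantile boundaries explicitly; the cut points and labels below are arbitrary choices for illustration:

## Bottom 10%, lower and upper middle (split at the median), and top 10%
pd.qcut(diamonds.price, q=[0, 0.1, 0.5, 0.9, 1],
        labels=['Bottom 10%', 'Lower middle', 'Upper middle', 'Top 10%'])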
diamonds['quartiles']=pd.qcut(diamonds.price, 4)
diamonds['Price_Category']=pd.qcut(diamonds.price, 4, labels=['Economy','Affordable','Pricey','Expensive'])
diamonds.sample(8)
price carat cut color clarity depth table x y z quartiles Price_Category
27174 17442 1.76 Ideal G VS2 60.3 57.0 7.90 7.83 4.74 (5324.25, 18823.0] Expensive
34257 857 0.30 Ideal F VVS1 62.3 56.0 4.30 4.34 2.69 (325.999, 950.0] Economy
21817 9891 1.31 Ideal G VS1 61.5 57.0 7.02 7.06 4.33 (5324.25, 18823.0] Expensive
27277 17730 2.12 Ideal F SI2 62.3 57.0 8.19 8.22 5.11 (5324.25, 18823.0] Expensive
37561 988 0.31 Premium G IF 61.7 54.0 4.36 4.33 2.68 (950.0, 2401.0] Affordable
44087 1554 0.50 Ideal E VS2 62.1 55.0 5.11 5.13 3.18 (950.0, 2401.0] Affordable
24935 13387 2.01 Ideal E SI2 62.1 57.0 8.00 7.88 4.93 (5324.25, 18823.0] Expensive
7607 4259 0.84 Very Good G VVS2 60.4 55.0 6.07 6.12 3.68 (2401.0, 5324.25] Pricey
# ..combining pandas functions to get a list of the unique quartiles and price_category columns

diamonds[['quartiles', 'Price_Category']].drop_duplicates().sort_values(by='quartiles').reset_index()
index quartiles Price_Category
0 0 (325.999, 950.0] Economy
1 36691 (950.0, 2401.0] Affordable
2 51735 (2401.0, 5324.25] Pricey
3 12766 (5324.25, 18823.0] Expensive
# ..looking at counts in each category
diamonds.groupby(['quartiles', 'Price_Category'], observed=True).agg({"carat":"count"}).query('carat>0').rename({'carat':'Count of Diamonds'}, axis=1)
Count of Diamonds
quartiles Price_Category
(325.999, 950.0] Economy 13490
(950.0, 2401.0] Affordable 13495
(2401.0, 5324.25] Pricey 13470
(5324.25, 18823.0] Expensive 13485

Select a random sample of data

Often, we need to look at a sample of the observations. df.head() and df.tail() produce the same records each time, and often it is good to look at other data.

df.sample(n) gives you a random set of n observations from the data.

## Either a discrete number of rows,
diamonds.sample(3)
price carat cut color clarity depth table x y z quartiles Price_Category
9847 4676 0.91 Premium E VS2 62.2 60.0 6.21 6.13 3.84 (2401.0, 5324.25] Pricey
48494 1981 0.52 Ideal F VS2 61.5 56.0 5.19 5.21 3.20 (950.0, 2401.0] Affordable
38402 1024 0.35 Premium H VVS1 60.6 60.0 4.57 4.51 2.75 (950.0, 2401.0] Affordable
## or a fraction of the total data
diamonds.sample(frac = 0.00015)
price carat cut color clarity depth table x y z quartiles Price_Category
13413 5510 1.20 Ideal H SI1 62.9 57.0 6.79 6.76 4.26 (5324.25, 18823.0] Expensive
37504 984 0.41 Premium H SI1 61.8 59.0 4.81 4.77 2.96 (950.0, 2401.0] Affordable
24608 12931 2.06 Premium H SI2 62.6 58.0 8.06 8.03 5.04 (5324.25, 18823.0] Expensive
22350 628 0.31 Very Good I VS1 63.5 57.0 4.28 4.26 2.71 (325.999, 950.0] Economy
15933 6371 1.25 Ideal G SI2 61.1 55.0 6.95 6.99 4.26 (5324.25, 18823.0] Expensive
19798 8365 1.25 Ideal H VS1 62.1 57.0 6.89 6.92 4.29 (5324.25, 18823.0] Expensive
53055 2607 0.72 Ideal F VS2 61.1 56.0 5.83 5.79 3.55 (2401.0, 5324.25] Pricey
49293 539 0.33 Ideal H VS2 61.1 56.0 4.50 4.54 2.76 (325.999, 950.0] Economy

String operations with Pandas

We often have to combine text data, split it, find certain types of text, and perform various other functions on strings.

String munging operations often take up a lot of time when cleaning data.

Pandas offers a number of vectorized string methods through the .str accessor of a Series (a complete list appears in Python for Data Analysis, by Wes McKinney).
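As a small self-contained sketch (the values below are made up), here are a few of the commonly used .str methods:

## A toy column of messy city names
cities = pd.Series([' new york ', 'BOSTON', 'washington dc', 'Boston '])

cities.str.strip()                            # remove leading/trailing whitespace
cities.str.strip().str.title()                # standardize capitalization
cities.str.lower().str.contains('boston')     # boolean mask for string matches
cities.str.strip().str.replace(' ', '_')      # replace characters
cities.str.split()                            # split each string into a list of words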

Value Counts

## value_counts provide the frequency for categorical variables
mtcars.cyl.value_counts()
cyl
8    14
4    11
6     7
Name: count, dtype: int64
## ...and we can get percentages instead too
mtcars.cyl.value_counts(normalize=True)
cyl
8    0.43750
4    0.34375
6    0.21875
Name: proportion, dtype: float64
# Just checking what range(60,64) returns

list(range(60,64))
[60, 61, 62, 63]
## We can specify bins for numerical variables
# Below, we are saying give us value counts for 60-61, 61-62, 62-63.

diamonds = sns.load_dataset("diamonds")

diamonds.depth.value_counts(bins = range(60,64))
depth
(61.0, 62.0]      17945
(62.0, 63.0]      15348
(59.999, 61.0]     8451
Name: count, dtype: int64
# Let us create a random vehicle crash dataframe with three columns showing
# city, cause of crash, and the dollar loss from the accident.
# Some of the values are missing, and are indicated as NaN (we create NaN
# values using np.nan)

df = pd.DataFrame(data = {'city': ['NYC', np.nan, 'Boston', 'Boston', 
                                   'WashingtonDC', np.nan, 'Boston', 
                                   'NYC', 'Boston', 'NYC'],
                         'cause': ['distracted', 'drowsy', 'drowsy', 
                                   np.nan, 'drunk', 'distracted', 
                                   'distracted', np.nan, np.nan, 'drunk'],
                         'dollar_loss': [8194, 4033, 9739, 4876, 4421, 
                                         6094, 5080, 2909, 9712, 2450]})

df
city cause dollar_loss
0 NYC distracted 8194
1 NaN drowsy 4033
2 Boston drowsy 9739
3 Boston NaN 4876
4 WashingtonDC drunk 4421
5 NaN distracted 6094
6 Boston distracted 5080
7 NYC NaN 2909
8 Boston NaN 9712
9 NYC drunk 2450
# Let us check the value_counts by city
# By default, missing values are ignored.
# So you don't see the NaNs.
# You can address it by setting dropna = False, as in the next cell

df.city.value_counts()
city
Boston          4
NYC             3
WashingtonDC    1
Name: count, dtype: int64
df.city.value_counts(dropna = False)
city
Boston          4
NYC             3
NaN             2
WashingtonDC    1
Name: count, dtype: int64
# Instead of counts, you can ask for percentages (expressed as a decimal)

df.city.value_counts(dropna = False, normalize = True)
city
Boston          0.4
NYC             0.3
NaN             0.2
WashingtonDC    0.1
Name: proportion, dtype: float64
# Sometimes, you may use the cumulative sum function to get 
# totals up to that value.  Try running the cell to understand what it does.

df.city.value_counts().cumsum()
city
Boston          4
NYC             7
WashingtonDC    8
Name: count, dtype: int64
df.city.value_counts(dropna = False, normalize = True).cumsum()
city
Boston          0.4
NYC             0.7
NaN             0.9
WashingtonDC    1.0
Name: proportion, dtype: float64

Extract unique values

diamonds.cut.unique()
## You can put `tolist()` at the end to get a cleaner output, or enclose everything in `list`.
['Ideal', 'Premium', 'Good', 'Very Good', 'Fair']
Categories (5, object): ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair']

Groupby

The agg function used with groupby, and the rename function, both take dictionaries as parameters: curly brackets enclosing key: value pairs, with column names in quotes.

For example, {"old_colname" : "new_colname", "old_colname2" : "new_colname2"}

mydf = diamonds.groupby('cut', observed=True)
summ = mydf.agg({"price": "sum", "clarity": "count", "table": "mean"}) 
summ
price clarity table
cut
Ideal 74513487 21551 55.951668
Premium 63221498 13791 58.746095
Very Good 48107623 12082 57.956150
Good 19275009 4906 58.694639
Fair 7017600 1610 59.053789
## Alternatively, everything could be combined together:
diamonds.groupby('cut', observed=True).agg({"price": "sum", "clarity": "count", "table": "mean"})
price clarity table
cut
Ideal 74513487 21551 55.951668
Premium 63221498 13791 58.746095
Very Good 48107623 12082 57.956150
Good 19275009 4906 58.694639
Fair 7017600 1610 59.053789
## Or, groupby two variables:
diamonds.groupby(['cut', 'color'], observed=True).agg({"price": "sum", "clarity": "count", "table": "mean"})
price clarity table
cut color
Ideal D 7450854 2834 55.965632
E 10138238 3903 55.967461
F 12912518 3826 55.924203
G 18171930 4884 55.902375
H 12115278 3115 55.965843
I 9317974 2093 56.021357
J 4406695 896 56.012612
Premium D 5820962 1603 58.718964
E 8270443 2337 58.779461
F 10081319 2331 58.679279
G 13160170 2924 58.702360
H 12311428 2360 58.792034
I 8491146 1428 58.771849
J 5086030 808 58.874752
Very Good D 5250817 1513 58.041309
E 7715165 2400 58.038875
F 8177367 2164 57.848429
G 8903461 2299 57.784428
H 8272552 1824 57.903015
I 6328079 1204 58.105150
J 3460182 678 58.277729
Good D 2254363 662 58.541541
E 3194260 933 58.779957
F 3177637 909 58.910891
G 3591553 871 58.471986
H 3001931 702 58.611111
I 2650994 522 58.773946
J 1404271 307 58.813029
Fair D 699443 163 58.969325
E 824838 224 59.364732
F 1194025 312 59.453205
G 1331126 314 58.773248
H 1556112 303 58.696370
I 819953 175 59.237143
J 592103 119 58.917647

rename columns with Groupby

## We continue the above examples to rename the aggregated columns we created using groupby

diamonds.groupby('cut', observed=True).agg({"price": "sum", 
                             "clarity": "count"}).rename(columns = {"price": "total_price", "clarity": "diamond_count"})
total_price diamond_count
cut
Ideal 74513487 21551
Premium 63221498 13791
Very Good 48107623 12082
Good 19275009 4906
Fair 7017600 1610

Joining Data with Merge

The data analyst often has to combine data frames much in the same way as database join operations work, which requires connecting two tables based on a reference field.

(Joins of all types are discussed at https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html)

Thinking of the two tables as a Venn diagram, where A is the set of keys found only in the left table, B is the set of keys found in both, and C is the set of keys found only in the right table:

  • Outer Join = A + B + C (keep everything)
  • Inner Join = B (keep only the intersection)
  • Left Join = A + B (keep all from the left)
  • Right Join = B + C (keep all from the right)

You can join data using the pandas merge function.



Consider the below data frames. The first one is the left data frame, and the second one is the right data frame.

Next consider the four types of joins - left, right, inner and outer.

## Left data frame

np.random.seed(2)
n = 5
df = pd.DataFrame(
    {'state': list(np.random.choice(["New York", "Florida", "California"], size=(n))), 
     'gender': list(np.random.choice(["Male", "Female"], size=(n), p=[.4, .6])),
     'housing': list(np.random.choice(["Rent", "Own"], size=(n))),     
     'height': list(np.random.randint(140,200,n))
     })
## Left data frame

df
state gender housing height
0 New York Female Own 177
1 Florida Male Rent 179
2 New York Male Rent 143
3 California Female Rent 178
4 California Male Rent 144
## Right data frame

df_rent = pd.DataFrame(
          {'state': ["Connecticut", "Florida", "California"],
           'avg_rent': [3500, 2200, 4500]})
## Right data frame

df_rent
state avg_rent
0 Connecticut 3500
1 Florida 2200
2 California 4500

Left Join

df.merge(df_rent, how = 'left' , 
         left_on = 'state', right_on = 'state')
state gender housing height avg_rent
0 New York Female Own 177 NaN
1 Florida Male Rent 179 2200.0
2 New York Male Rent 143 NaN
3 California Female Rent 178 4500.0
4 California Male Rent 144 4500.0

Right Join

df.merge(df_rent, how = 'right' , 
         left_on = 'state', right_on = 'state')
state gender housing height avg_rent
0 Connecticut NaN NaN NaN 3500
1 Florida Male Rent 179.0 2200
2 California Female Rent 178.0 4500
3 California Male Rent 144.0 4500

Inner Join

df.merge(df_rent, how = 'inner' , 
         left_on = 'state', right_on = 'state')
state gender housing height avg_rent
0 Florida Male Rent 179 2200
1 California Female Rent 178 4500
2 California Male Rent 144 4500

Outer Join

df.merge(df_rent, how = 'outer' , 
         left_on = 'state', right_on = 'state')
state gender housing height avg_rent
0 New York Female Own 177.0 NaN
1 New York Male Rent 143.0 NaN
2 Florida Male Rent 179.0 2200.0
3 California Female Rent 178.0 4500.0
4 California Male Rent 144.0 4500.0
5 Connecticut NaN NaN NaN 3500.0

Concatenation

Sometimes we need to simply combine datasets without any fancy operations.

Imagine you have 3 files, each for a different month, and you need to stack them vertically one after the other.

Occasionally, you may need to stack datasets horizontally, ie right next to each other. For example, imagine you have 2 files, one with names and ages, and the other with names and income. You may just want to ‘stack’ the data next to each other.

As a common operation, this can be done with Pandas’s concat() command.

We use the same df and df_rent dataframes as in the merge examples above to illustrate how pd.concat works.


pd.concat([df_rent, df], axis = 1)
state avg_rent state gender housing height
0 Connecticut 3500.0 New York Female Own 177
1 Florida 2200.0 Florida Male Rent 179
2 California 4500.0 New York Male Rent 143
3 NaN NaN California Female Rent 178
4 NaN NaN California Male Rent 144
pd.concat([df_rent, df], axis = 0)
state avg_rent gender housing height
0 Connecticut 3500.0 NaN NaN NaN
1 Florida 2200.0 NaN NaN NaN
2 California 4500.0 NaN NaN NaN
0 New York NaN Female Own 177.0
1 Florida NaN Male Rent 179.0
2 New York NaN Male Rent 143.0
3 California NaN Female Rent 178.0
4 California NaN Male Rent 144.0

Concatenation example

We load the penguins dataset.

# Load the penguins dataset

df = sns.load_dataset('penguins')
df
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
... ... ... ... ... ... ... ...
339 Gentoo Biscoe NaN NaN NaN NaN NaN
340 Gentoo Biscoe 46.8 14.3 215.0 4850.0 Female
341 Gentoo Biscoe 50.4 15.7 222.0 5750.0 Male
342 Gentoo Biscoe 45.2 14.8 212.0 5200.0 Female
343 Gentoo Biscoe 49.9 16.1 213.0 5400.0 Male

344 rows × 7 columns

df1 = df.sample(6).reset_index(drop=True)
df2 = df.sample(4).reset_index(drop=True)
df1
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Gentoo Biscoe 45.2 15.8 215.0 5300.0 Male
1 Adelie Biscoe 37.9 18.6 172.0 3150.0 Female
2 Adelie Biscoe 37.8 18.3 174.0 3400.0 Female
3 Adelie Dream 40.6 17.2 187.0 3475.0 Male
4 Chinstrap Dream 50.1 17.9 190.0 3400.0 Female
5 Gentoo Biscoe 46.5 13.5 210.0 4550.0 Female
df2
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Gentoo Biscoe 46.4 15.6 221.0 5000.0 Male
1 Gentoo Biscoe 48.4 14.4 203.0 4625.0 Female
2 Gentoo Biscoe 49.0 16.1 216.0 5550.0 Male
3 Adelie Torgersen 35.9 16.6 190.0 3050.0 Female
pd.concat([df1,df2], axis=0)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Gentoo Biscoe 45.2 15.8 215.0 5300.0 Male
1 Adelie Biscoe 37.9 18.6 172.0 3150.0 Female
2 Adelie Biscoe 37.8 18.3 174.0 3400.0 Female
3 Adelie Dream 40.6 17.2 187.0 3475.0 Male
4 Chinstrap Dream 50.1 17.9 190.0 3400.0 Female
5 Gentoo Biscoe 46.5 13.5 210.0 4550.0 Female
0 Gentoo Biscoe 46.4 15.6 221.0 5000.0 Male
1 Gentoo Biscoe 48.4 14.4 203.0 4625.0 Female
2 Gentoo Biscoe 49.0 16.1 216.0 5550.0 Male
3 Adelie Torgersen 35.9 16.6 190.0 3050.0 Female
pd.concat([df1,df2], axis=1)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Gentoo Biscoe 45.2 15.8 215.0 5300.0 Male Gentoo Biscoe 46.4 15.6 221.0 5000.0 Male
1 Adelie Biscoe 37.9 18.6 172.0 3150.0 Female Gentoo Biscoe 48.4 14.4 203.0 4625.0 Female
2 Adelie Biscoe 37.8 18.3 174.0 3400.0 Female Gentoo Biscoe 49.0 16.1 216.0 5550.0 Male
3 Adelie Dream 40.6 17.2 187.0 3475.0 Male Adelie Torgersen 35.9 16.6 190.0 3050.0 Female
4 Chinstrap Dream 50.1 17.9 190.0 3400.0 Female NaN NaN NaN NaN NaN NaN NaN
5 Gentoo Biscoe 46.5 13.5 210.0 4550.0 Female NaN NaN NaN NaN NaN NaN NaN
df2.index = [3,4,5,6]
pd.concat([df1,df2], axis=0)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Gentoo Biscoe 45.2 15.8 215.0 5300.0 Male
1 Adelie Biscoe 37.9 18.6 172.0 3150.0 Female
2 Adelie Biscoe 37.8 18.3 174.0 3400.0 Female
3 Adelie Dream 40.6 17.2 187.0 3475.0 Male
4 Chinstrap Dream 50.1 17.9 190.0 3400.0 Female
5 Gentoo Biscoe 46.5 13.5 210.0 4550.0 Female
3 Gentoo Biscoe 46.4 15.6 221.0 5000.0 Male
4 Gentoo Biscoe 48.4 14.4 203.0 4625.0 Female
5 Gentoo Biscoe 49.0 16.1 216.0 5550.0 Male
6 Adelie Torgersen 35.9 16.6 190.0 3050.0 Female
pd.concat([df1,df2], axis=1)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Gentoo Biscoe 45.2 15.8 215.0 5300.0 Male NaN NaN NaN NaN NaN NaN NaN
1 Adelie Biscoe 37.9 18.6 172.0 3150.0 Female NaN NaN NaN NaN NaN NaN NaN
2 Adelie Biscoe 37.8 18.3 174.0 3400.0 Female NaN NaN NaN NaN NaN NaN NaN
3 Adelie Dream 40.6 17.2 187.0 3475.0 Male Gentoo Biscoe 46.4 15.6 221.0 5000.0 Male
4 Chinstrap Dream 50.1 17.9 190.0 3400.0 Female Gentoo Biscoe 48.4 14.4 203.0 4625.0 Female
5 Gentoo Biscoe 46.5 13.5 210.0 4550.0 Female Gentoo Biscoe 49.0 16.1 216.0 5550.0 Male
6 NaN NaN NaN NaN NaN NaN NaN Adelie Torgersen 35.9 16.6 190.0 3050.0 Female
# Notice above dataframe has columns with identical names
# For example,the column bill_depth_mm appears twice.
# If we try to select that column, all columns with that name are listed

temp = pd.concat([df1,df2], axis=1)
temp[['bill_depth_mm']]
bill_depth_mm bill_depth_mm
0 15.8 NaN
1 18.6 NaN
2 18.3 NaN
3 17.2 15.6
4 17.9 14.4
5 13.5 16.1
6 NaN 16.6

Dealing with Missing Values

Missing data takes one of two forms.
1. Entire rows of data may be missing: In such situations, you will need to think about whether the remaining dataset is still valuable.
- Consider if you can assess how much data is missing. If only a small portion of the data is missing, say 10%, then you may still be able to use it for meaningful analytics.
- Consider why the data is missing. If the absent data is missing at random, what you have available may still be a representative sample.
- Consider if you can re-acquire the data, or address the underlying problems and wait to collect the complete dataset.
2. Some values may be missing in the data, while others are present.
- We can remove the rows that have missing values.
- We can replace the missing values with a static default (eg, the mean, or the median).
- We can try to compute the values in a more structured way.

As an example, we will next create a small dataset and artificially insert some missing values.


## We create a random dataset

np.random.seed(1)
n = 5
df = pd.DataFrame(
    {'state': list(np.random.choice(["New York", "Florida", "California"], size=(n))), 
     'gender': list(np.random.choice(["Male", "Female"], size=(n), p=[.4, .6])),
     'housing': list(np.random.choice(["Rent", "Own"], size=(n))),    
     'height': list(np.random.randint(140,200,n))
     })
## Now we loop through the data and replace a quarter of the values with NaN (`np.nan`)

for row in range(df.shape[0]):
    for col in range(df.shape[1]):
        if np.random.uniform() < 0.25:
            df.iloc[row,col] = np.nan
## Notice the `NaN` values inserted  
df
state gender housing height
0 Florida Male Rent 160.0
1 New York Male Rent 151.0
2 NaN Male Rent 182.0
3 Florida Male NaN 168.0
4 Florida Male Rent NaN

Understanding the extent of missing values

We can count the number of null values, by rows as well as columns.

In pandas, it is easy to identify null values using df.isna(). While this provides us a series of True/False Booleans, we can use the sum() command to get the total count of nulls as Booleans are also considered equal to 1 and 0 (for True and False respectively).

Using the axis parameter, we can specify whether to count missing values by rows (axis = 1) or by columns (axis = 0, the default).

  • Count of nulls for each column: df.isna().sum(axis=0)
  • Count of nulls for each row: df.isna().sum(axis=1)
## Count missing values - by columns

df.isna().sum(axis=0)

state      1
gender     0
housing    1
height     1
dtype: int64
## Count missing values - by rows

df.isna().sum(axis=1)

0    0
1    0
2    1
3    1
4    1
dtype: int64
## Count missing values - by columns, sorted

df.isna().sum(axis=0).sort_values(ascending=False)

state      1
housing    1
height     1
gender     0
dtype: int64
df.isna().sum(axis=1).sort_values(ascending=False)
2    1
3    1
4    1
0    0
1    0
dtype: int64

How to think about missing values

Sometimes, entire rows (observations) or columns (features) may be missing from the data (for example, you discover that you are missing data for a city, person, year, etc.). If the data was never there in the first place, there is no easy programmatic way to discover the omission. You may find out about it only accidentally, or through your exploratory data analysis.

In such situations, you will need to think about whether the remaining dataset is still valuable.
- Consider if you can assess how much data is missing. If only a small portion of the data is missing, say 10%, then you may still be able to use it for meaningful analytics.
- Consider why the data is missing. If the absent data is missing at random, what you have available may still be a representative sample.
- Consider if you can re-acquire the data, or address the underlying problems and wait to collect the complete dataset.

Approaches
When some values in the data are missing:
1. Drop rows with nulls: if data is missing at random, and the remaining data is sufficient for us to build generalizable analytics and models.
- If data is not missing at random, and rows with missing data are dropped, this can introduce bias into our models.
2. Drop features/columns with nulls: features that have a great deal of data missing at random can often be dropped without affecting analytical usefulness.
3. Replace with a static default: using a summary statistic, eg the mean or median, is often an easy way to replace missing values.
4. Impute missing values using more advanced methods (a small sketch follows below).
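As a small sketch of one such structured approach (group-wise imputation, using made-up data), we can fill missing values in a column with the mean of each group rather than the overall mean:

## Toy data: one missing height per city
tmp = pd.DataFrame({'city':   ['NYC', 'NYC', 'NYC', 'Boston', 'Boston', 'Boston'],
                    'height': [170.0, np.nan, 180.0, 160.0, 162.0, np.nan]})

## Fill each missing height with the mean height of that city
tmp['height'] = tmp['height'].fillna(tmp.groupby('city')['height'].transform('mean'))
tmp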


Drop Missing Values

The simplest approach is to drop the rows that have a missing value. This will leave only the rows that are fully populated.

Pandas offers the function dropna() to remove rows with missing values.

You can control which rows are deleted:

  • Set a threshold with the thresh parameter – a row is kept only if it has at least that many non-null values
  • how='any' or how='all' – drop a row if any value is missing (the default), or only if every value is missing. Both options are illustrated in a short sketch after the basic example below.
Drop rows with missing values
df
state gender housing height
0 Florida Male Rent 160.0
1 New York Male Rent 151.0
2 NaN Male Rent 182.0
3 Florida Male NaN 168.0
4 Florida Male Rent NaN
df.dropna()
state gender housing height
0 Florida Male Rent 160.0
1 New York Male Rent 151.0
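A short sketch of the thresh, how, and subset parameters on the same dataframe (here, every row already has at least three non-null values, so thresh=3 keeps them all):

df.dropna(thresh=3)            ## keep rows with at least 3 non-null values
df.dropna(how='all')           ## drop only rows where every value is missing (none here)
df.dropna(subset=['height'])   ## drop rows where 'height' specifically is missing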
Drop columns with missing values

This is a very similar approach to the one used for rows, except that the axis along which we evaluate deletion is vertical instead of horizontal.

Any columns that have a missing value are deleted.

df.dropna(axis = 1)
gender
0 Male
1 Male
2 Male
3 Male
4 Male
df

state gender housing height
0 Florida Male Rent 160.0
1 New York Male Rent 151.0
2 NaN Male Rent 182.0
3 Florida Male NaN 168.0
4 Florida Male Rent NaN

Fill Missing Data

Dropping rows or columns that have missing data may not always be a feasible strategy as the remainder of the dataset may become too small.

Another reason is that we may not want to throw away all the other known information just because one data point for an observation or a feature is not known.

If the data is not missing at random (for example, one sensor in the data collection apparatus was malfunctioning) and all the NaN values relate to a particular type of observation, dropping those observations would introduce bias into any analytics we perform.

A viable approach in such cases may be to replace the missing values with an estimate, such as the mean, the median, or the most frequent value.

Using df.fillna(), we can fill the holes in the data in a number of ways. With df.fillna(constant), we replace every NaN with a constant we specify. If NaNs appear in multiple columns, we can instead pass a dictionary that maps each column to its own fill value.

With df.fillna(df.mean(numeric_only=True)), we can replace NaNs in numeric columns with the column means, and similarly for the median and other calculated measures.
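
Both variants are sketched below; the specific fill values in the dictionary are illustrative choices, not part of the original data.

## Per-column fills: a dictionary maps each column to its own fill value
df.fillna({'state': 'Connecticut', 'housing': 'Rent', 'height': df['height'].mean()})

## Fill every numeric column with its own mean
df.fillna(df.mean(numeric_only=True))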

## Fill missing values across the entire dataframe

df.fillna('Connecticut')
state gender housing height
0 Florida Male Rent 160.0
1 New York Male Rent 151.0
2 Connecticut Male Rent 182.0
3 Florida Male Connecticut 168.0
4 Florida Male Rent Connecticut
## Fill missing values in only a single column

df['state'] = df['state'].fillna('Connecticut')
df
state gender housing height
0 Florida Male Rent 160.0
1 New York Male Rent 151.0
2 Connecticut Male Rent 182.0
3 Florida Male NaN 168.0
4 Florida Male Rent NaN

Forward and Backward Fill

For time series data, we might like to use forward-fill (also called ‘last-observation-carried-forward’, or locf), and backward-fill (opposite of locf).

  • df.ffill(): propagate the last valid observation forward to fill gaps
  • df.bfill(): use the next valid observation to fill gaps
## Let us make some of the height numbers NaN
df.loc[[0,3], 'height'] = np.nan
df
state gender housing height
0 Florida Male Rent NaN
1 New York Male Rent 151.0
2 Connecticut Male Rent 182.0
3 Florida Male NaN NaN
4 Florida Male Rent NaN
# Forward fill
df.ffill()

state gender housing height
0 Florida Male Rent NaN
1 New York Male Rent 151.0
2 Connecticut Male Rent 182.0
3 Florida Male Rent 182.0
4 Florida Male Rent 182.0
# Backward fill
df.bfill()

state gender housing height
0 Florida Male Rent 151.0
1 New York Male Rent 151.0
2 Connecticut Male Rent 182.0
3 Florida Male Rent 182.0
4 Florida Male Rent 182.0
# We load monthly data on wine sales by wine makers (pmdarima's wineind dataset)
import pmdarima
df = pd.DataFrame(pmdarima.datasets.load_wineind(as_series = True), columns=['sales'])
df
sales
Jan 1980 15136.0
Feb 1980 16733.0
Mar 1980 20016.0
Apr 1980 17708.0
May 1980 18019.0
... ...
Apr 1994 26323.0
May 1994 23779.0
Jun 1994 27549.0
Jul 1994 29660.0
Aug 1994 23356.0

176 rows × 1 columns

## Now we loop through the data and replace roughly half of the values with NaN (np.nan)

for row in range(df.shape[0]):
    for col in range(df.shape[1]):
        if np.random.uniform() < 0.5:
            df.iloc[row,col] = np.nan
df = df[:20]
df
sales
Jan 1980 NaN
Feb 1980 NaN
Mar 1980 20016.0
Apr 1980 NaN
May 1980 NaN
Jun 1980 19227.0
Jul 1980 22893.0
Aug 1980 NaN
Sep 1980 NaN
Oct 1980 NaN
Nov 1980 NaN
Dec 1980 NaN
Jan 1981 NaN
Feb 1981 17977.0
Mar 1981 NaN
Apr 1981 21354.0
May 1981 NaN
Jun 1981 22125.0
Jul 1981 25817.0
Aug 1981 NaN
df.ffill()
sales
Jan 1980 NaN
Feb 1980 NaN
Mar 1980 20016.0
Apr 1980 20016.0
May 1980 20016.0
Jun 1980 19227.0
Jul 1980 22893.0
Aug 1980 22893.0
Sep 1980 22893.0
Oct 1980 22893.0
Nov 1980 22893.0
Dec 1980 22893.0
Jan 1981 22893.0
Feb 1981 17977.0
Mar 1981 17977.0
Apr 1981 21354.0
May 1981 21354.0
Jun 1981 22125.0
Jul 1981 25817.0
Aug 1981 25817.0
df.bfill()
sales
Jan 1980 20016.0
Feb 1980 20016.0
Mar 1980 20016.0
Apr 1980 19227.0
May 1980 19227.0
Jun 1980 19227.0
Jul 1980 22893.0
Aug 1980 17977.0
Sep 1980 17977.0
Oct 1980 17977.0
Nov 1980 17977.0
Dec 1980 17977.0
Jan 1981 17977.0
Feb 1981 17977.0
Mar 1981 21354.0
Apr 1981 21354.0
May 1981 22125.0
Jun 1981 22125.0
Jul 1981 25817.0
Aug 1981 NaN


Imputation using sklearn

Discarding entire rows or columns, or replacing values with the mean, may work well in some situations. A more sophisticated approach is to model the missing data and use ML techniques to estimate the missing information.

Scikit-learn’s documentation describes multivariate feature imputation as follows:

A more sophisticated approach is to use the IterativeImputer class, which models each feature with missing values as a function of other features, and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y. Then, the regressor is used to predict the missing values of y. This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned.

Source: https://scikit-learn.org/stable/modules/impute.html

The R ecosystem has several libraries that implement the MICE (Multivariate Imputation by Chained Equations) algorithm.

A nice write-up and graphic explaining the process is available here: https://cran.r-project.org/web/packages/miceRanger/vignettes/miceAlgorithm.html

[Figures: illustration of the MICE algorithm, sourced from cran.r-project.org]

Let us get some data on which to perform imputations. Since we are working in Python, we will not use the R packages above, but sklearn's imputers instead.

## Let us look at the mtcars dataset

import statsmodels.api as sm
df = sm.datasets.get_rdataset('mtcars').data
df
mpg cyl disp hp drat wt qsec vs am gear carb
rownames
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
## Next, we replace a quarter of the values in the data with NaNs

for row in range(df.shape[0]):
    for col in range(df.shape[1]):
        if np.random.uniform() < 0.25:
            df.iloc[row,col] = np.nan
df
mpg cyl disp hp drat wt qsec vs am gear carb
rownames
Mazda RX4 21.0 NaN 160.0 110.0 3.90 2.620 16.46 NaN NaN NaN 4.0
Mazda RX4 Wag 21.0 6.0 NaN 110.0 3.90 2.875 NaN 0.0 NaN 4.0 4.0
Datsun 710 22.8 4.0 108.0 NaN 3.85 2.320 18.61 1.0 1.0 4.0 1.0
Hornet 4 Drive NaN 6.0 258.0 110.0 3.08 3.215 19.44 1.0 0.0 3.0 1.0
Hornet Sportabout 18.7 8.0 360.0 NaN 3.15 3.440 17.02 0.0 NaN 3.0 2.0
Valiant 18.1 NaN 225.0 105.0 2.76 3.460 20.22 NaN 0.0 3.0 1.0
Duster 360 14.3 8.0 360.0 245.0 3.21 3.570 15.84 0.0 0.0 3.0 4.0
Merc 240D 24.4 4.0 146.7 62.0 3.69 3.190 20.00 NaN 0.0 NaN NaN
Merc 230 22.8 4.0 140.8 95.0 NaN 3.150 22.90 1.0 0.0 4.0 2.0
Merc 280 19.2 6.0 167.6 NaN 3.92 3.440 18.30 NaN 0.0 NaN 4.0
Merc 280C 17.8 6.0 167.6 NaN 3.92 3.440 NaN 1.0 NaN 4.0 NaN
Merc 450SE 16.4 NaN 275.8 180.0 3.07 4.070 NaN NaN 0.0 NaN 3.0
Merc 450SL 17.3 NaN 275.8 NaN 3.07 3.730 17.60 0.0 NaN NaN 3.0
Merc 450SLC 15.2 8.0 275.8 180.0 NaN NaN 18.00 NaN 0.0 3.0 3.0
Cadillac Fleetwood NaN 8.0 472.0 205.0 2.93 NaN 17.98 0.0 NaN 3.0 4.0
Lincoln Continental NaN 8.0 460.0 NaN NaN 5.424 17.82 0.0 0.0 3.0 4.0
Chrysler Imperial 14.7 8.0 440.0 230.0 3.23 NaN 17.42 NaN 0.0 3.0 4.0
Fiat 128 32.4 4.0 NaN 66.0 4.08 2.200 NaN NaN NaN NaN 1.0
Honda Civic 30.4 4.0 75.7 NaN 4.93 1.615 18.52 NaN NaN NaN 2.0
Toyota Corolla 33.9 4.0 71.1 NaN 4.22 1.835 19.90 NaN 1.0 4.0 1.0
Toyota Corona 21.5 NaN 120.1 97.0 3.70 NaN NaN NaN 0.0 3.0 1.0
Dodge Challenger 15.5 8.0 318.0 150.0 2.76 NaN 16.87 0.0 0.0 3.0 2.0
AMC Javelin 15.2 8.0 NaN 150.0 NaN 3.435 17.30 0.0 0.0 3.0 NaN
Camaro Z28 NaN NaN 350.0 245.0 3.73 3.840 NaN 0.0 0.0 3.0 4.0
Pontiac Firebird 19.2 8.0 400.0 175.0 3.08 3.845 17.05 0.0 NaN 3.0 2.0
Fiat X1-9 27.3 4.0 79.0 66.0 4.08 1.935 18.90 1.0 1.0 4.0 1.0
Porsche 914-2 26.0 NaN 120.3 91.0 4.43 NaN 16.70 NaN 1.0 5.0 NaN
Lotus Europa 30.4 4.0 NaN 113.0 3.77 1.513 NaN 1.0 1.0 5.0 2.0
Ford Pantera L NaN 8.0 351.0 NaN NaN NaN NaN 0.0 1.0 NaN 4.0
Ferrari Dino NaN 6.0 145.0 175.0 3.62 2.770 15.50 0.0 1.0 5.0 6.0
Maserati Bora 15.0 NaN 301.0 NaN NaN NaN 14.60 0.0 1.0 5.0 8.0
Volvo 142E 21.4 4.0 121.0 NaN NaN 2.780 18.60 1.0 1.0 4.0 2.0

Iterative Imputer (sklearn)

The Iterative Imputer models each feature with missing values as a function of other features, and uses that estimate for imputation. It does so in an iterated round-robin fashion: at each step, a feature column is designated as output y and the other feature columns are treated as inputs X. A regressor is fit on (X, y) for known y.

Then, the regressor is used to predict the missing values of y.

This is done for each feature in an iterative fashion, and then is repeated for max_iter imputation rounds. The results of the final imputation round are returned.

Source: https://scikit-learn.org/stable/modules/impute.html#iterative-imputer

## The enable_iterative_imputer import is required because IterativeImputer
## is still flagged as experimental in scikit-learn
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp = IterativeImputer(max_iter=100, random_state=0)

pd.DataFrame(imp.fit_transform(df), columns = df.columns, index = df.index).round(1)
mpg cyl disp hp drat wt qsec vs am gear carb
rownames
Mazda RX4 21.0 5.2 160.0 110.0 3.9 2.6 16.5 0.7 0.9 4.5 4.0
Mazda RX4 Wag 21.0 6.0 162.3 110.0 3.9 2.9 16.8 0.0 0.7 4.0 4.0
Datsun 710 22.8 4.0 108.0 90.1 3.8 2.3 18.6 1.0 1.0 4.0 1.0
Hornet 4 Drive 20.6 6.0 258.0 110.0 3.1 3.2 19.4 1.0 0.0 3.0 1.0
Hornet Sportabout 18.7 8.0 360.0 190.3 3.2 3.4 17.0 0.0 0.2 3.0 2.0
Valiant 18.1 5.7 225.0 105.0 2.8 3.5 20.2 0.6 0.0 3.0 1.0
Duster 360 14.3 8.0 360.0 245.0 3.2 3.6 15.8 0.0 0.0 3.0 4.0
Merc 240D 24.4 4.0 146.7 62.0 3.7 3.2 20.0 0.9 0.0 4.3 2.8
Merc 230 22.8 4.0 140.8 95.0 3.9 3.2 22.9 1.0 0.0 4.0 2.0
Merc 280 19.2 6.0 167.6 114.7 3.9 3.4 18.3 0.6 0.0 3.9 4.0
Merc 280C 17.8 6.0 167.6 115.0 3.9 3.4 18.1 1.0 0.4 4.0 4.1
Merc 450SE 16.4 6.9 275.8 180.0 3.1 4.1 19.0 0.2 0.0 3.2 3.0
Merc 450SL 17.3 6.7 275.8 157.5 3.1 3.7 17.6 0.0 0.3 3.5 3.0
Merc 450SLC 15.2 8.0 275.8 180.0 3.4 3.6 18.0 0.2 0.0 3.0 3.0
Cadillac Fleetwood 13.4 8.0 472.0 205.0 2.9 5.2 18.0 0.0 -0.2 3.0 4.0
Lincoln Continental 10.8 8.0 460.0 231.2 2.8 5.4 17.8 0.0 0.0 3.0 4.0
Chrysler Imperial 14.7 8.0 440.0 230.0 3.2 4.6 17.4 -0.2 0.0 3.0 4.0
Fiat 128 32.4 4.0 132.1 66.0 4.1 2.2 19.8 0.9 0.8 4.2 1.0
Honda Civic 30.4 4.0 75.7 76.1 4.9 1.6 18.5 0.9 1.0 4.4 2.0
Toyota Corolla 33.9 4.0 71.1 73.3 4.2 1.8 19.9 0.9 1.0 4.0 1.0
Toyota Corona 21.5 4.8 120.1 97.0 3.7 2.8 20.8 0.8 0.0 3.0 1.0
Dodge Challenger 15.5 8.0 318.0 150.0 2.8 3.7 16.9 0.0 0.0 3.0 2.0
AMC Javelin 15.2 8.0 259.5 150.0 3.5 3.4 17.3 0.0 0.0 3.0 3.0
Camaro Z28 14.4 8.2 350.0 245.0 3.7 3.8 16.9 0.0 0.0 3.0 4.0
Pontiac Firebird 19.2 8.0 400.0 175.0 3.1 3.8 17.0 0.0 0.2 3.0 2.0
Fiat X1-9 27.3 4.0 79.0 66.0 4.1 1.9 18.9 1.0 1.0 4.0 1.0
Porsche 914-2 26.0 4.7 120.3 91.0 4.4 2.4 16.7 0.8 1.0 5.0 4.5
Lotus Europa 30.4 4.0 99.9 113.0 3.8 1.5 18.1 1.0 1.0 5.0 2.0
Ford Pantera L 17.0 8.0 351.0 187.6 3.2 3.7 14.7 0.0 1.0 3.7 4.0
Ferrari Dino 18.9 6.0 145.0 175.0 3.6 2.8 15.5 0.0 1.0 5.0 6.0
Maserati Bora 15.0 7.0 301.0 168.8 3.3 4.3 14.6 0.0 1.0 5.0 8.0
Volvo 142E 21.4 4.0 121.0 95.6 3.9 2.8 18.6 1.0 1.0 4.0 2.0

Let us compare imputed results to actual results in our original data.

[Figure: iteratively imputed values compared with the original data]

Not bad!!
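
The comparison in the figure was produced outside these cells. A minimal sketch of how it could be reproduced in code, assuming an untouched copy of the complete data had been saved as df_original (a hypothetical name, not defined above) before the NaNs were inserted:

## Sketch: measure imputation error on the cells that were made missing
mask = df.isna()                                    # the cells that were knocked out
imputed = pd.DataFrame(imp.fit_transform(df), columns=df.columns, index=df.index)
errors = (imputed - df_original)[mask]              # differences on the imputed cells only
errors.abs().mean().round(2)                        # mean absolute error per column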

KNN Imputer

  • The KNNImputer class provides imputation for filling in missing values using the k-Nearest Neighbors approach. By default, a euclidean distance metric that supports missing values, nan_euclidean_distances, is used to find the nearest neighbors.
  • Each missing feature is imputed using values from n_neighbors nearest neighbors that have a value for the feature.
  • The feature values from the neighbors are averaged, either uniformly or weighted by distance to each neighbor.
  • When the number of available neighbors is less than n_neighbors and there are no defined distances to the training set, the training set average for that feature is used during imputation.
  • If there is at least one neighbor with a defined distance, the weighted or unweighted average of the remaining neighbors will be used during imputation.
  • If a feature is always missing in training, it is removed during transform.

Source: https://scikit-learn.org/stable/modules/impute.html#knnimpute

from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2, weights="uniform")
pd.DataFrame(imputer.fit_transform(df), columns = df.columns, index = df.index)
mpg cyl disp hp drat wt qsec vs am gear carb
rownames
Mazda RX4 21.00 6.0 160.0 110.0 3.330 2.6200 16.460 0.0 1.0 3.5 3.5
Mazda RX4 Wag 21.00 6.0 160.0 177.5 3.900 2.2275 17.020 0.0 1.0 4.0 4.0
Datsun 710 22.80 4.0 108.0 175.0 3.850 2.3200 18.610 1.0 1.0 3.5 3.5
Hornet 4 Drive 21.40 6.0 258.0 110.0 3.080 3.2150 19.440 1.0 0.0 3.0 1.0
Hornet Sportabout 18.70 6.0 360.0 175.0 3.150 3.4400 17.020 0.0 0.0 3.0 2.0
Valiant 20.30 6.0 154.2 105.0 2.760 3.4600 20.220 1.0 0.0 3.0 2.5
Duster 360 14.30 8.0 360.0 245.0 3.210 3.5700 15.840 0.0 0.0 3.0 4.0
Merc 240D 24.40 5.0 146.7 175.0 3.690 2.6475 20.060 1.0 0.5 4.0 4.5
Merc 230 22.80 5.0 140.8 175.0 3.920 3.1500 17.815 0.5 0.0 4.0 2.0
Merc 280 19.20 6.0 167.6 123.0 3.920 3.4400 18.300 1.0 0.0 3.5 4.0
Merc 280C 17.80 6.0 167.6 114.0 3.920 3.4500 18.900 1.0 0.0 3.5 3.5
Merc 450SE 12.80 8.0 275.8 212.5 3.070 4.0700 16.705 0.0 0.0 3.0 3.0
Merc 450SL 12.80 8.0 275.8 180.0 3.070 4.7470 17.650 0.0 0.0 3.0 3.0
Merc 450SLC 12.55 8.0 275.8 138.5 3.070 3.2675 18.000 0.0 0.0 3.0 3.0
Cadillac Fleetwood 10.40 8.0 472.0 205.0 2.930 5.2500 17.650 0.0 0.0 3.0 4.0
Lincoln Continental 10.40 8.0 373.9 225.0 3.000 5.4240 16.705 0.0 0.0 3.0 4.0
Chrysler Imperial 14.70 7.0 440.0 175.0 3.230 3.4475 17.420 0.0 0.0 3.0 2.5
Fiat 128 32.40 4.0 78.7 66.0 4.080 2.2000 19.400 1.0 1.0 4.0 1.0
Honda Civic 30.40 6.0 75.7 52.0 4.930 1.8850 18.950 1.0 1.0 4.0 2.0
Toyota Corolla 27.40 4.0 111.2 59.0 4.310 1.8350 19.900 1.0 1.0 4.0 1.5
Toyota Corona 12.80 6.0 120.1 97.0 3.700 2.4650 17.650 0.5 0.0 3.0 2.5
Dodge Challenger 15.50 8.0 318.0 175.0 2.955 3.5200 16.355 0.0 0.0 3.0 2.0
AMC Javelin 15.20 8.0 359.0 138.5 3.150 3.4350 17.300 0.0 0.0 3.0 2.0
Camaro Z28 12.95 8.0 296.9 245.0 3.035 4.7470 15.410 0.0 0.0 3.0 3.5
Pontiac Firebird 12.80 8.0 400.0 254.5 3.080 4.4295 17.050 0.0 0.0 3.0 2.0
Fiat X1-9 23.80 4.0 79.0 85.5 4.080 1.9350 18.900 1.0 1.0 4.0 1.5
Porsche 914-2 12.80 4.0 120.3 91.0 4.430 2.6350 18.600 0.0 0.5 5.0 2.0
Lotus Europa 12.80 6.0 95.1 113.0 3.770 1.5130 18.600 1.0 0.5 5.0 2.5
Ford Pantera L 15.80 8.0 296.9 264.0 4.220 3.1700 14.500 0.0 1.0 5.0 4.0
Ferrari Dino 17.40 6.0 145.0 175.0 3.620 4.4295 15.500 0.0 1.0 5.0 6.0
Maserati Bora 15.00 6.0 301.0 335.0 3.540 3.5700 14.600 0.0 1.0 5.0 8.0
Volvo 142E 21.40 4.0 121.0 109.0 4.110 2.7800 18.600 1.0 1.0 3.5 2.0

Let us compare imputed values to actual data.

[Figure: KNN-imputed values compared with the original data]

This is even better than the iterative imputer!!

List Comprehension and Other Useful Tricks

A list comprehension returns a list, and takes the following general format:

[ function(item) for item in iterable if condition ]

# Create an empty dataframe

import pandas as pd

df = pd.DataFrame(columns = ['Date', 'User 1', 'User 2', 'User 3', 'User 4', 'User 5', 'User 6', 'User 7'])
df

Date User 1 User 2 User 3 User 4 User 5 User 6 User 7
# List all columns in a dataframe meeting a criteria

[col for col in df if col.startswith('U')]
['User 1', 'User 2', 'User 3', 'User 4', 'User 5', 'User 6', 'User 7']
# If condition in a single line

b = 4
a = "positive" if b >= 0 else "negative"
a
'positive'

List comprehension
newlist = [expression for item in iterable if condition]

# Basic list comprehension
x = [v**2 for v in range(4)]
x
[0, 1, 4, 9]
# List comprehension

fruits = ["apple", "banana", "cherry", "kiwi", "mango"]

newlist = [x for x in fruits if "a" in x]

print(newlist)
['apple', 'banana', 'mango']
# Subsetting a dict with a dictionary comprehension
# (assumes `samples` is an existing dict containing these keys)

samples = {k: v for k, v in samples.items() if k not in ["idx", "sentence1", "sentence2"]}
# A function that checks whether any of a list of substrings appears in a text
def corona(text):
    corona_story_strings = ['covid', 'corona', 'sars', 'virus', 'coronavirus', 'vaccine']
    return any(x in text for x in corona_story_strings)
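
A short usage sketch for the function above, assuming a hypothetical DataFrame with a 'headline' text column; the data here is made up for illustration.

# Hypothetical usage: flag rows whose headline mentions any of the strings
news = pd.DataFrame({'headline': ['Markets rally as vaccine trials succeed',
                                  'Local team wins the championship']})
news['is_corona'] = news['headline'].str.lower().apply(corona)
news[news['is_corona']]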