Pandas Functions for Data Cleaning
Pandas provides various functions to clean and preprocess data, ensuring accuracy and consistency before analysis or machine learning. Below, each function is explained with attributes, examples, and tabular representations where necessary.
Handling Missing Values
1. df.isnull()
/ df.notnull()
Definition
df.isnull()
returnsTrue
for missing (NaN
) values andFalse
for non-missing values.df.notnull()
does the opposite, returningTrue
for non-missing values.
Example
import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', None],
'Age': [25, None, 30, 28]}
df = pd.DataFrame(data)
print(df.isnull()) # Identifies missing values
Output
Name | Age |
---|---|
False | False |
False | True |
False | False |
True | False |
2. df.dropna(axis, how, thresh, subset, inplace)
Definition
- Removes rows or columns with missing values.
axis=0
→ Drops rows (default).axis=1
→ Drops columns.how='any'
→ Drops if any value is missing.how='all'
→ Drops if all values are missing.thresh=n
→ Drops rows/columns with less thann
non-null values.subset=[column_names]
→ Drops based on specific columns.inplace=True
→ Modifies the DataFrame without returning a new one.
Example
df.dropna(axis=0, how='any', inplace=True)
Removes all rows containing at least one NaN
.
3. df.fillna(value, method, axis, inplace)
Definition
- Fills missing values with specified data.
value
→ Fixed value or computed statistic (e.g., mean, median).method='ffill'
→ Forward fills values from previous rows.method='bfill'
→ Backward fills values from next rows.axis=0
→ Fills column-wise.axis=1
→ Fills row-wise.inplace=True
modifies the DataFrame directly without creating a new copy.
Example
df.fillna(df.mean(), inplace=True)
df.mean()
- Computes the mean (average) of each numerical column in the DataFrame.
- Returns a series where each column’s mean value is used for filling its missing values.
Example:
A | B | C |
---|---|---|
10 | NaN | 5 |
20 | 30 | NaN |
NaN | 40 | 15 |
After applying df.fillna(df.mean(), inplace=True)
, the missing values will be replaced as follows:
A | B | C |
---|---|---|
10 | 35 | 5 |
20 | 30 | 10 |
15 | 40 | 15 |
Here:
- Mean of A = (10 + 20) / 2 = 15 → replaces
NaN
in column A. - Mean of B = (30 + 40) / 2 = 35 → replaces
NaN
in column B. - Mean of C = (5 + 15) / 2 = 10 → replaces
NaN
in column C.
4. df.interpolate(method='linear')
- Estimates missing values based on surrounding values.
- Useful for time-series or numerical data.
Example
import pandas as pd
# Creating a DataFrame with missing values
data = {'Date': pd.date_range(start='2024-03-01', periods=6, freq='D'),
'Temperature': [22, 24, None, 28, None, 30]}
df = pd.DataFrame(data)
# Applying linear interpolation
df['Temperature'] = df['Temperature'].interpolate(method='linear')
print(df)
Output (Before and After Interpolation)
Date | Temperature (Before) | Temperature (After) |
---|---|---|
2024-03-01 | 22 | 22 |
2024-03-02 | 24 | 24 |
2024-03-03 | NaN | 26 |
2024-03-04 | 28 | 28 |
2024-03-05 | NaN | 29 |
2024-03-06 | 30 | 30 |
- The missing value on 2024-03-03 is estimated as 26 (midpoint between 24 and 28).
- The missing value on 2024-03-05 is estimated as 29 (midpoint between 28 and 30).
- The interpolation assumes a straight-line progression between available values.
Handling Duplicates
1. df.duplicated(subset, keep)
Definition
- Identifies duplicate rows in a DataFrame.
subset
→ Columns to check for duplicates (default is all).keep='first'
→ Marks all but the first occurrence as duplicate.keep='last'
→ Marks all but the last occurrence.keep=False
→ Marks all occurrences as duplicate.
Example
df.duplicated()
2. df.drop_duplicates(subset, keep, inplace)
Definition
- Removes duplicate rows from a DataFrame.
- Parameters are the same as
df.duplicated()
.
Example
df.drop_duplicates(inplace=True)
Removes all duplicate rows, keeping only the first occurrence.
Handling Outliers
Outliers are extreme values that deviate significantly from the rest of the dataset. The Interquartile Range (IQR) method is used to detect and remove them.
1. df.describe()
Definition
- Generates summary statistics to identify outliers.
- Includes mean, min, max, standard deviation, quartiles.
Example
df.describe()
2. Interquartile Range (IQR) Outlier Removal
Example
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))]
Removes outliers using the IQR method.
Example:
import pandas as pd
import numpy as np
# Creating a dataset with an outlier
data = {'Scores': [50, 55, 53, 52, 1000, 51, 49, 54, 48, 50]} # 1000 is an outlier
df = pd.DataFrame(data)
# Compute Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['Scores'].quantile(0.25)
Q3 = df['Scores'].quantile(0.75)
# Compute IQR
IQR = Q3 - Q1
# Define lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Detecting outliers
outliers = df[(df['Scores'] < lower_bound) | (df['Scores'] > upper_bound)]
print("Outliers detected:\n", outliers)
# Removing outliers
df_cleaned = df[(df['Scores'] >= lower_bound) & (df['Scores'] <= upper_bound)]
print("\nData after removing outliers:\n", df_cleaned)
Output:
Outliers Detected
Scores |
---|
1000 |
Data After Removing Outliers
Scores |
---|
50 |
55 |
53 |
52 |
51 |
49 |
54 |
48 |
50 |
Standardizing and Normalizing Data
1. df['column'].str.lower()
Definition
- Converts text data to lowercase for uniformity.
Example
df['Name'] = df['Name'].str.lower()
2. df['column'] = df['column'].astype(dtype)
Definition
- Changes data type of a column.
Example
df['Age'] = df['Age'].astype(int)
Encoding Categorical Data
1. pd.get_dummies(df['column'])
Definition
- Performs One-Hot Encoding, converting categorical values into binary columns.
Example
pd.get_dummies(df['Category'])
2. df['column'].astype('category').cat.codes
Definition
- Encodes categories into numerical values.
Example
import pandas as pd
# Creating a DataFrame with categorical values
data = {'Category': ['Apple', 'Banana', 'Orange', 'Banana', 'Apple', 'Orange']}
df = pd.DataFrame(data)
# Converting categorical column to numerical codes
df['Category'] = df['Category'].astype('category').cat.codes
print(df)
Output:
Index | Category (Before) | Category (After) |
---|---|---|
0 | Apple | 0 |
1 | Banana | 1 |
2 | Orange | 2 |
3 | Banana | 1 |
4 | Apple | 0 |
5 | Orange | 2 |