ADVERTISEMENT
ADVERTISEMENT

Pandas Functions for Data Cleaning

Pandas provides various functions to clean and preprocess data, ensuring accuracy and consistency before analysis or machine learning. Below, each function is explained with attributes, examples, and tabular representations where necessary.

Handling Missing Values

1. df.isnull() / df.notnull()

Definition

  • df.isnull() returns True for missing (NaN) values and False for non-missing values.
  • df.notnull() does the opposite, returning True for non-missing values.

Example

import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', None],
        'Age': [25, None, 30, 28]}

df = pd.DataFrame(data)
print(df.isnull())  # Identifies missing values

Output

Name Age
False False
False True
False False
True False

2. df.dropna(axis, how, thresh, subset, inplace)

Definition

  • Removes rows or columns with missing values.
  • axis=0 → Drops rows (default).
  • axis=1 → Drops columns.
  • how='any' → Drops if any value is missing.
  • how='all' → Drops if all values are missing.
  • thresh=n → Drops rows/columns with less than n non-null values.
  • subset=[column_names] → Drops based on specific columns.
  • inplace=True → Modifies the DataFrame without returning a new one.

Example

df.dropna(axis=0, how='any', inplace=True)

Removes all rows containing at least one NaN.

3. df.fillna(value, method, axis, inplace)

Definition

  • Fills missing values with specified data.
  • value → Fixed value or computed statistic (e.g., mean, median).
  • method='ffill' → Forward fills values from previous rows.
  • method='bfill' → Backward fills values from next rows.
  • axis=0 → Fills column-wise.
  • axis=1 → Fills row-wise.
  • inplace=True modifies the DataFrame directly without creating a new copy.

Example

df.fillna(df.mean(), inplace=True)

df.mean()

  • Computes the mean (average) of each numerical column in the DataFrame.
  • Returns a series where each column’s mean value is used for filling its missing values.

Example:

A B C
10 NaN 5
20 30 NaN
NaN 40 15

After applying df.fillna(df.mean(), inplace=True), the missing values will be replaced as follows:

A B C
10 35 5
20 30 10
15 40 15

Here:

  • Mean of A = (10 + 20) / 2 = 15 → replaces NaN in column A.
  • Mean of B = (30 + 40) / 2 = 35 → replaces NaN in column B.
  • Mean of C = (5 + 15) / 2 = 10 → replaces NaN in column C.

4. df.interpolate(method='linear')

  • Estimates missing values based on surrounding values.
  • Useful for time-series or numerical data.

 Example

import pandas as pd

# Creating a DataFrame with missing values
data = {'Date': pd.date_range(start='2024-03-01', periods=6, freq='D'),
        'Temperature': [22, 24, None, 28, None, 30]}

df = pd.DataFrame(data)

# Applying linear interpolation
df['Temperature'] = df['Temperature'].interpolate(method='linear')

print(df)

Output (Before and After Interpolation)

Date Temperature (Before) Temperature (After)
2024-03-01 22 22
2024-03-02 24 24
2024-03-03 NaN 26
2024-03-04 28 28
2024-03-05 NaN 29
2024-03-06 30 30
  • The missing value on 2024-03-03 is estimated as 26 (midpoint between 24 and 28).
  • The missing value on 2024-03-05 is estimated as 29 (midpoint between 28 and 30).
  • The interpolation assumes a straight-line progression between available values.

Handling Duplicates

1. df.duplicated(subset, keep)

Definition

  • Identifies duplicate rows in a DataFrame.
  • subset → Columns to check for duplicates (default is all).
  • keep='first' → Marks all but the first occurrence as duplicate.
  • keep='last' → Marks all but the last occurrence.
  • keep=False → Marks all occurrences as duplicate.

Example

df.duplicated()

2. df.drop_duplicates(subset, keep, inplace)

Definition

  • Removes duplicate rows from a DataFrame.
  • Parameters are the same as df.duplicated().

Example

df.drop_duplicates(inplace=True)

Removes all duplicate rows, keeping only the first occurrence.

Handling Outliers

Outliers are extreme values that deviate significantly from the rest of the dataset. The Interquartile Range (IQR) method is used to detect and remove them.

1. df.describe()

Definition

  • Generates summary statistics to identify outliers.
  • Includes mean, min, max, standard deviation, quartiles.

Example

df.describe()

2. Interquartile Range (IQR) Outlier Removal

Example

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR)))]
Removes outliers using the IQR method.

Example:

import pandas as pd
import numpy as np

# Creating a dataset with an outlier
data = {'Scores': [50, 55, 53, 52, 1000, 51, 49, 54, 48, 50]}  # 1000 is an outlier
df = pd.DataFrame(data)

# Compute Q1 (25th percentile) and Q3 (75th percentile)
Q1 = df['Scores'].quantile(0.25)
Q3 = df['Scores'].quantile(0.75)

# Compute IQR
IQR = Q3 - Q1

# Define lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Detecting outliers
outliers = df[(df['Scores'] < lower_bound) | (df['Scores'] > upper_bound)]
print("Outliers detected:\n", outliers)

# Removing outliers
df_cleaned = df[(df['Scores'] >= lower_bound) & (df['Scores'] <= upper_bound)]
print("\nData after removing outliers:\n", df_cleaned)

Output:

Outliers Detected

Scores
1000

Data After Removing Outliers

Scores
50
55
53
52
51
49
54
48
50

Standardizing and Normalizing Data

1. df['column'].str.lower()

Definition

  • Converts text data to lowercase for uniformity.

Example

df['Name'] = df['Name'].str.lower()

2. df['column'] = df['column'].astype(dtype)

Definition

  • Changes data type of a column.

Example

df['Age'] = df['Age'].astype(int)

Encoding Categorical Data

1. pd.get_dummies(df['column'])

Definition

  • Performs One-Hot Encoding, converting categorical values into binary columns.

Example

pd.get_dummies(df['Category'])

2. df['column'].astype('category').cat.codes

Definition

  • Encodes categories into numerical values.

Example

import pandas as pd

# Creating a DataFrame with categorical values
data = {'Category': ['Apple', 'Banana', 'Orange', 'Banana', 'Apple', 'Orange']}
df = pd.DataFrame(data)

# Converting categorical column to numerical codes
df['Category'] = df['Category'].astype('category').cat.codes

print(df)

Output:

Index Category (Before) Category (After)
0 Apple 0
1 Banana 1
2 Orange 2
3 Banana 1
4 Apple 0
5 Orange 2

ADVERTISEMENT

ADVERTISEMENT