Encoding Categorical Variables in Machine Learning

Categorical variables are data that represent labels or categories, such as colors (Red, Blue, Green) or cities (Delhi, Mumbai, Kolkata). Machine learning models cannot process categorical data directly, so we need to convert them into numerical values. This process is called encoding.

Types of Categorical Encoding

  1. Label Encoding
  2. One-Hot Encoding
  3. Ordinal Encoding
  4. Binary Encoding
  5. Frequency Encoding
  6. Target Encoding

1. Label Encoding

Label Encoding assigns each unique category an integer value. It is simple, but the assigned integers can imply an ordering between categories that does not exist in the data.

Example:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Creating a dataset
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Applying Label Encoding
encoder = LabelEncoder()
df['Color_Encoded'] = encoder.fit_transform(df['Color'])

print(df)

Output:

Index  Color  Color_Encoded
0      Red    2
1      Blue   0
2      Green  1
3      Blue   0
4      Red    2

2. One-Hot Encoding

One-hot encoding converts categorical values into separate binary columns (0s and 1s). Each category gets its own column, and the value is 1 where the category exists, otherwise 0.

Example:

import pandas as pd

# Creating a dataset
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})


# Applying One-Hot Encoding (dtype=int gives 0/1 columns instead of True/False)
df_encoded = pd.get_dummies(df, columns=['Color'], dtype=int)
print(df_encoded)

Output (After One-Hot Encoding):

Index  Color_Blue  Color_Green  Color_Red
0      0           0            1
1      1           0            0
2      0           1            0
3      1           0            0
4      0           0            1

3. Ordinal Encoding

Ordinal encoding is a technique used to convert categorical variables into numerical values while preserving their order or ranking. Unlike label encoding, ordinal encoding is used when the categories have a logical sequence but no fixed numerical difference between them.

Example of Ordinal Categories:

  • Size: Small < Medium < Large
  • Education Level: High School < Bachelor’s < Master’s < PhD
  • Customer Satisfaction: Poor < Average < Good < Excellent

In ordinal encoding, each category is assigned a numerical value in ascending order. For example:

  • Size:
    • Small → 0
    • Medium → 1
    • Large → 2

Example:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# Defining order
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_Encoded'] = encoder.fit_transform(df[['Size']])

print(df)

Output:

Index  Size    Size_Encoded
0      Small   0.0
1      Medium  1.0
2      Large   2.0
3      Medium  1.0
4      Small   0.0

Note that OrdinalEncoder returns floating-point values, so the encoded column contains 0.0, 1.0, 2.0 rather than integers.

4. Binary Encoding

Binary encoding is a categorical encoding technique that first converts categories into numerical values and then represents them in binary format. Each binary digit is stored in a separate column, reducing dimensionality compared to One-Hot Encoding.

How Does Binary Encoding Work?

  1. Convert categories into numbers
    • Example: "Apple" = 1, "Banana" = 2, "Cherry" = 3
  2. Convert numbers into binary format
    • 1 → 001
    • 2 → 010
    • 3 → 011
  3. Split binary digits into separate columns
    • The number of columns depends on the number of bits required to represent the highest category.

Advantages of Binary Encoding:

  • Reduces dimensionality compared to One-Hot Encoding.
  • Works well with high-cardinality categorical data (many unique categories).
  • Helps in reducing collinearity between encoded variables.

Example:

import pandas as pd
import category_encoders as ce

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Kolkata', 'Delhi', 'Mumbai']})

# Applying Binary Encoding
encoder = ce.BinaryEncoder(cols=['City'])
df = encoder.fit_transform(df)

print(df)

Output Table:

Index  City_0  City_1
0      0       1
1      1       0
2      1       1
3      0       1
4      1       0

5. Frequency Encoding

Frequency Encoding (also called Count Encoding) is a technique that replaces each category with its frequency of occurrence in the dataset. Instead of assigning arbitrary numbers, it uses the proportion of times a category appears as its value.

How Does Frequency Encoding Work?

  1. Count the occurrences of each category.
  2. Divide by the total number of observations to get the relative frequency.
  3. Replace the category with this frequency value.

Example:

If we have a "Fruit" column with values:

  • Apple (3 times)
  • Banana (2 times)
  • Orange (1 time)

And the total dataset has 6 rows, then:

  • Apple → 3/6 = 0.50
  • Banana → 2/6 ≈ 0.33
  • Orange → 1/6 ≈ 0.17
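The steps above can be sketched in pandas. This is a minimal example using the Fruit data from the worked example; `value_counts(normalize=True)` performs the count-and-divide step in one call:

```python
import pandas as pd

# Dataset matching the worked example: Apple x3, Banana x2, Orange x1
df = pd.DataFrame({'Fruit': ['Apple', 'Apple', 'Apple',
                             'Banana', 'Banana', 'Orange']})

# Relative frequency of each category (normalize=True divides counts by row total)
freq = df['Fruit'].value_counts(normalize=True)

# Replace each category with its frequency
df['Fruit_Encoded'] = df['Fruit'].map(freq)

print(df)
```

In practice, compute `freq` on the training set only and map it onto the test set, to avoid the data-leakage issue noted below.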

Advantages of Frequency Encoding:

  • Works well with high-cardinality categorical variables (many unique categories).
  • Reduces dimensionality compared to One-Hot Encoding.
  • Keeps some information about category importance.

Limitations:

  • May introduce data leakage if applied before splitting the dataset into training and test sets.
  • Can cause misinterpretation if categories have the same frequency but different meanings.

6. Target Encoding

What is Target Encoding?

Target Encoding replaces categorical values with the mean of the target variable for each category. Instead of assigning arbitrary numbers, it uses information from the dependent variable (target) to encode categorical features.

How Does Target Encoding Work?

  1. Group data by the categorical column.
  2. Calculate the mean value of the target variable for each category.
  3. Replace each category with this calculated mean.

Example:

If we have a dataset with "City" and "Sales" as columns:

City     Sales
Delhi    200
Mumbai   150
Kolkata  180
Delhi    210
Mumbai   160

  • Mean Sales for each City:
    • Delhi → (200 + 210) / 2 = 205.0
    • Mumbai → (150 + 160) / 2 = 155.0
    • Kolkata → 180 (only one row) = 180.0
  • After target encoding, "City" is replaced with:

City     Sales  City_Encoded
Delhi    200    205.0
Mumbai   150    155.0
Kolkata  180    180.0
Delhi    210    205.0
Mumbai   160    155.0
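A minimal pandas sketch of these steps, using the City/Sales data from the table above. A plain groupby-mean is one straightforward way to compute per-category target means; dedicated implementations (e.g. `TargetEncoder` in the category_encoders library) add smoothing to handle rare categories:

```python
import pandas as pd

# Dataset from the worked example above
df = pd.DataFrame({
    'City':  ['Delhi', 'Mumbai', 'Kolkata', 'Delhi', 'Mumbai'],
    'Sales': [200, 150, 180, 210, 160],
})

# Mean of the target (Sales) for each category of City
city_means = df.groupby('City')['Sales'].mean()

# Replace each city with its mean target value
df['City_Encoded'] = df['City'].map(city_means)

print(df)
```

As with frequency encoding, the means should be computed on the training set only and then mapped onto unseen data, otherwise the target leaks into the features.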
