Encoding Categorical Variables in Machine Learning
Categorical variables are data that represent labels or categories, such as colors (Red, Blue, Green) or cities (Delhi, Mumbai, Kolkata). Machine learning models cannot process categorical data directly, so we need to convert them into numerical values. This process is called encoding.
Types of Categorical Encoding
- Label Encoding
- One-Hot Encoding
- Ordinal Encoding
- Binary Encoding
- Frequency Encoding
- Target Encoding
1. Label Encoding
Label Encoding assigns each unique category a numerical value. It is simple but can create a false sense of order.
Example:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Creating a dataset
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Applying Label Encoding
encoder = LabelEncoder()
df['Color_Encoded'] = encoder.fit_transform(df['Color'])
print(df)
```
Output:
| Index | Color | Color_Encoded |
|---|---|---|
| 0 | Red | 2 |
| 1 | Blue | 0 |
| 2 | Green | 1 |
| 3 | Blue | 0 |
| 4 | Red | 2 |
2. One-Hot Encoding
One-hot encoding converts categorical values into separate binary columns (0s and 1s). Each category gets its own column, and the value is 1 where the category exists, otherwise 0.
Example:

```python
import pandas as pd

# Creating a dataset
df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']})

# Applying One-Hot Encoding (dtype=int gives 0/1 columns instead of True/False)
df_encoded = pd.get_dummies(df, columns=['Color'], dtype=int)
print(df_encoded)
```
Output (After One-Hot Encoding):
| Index | Color_Blue | Color_Green | Color_Red |
|---|---|---|---|
| 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 1 | 0 | 0 |
| 4 | 0 | 0 | 1 |
3. Ordinal Encoding
Ordinal encoding converts categorical variables into numerical values while preserving their order or ranking. Unlike label encoding, which assigns values in arbitrary (typically alphabetical) order, ordinal encoding lets you specify the order explicitly. It is used when the categories have a logical sequence but no fixed numerical difference between them.
Example of Ordinal Categories:
- Size: Small < Medium < Large
- Education Level: High School < Bachelor’s < Master’s < PhD
- Customer Satisfaction: Poor < Average < Good < Excellent
In ordinal encoding, each category is assigned a numerical value in ascending order. For example:
- Size:
- Small → 0
- Medium → 1
- Large → 2
Example:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'Medium', 'Small']})

# Defining the category order explicitly
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
df['Size_Encoded'] = encoder.fit_transform(df[['Size']]).astype(int)
print(df)
```

Output:
| Index | Size | Size_Encoded |
|---|---|---|
| 0 | Small | 0 |
| 1 | Medium | 1 |
| 2 | Large | 2 |
| 3 | Medium | 1 |
| 4 | Small | 0 |
4. Binary Encoding
Binary encoding is a categorical encoding technique that first converts categories into numerical values and then represents them in binary format. Each binary digit is stored in a separate column, reducing dimensionality compared to One-Hot Encoding.
How Binary Encoding Works
- Convert categories into numbers, e.g. "Apple" = 1, "Banana" = 2, "Cherry" = 3.
- Convert the numbers into binary format: 1 → 001, 2 → 010, 3 → 011.
- Split the binary digits into separate columns. The number of columns depends on the number of bits required to represent the highest category number.
Advantages of Binary Encoding:
- Reduces dimensionality compared to One-Hot Encoding.
- Works well with high-cardinality categorical data (many unique categories).
- Helps in reducing collinearity between encoded variables.
Example:

```python
import category_encoders as ce
import pandas as pd

df = pd.DataFrame({'City': ['Delhi', 'Mumbai', 'Kolkata', 'Delhi', 'Mumbai']})

# Applying Binary Encoding
encoder = ce.BinaryEncoder(cols=['City'])
df = encoder.fit_transform(df)
print(df)
```
Output Table:
| Index | City_0 | City_1 |
|---|---|---|
| 0 | 0 | 1 |
| 1 | 1 | 0 |
| 2 | 1 | 1 |
| 3 | 0 | 1 |
| 4 | 1 | 0 |
5. Frequency Encoding
Frequency Encoding (also called Count Encoding) is a technique that replaces each category with its frequency of occurrence in the dataset. Instead of assigning arbitrary numbers, it uses the proportion of times a category appears as its value.
How Frequency Encoding Works
- Count the occurrences of each category.
- Divide by the total number of observations to get the relative frequency.
- Replace the category with this frequency value.
Example:
If we have a "Fruit" column with values:
- Apple (3 times)
- Banana (2 times)
- Orange (1 time)
And the total dataset has 6 rows, then:
- Apple → 3/6 = 0.50
- Banana → 2/6 = 0.33
- Orange → 1/6 = 0.17
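These steps can be sketched with plain pandas (the "Fruit" column below mirrors the counts in the example above; the column name `Fruit_Encoded` is illustrative):

```python
import pandas as pd

# Dataset matching the example: Apple x3, Banana x2, Orange x1
df = pd.DataFrame({'Fruit': ['Apple', 'Apple', 'Apple',
                             'Banana', 'Banana', 'Orange']})

# Relative frequency of each category (counts divided by total rows)
freq = df['Fruit'].value_counts(normalize=True)

# Replace each category with its frequency
df['Fruit_Encoded'] = df['Fruit'].map(freq)
print(df)
```

Note that `value_counts(normalize=True)` performs the count-and-divide steps in one call.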
Advantages of Frequency Encoding:
- Works well with high-cardinality categorical variables (many unique categories).
- Reduces dimensionality compared to One-Hot Encoding.
- Keeps some information about category importance.
Limitations:
- May introduce data leakage if applied before splitting the dataset into training and test sets.
- Can cause misinterpretation when different categories happen to have the same frequency, since they receive identical encoded values.
6. Target Encoding
Target Encoding replaces categorical values with the mean of the target variable for each category. Instead of assigning arbitrary numbers, it uses information from the dependent variable (target) to encode categorical features.
How Target Encoding Works
- Group data by the categorical column.
- Calculate the mean value of the target variable for each category.
- Replace each category with this calculated mean.
Example:
If we have a dataset with "City" and "Sales" as columns:
| City | Sales |
|---|---|
| Delhi | 200 |
| Mumbai | 150 |
| Kolkata | 180 |
| Delhi | 210 |
| Mumbai | 160 |
Mean Sales for each City:
- Delhi → (200 + 210) / 2 = 205.0
- Mumbai → (150 + 160) / 2 = 155.0
- Kolkata → 180 / 1 = 180.0
After target encoding, "City" is replaced with:
| City | Sales | City_Encoded |
|---|---|---|
| Delhi | 200 | 205.0 |
| Mumbai | 150 | 155.0 |
| Kolkata | 180 | 180.0 |
| Delhi | 210 | 205.0 |
| Mumbai | 160 | 155.0 |
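The table above can be reproduced with a short pandas sketch (data taken from the example; the column name `City_Encoded` is illustrative). In practice the category means should be computed on the training set only and then mapped onto the test set, to avoid target leakage:

```python
import pandas as pd

# Dataset from the example above
df = pd.DataFrame({
    'City': ['Delhi', 'Mumbai', 'Kolkata', 'Delhi', 'Mumbai'],
    'Sales': [200, 150, 180, 210, 160],
})

# Mean of the target variable for each category
city_means = df.groupby('City')['Sales'].mean()

# Replace each category with its mean target value
df['City_Encoded'] = df['City'].map(city_means)
print(df)
```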