Encoding Categorical Data
Label-Encoding and One-Hot-Encoding in Python

Hi everyone! Hanh here. I'm a data scientist with a master's degree in business and informatics. I am also interested in software development and web development, because I love to build beautiful and functional things. I am still learning and would like to share what I learn.
Encoding categorical variables is an important preprocessing step when building a statistical model in data science, because most machine learning algorithms do not work (well) with categorical data directly and expect numeric input instead. Categorical data could be a feedback type (very poor, poor, satisfactory, good, very good) or a region (Europe, Asia, Americas, Oceania, Africa). The process of converting a categorical variable into a numeric variable is called encoding.
There are many ways to encode categorical data, but the most popular are label encoding and one-hot encoding, which I will focus on in this article. First, let's generate some data to use with both.
from sklearn import preprocessing
import pandas as pd
# initialize list of lists: [gender, age, education]
data = [
    ['male', 40, 'Basic'],
    ['female', 30, 'Master'],
    ['female', 22, 'Bachelor'],
    ['male', 60, 'Phd'],
    ['female', 44, 'Bachelor'],
]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns=['Gender', 'Age', 'Education'])
# print dataframe.
print(df)
print(df.dtypes)
   Gender  Age Education
0    male   40     Basic
1  female   30    Master
2  female   22  Bachelor
3    male   60       Phd
4  female   44  Bachelor
Gender       object
Age           int64
Education    object
1. Label Encoding
Label encoding simply converts each unique label into a number, assigned in alphabetical order of the labels. There are two ways to achieve this:
- approach 1: using category codes in pandas
- approach 2: using the LabelEncoder class from the scikit-learn library
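To make the alphabetical rule concrete, here is a minimal sketch in plain Python that builds the label-to-code mapping by hand. The names `labels`, `mapping`, and `codes` are illustrative only and are not part of pandas or scikit-learn.

```python
# Hand-rolled label encoding (illustration only; all names are hypothetical)
labels = ['Basic', 'Master', 'Bachelor', 'Phd', 'Bachelor']

# Sort the unique labels alphabetically and number them starting from 0
mapping = {label: code for code, label in enumerate(sorted(set(labels)))}

# Replace each label with its code
codes = [mapping[label] for label in labels]

print(mapping)  # {'Bachelor': 0, 'Basic': 1, 'Master': 2, 'Phd': 3}
print(codes)    # [1, 2, 0, 3, 0]
```

Both library approaches should produce these same codes for the 'Education' column.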
Approach 1: Category Codes
# Convert object type to category type
df['Education'] = df['Education'].astype('category')
df['Education_C'] = df['Education'].cat.codes
print(df.head())
   Gender  Age Education  Education_C
0    male   40     Basic            1
1  female   30    Master            2
2  female   22  Bachelor            0
3    male   60       Phd            3
4  female   44  Bachelor            0
Approach 2: LabelEncoder
# instantiate LabelEncoder()
label_encoder = preprocessing.LabelEncoder()
# Encode labels in column 'Education'.
df['Education_N'] = label_encoder.fit_transform(df['Education'])
print(df.head())
   Gender  Age Education  Education_C  Education_N
0    male   40     Basic            1            1
1  female   30    Master            2            2
2  female   22  Bachelor            0            0
3    male   60       Phd            3            3
4  female   44  Bachelor            0            0
We see that both approaches give the same conversion from label to number. However, the codes are assigned alphabetically and say nothing about how the categories relate to each other, yet label encoding introduces an order relationship (here Bachelor < Basic < Master < Phd). Will the model treat a higher label value as "better"? It may well do so.
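When a column does have a natural order, one possible way to keep the codes consistent with it is to state the order explicitly with an ordered pandas categorical. This is only a sketch: the order Basic < Bachelor < Master < Phd and the names `education`, `education_order`, and `ordered_codes` are assumptions made for illustration.

```python
import pandas as pd

education = pd.Series(['Basic', 'Master', 'Bachelor', 'Phd', 'Bachelor'])

# Assumed ranking for illustration: lowest level first
education_order = ['Basic', 'Bachelor', 'Master', 'Phd']

# Codes now follow the given order instead of the alphabet
ordered = pd.Categorical(education, categories=education_order, ordered=True)
ordered_codes = ordered.codes.tolist()

print(ordered_codes)  # [0, 2, 1, 3, 1]
```

For categories with no meaningful order (such as regions), this does not help, which is where one-hot encoding comes in.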
2. One-Hot Encoding
With one-hot encoding, we introduce one new binary dummy variable (0 or 1) per unique label, indicating whether that label appears in a given row (1) or not (0). After encoding, each unique category of the feature 'Education' is represented as a separate feature. As with label encoding, we can generate a one-hot encoding in two ways:
- approach 1: using get_dummies from pandas
- approach 2: using the OneHotEncoder class from the scikit-learn library
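Before turning to the libraries, the idea of dummy variables can be sketched in plain Python. This is a hand-rolled illustration, not how pandas or scikit-learn implement it, and the names `labels`, `categories`, and `one_hot_rows` are hypothetical.

```python
# Hand-rolled one-hot encoding (illustration only)
labels = ['Basic', 'Master', 'Bachelor', 'Phd', 'Bachelor']

# One dummy column per unique category, in alphabetical order
categories = sorted(set(labels))  # ['Bachelor', 'Basic', 'Master', 'Phd']

# For each row, put 1 in the column matching its label and 0 elsewhere
one_hot_rows = [
    [1 if label == category else 0 for category in categories]
    for label in labels
]

for label, row in zip(labels, one_hot_rows):
    print(label, row)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.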
# 2.1 Approach 1: get_dummies
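# dtype=int keeps the 0/1 output shown below (newer pandas versions default to booleans)
dummy_df = pd.get_dummies(df['Education'], dtype=int)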
print(dummy_df)
   Bachelor  Basic  Master  Phd
0         0      1       0    0
1         0      0       1    0
2         1      0       0    0
3         0      0       0    1
4         1      0       0    0
# 2.2 Approach 2: OneHotEncoder()
# instantiate OneHotEncoder()
onehotencoder = preprocessing.OneHotEncoder()
# encode labels in column 'Education'
# (OneHotEncoder expects 2D input, hence the double brackets)
one_hot_df = onehotencoder.fit_transform(df[['Education']]).toarray()
print(one_hot_df)
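The encoder returns a plain array without column names. As a hedged sketch of one way to get a labelled table, the fitted encoder's `categories_` attribute can supply the headers. The names `df_demo`, `encoder`, and `one_hot_demo` are illustrative, and the small DataFrame is rebuilt here only to keep the example self-contained.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df_demo = pd.DataFrame(
    {'Education': ['Basic', 'Master', 'Bachelor', 'Phd', 'Bachelor']}
)

encoder = OneHotEncoder()
# Double brackets pass a 2D DataFrame; toarray() converts the sparse result
encoded = encoder.fit_transform(df_demo[['Education']]).toarray()

# categories_ holds one array of category names per encoded column
columns = list(encoder.categories_[0])
one_hot_demo = pd.DataFrame(encoded, columns=columns).astype(int)

print(one_hot_demo)
```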
One-hot encoding overcomes the shortcoming of label encoding because it does not impose any ordinal ranking on the labels. However, its disadvantage is the large number of additional columns (one per unique category), which increases memory consumption and can lead to the 'curse of dimensionality'.
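To get a feel for how quickly the columns add up, here is a small sketch with a made-up high-cardinality column. The `cities` data is hypothetical and chosen so that every row has its own category.

```python
import pandas as pd

# Hypothetical column: 1000 rows, each with a distinct category
cities = pd.Series([f'city_{i}' for i in range(1000)])

# Label encoding keeps a single column of codes
label_encoded = cities.astype('category').cat.codes

# One-hot encoding creates one column per unique category
one_hot_encoded = pd.get_dummies(cities)

print(label_encoded.shape)    # (1000,)
print(one_hot_encoded.shape)  # (1000, 1000)
```

With only four education levels the extra columns are harmless, but for features with hundreds or thousands of categories the encoded table grows accordingly.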



