In this tutorial we will discuss label encoding in Python.
Table of Contents
- Introduction
- Label encoding explained
- Advantages and disadvantages of label encoding
- Label encoding in Python
- Conclusion
Introduction
In data science, we often work with datasets that contain categorical variables whose values are stored as strings. For example, in datasets for salary estimation we often see the job title entered in words: Manager, Director, Vice-President, President, and so on. The complication is that most machine learning algorithms can use categorical features only once they have been converted to numeric form.
There are multiple ways to solve this problem, and a lot depends on the algorithm you will be working with and on how sensitive it is to the ranges and distributions of numerical features.
Two of the most common approaches are:
- Label Encoding
- One-Hot Encoding
Both techniques convert categorical/text data to numeric format. Both are valid solutions with their own benefits and costs. In this article we will focus on label encoding and its variations. We will also outline cases when it should and shouldn’t be applied.
To follow this tutorial we will need two Python libraries: scikit-learn (imported in code as sklearn) and pandas. If you don’t have them installed, please open “Command Prompt” (on Windows) and install them using the following commands:
pip install scikit-learn
pip install pandas
Label encoding explained
To get a sense of how label encoding works, let’s take a look at the following dataset:
$$
\begin{matrix}
\begin{array}{c|c}
\text{Position} & \text{Salary} \\
\hline
\text{Customer Service} & 44,000 \\
\text{Manager} & 75,000 \\
\text{Assistant Manager} & 65,000 \\
\text{Director} & 90,000 \\
\end{array}
\end{matrix}
$$
Assume this is the data that we would like to feed into some machine learning algorithm. Every row represents a position that an individual holds and the corresponding annual salary.
The “Position” feature is all text, and it is what we need to convert into a model-friendly numeric format.
The question that arises is: how do we assign numeric values to text categorical data?
Note: for the purposes of this article, assume the numbers we can assign range from 0 to \(+\infty\), with 0 being the smallest.
Here are a few possible ways we can assign numeric values to the “Position” feature:
Option 1: Using current order
$$
\begin{matrix}
\begin{array}{c|c|c}
\text{Position} & \text{Salary} & \text{code} \\
\hline
\text{Customer Service} & 44,000 & 0\\
\text{Manager} & 75,000 & 1 \\
\text{Assistant Manager} & 65,000 & 2 \\
\text{Director} & 90,000 & 3 \\
\end{array}
\end{matrix}
$$
Option 2: Using alphabetical order
$$
\begin{matrix}
\begin{array}{c|c|c}
\text{Position} & \text{Salary} & \text{code} \\
\hline
\text{Customer Service} & 44,000 & 1 \\
\text{Manager} & 75,000 & 3\\
\text{Assistant Manager} & 65,000 & 0 \\
\text{Director} & 90,000 & 2\\
\end{array}
\end{matrix}
$$
Option 3: Using “Salary” feature order
$$
\begin{matrix}
\begin{array}{c|c|c}
\text{Position} & \text{Salary} & \text{code} \\
\hline
\text{Customer Service} & 44,000 & 0 \\
\text{Manager} & 75,000 & 2 \\
\text{Assistant Manager} & 65,000 & 1 \\
\text{Director} & 90,000 & 3 \\
\end{array}
\end{matrix}
$$
Which one is correct to use?
There is no definite answer. It depends on the algorithm you are going to feed these features to, and on your dataset.
For example, if you are going to use simple linear regression (OLS) to estimate an individual’s salary as a function of their position, you should only use option 3. Here is why: once you convert this feature to a numerical format, the algorithm has no knowledge of the structure of your hierarchy. It treats the codes as ordinary numbers and assumes the following: 0<1<2<3.
If you followed option 1, the algorithm would see “Assistant Manager” (code 2) as ranking above “Manager” (code 1).
If you followed option 2, the algorithm would see “Customer Service” (code 1) as ranking above “Assistant Manager” (code 0).
This is not what you want. It can produce misleading estimates and false predictions (which may even appear statistically significant), purely because the algorithm reads the numeric feature as an ordered sequence from smallest to largest.
Now, the interesting part comes when you decide to implement this in Python. There are a handful of ways to achieve the same result, and we will discuss a few of them below.
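To make the effect of the ordering concrete, here is a minimal sketch that compares how well a one-feature linear fit explains “Salary” under each of the three codings from the tables above. It uses NumPy (installed together with pandas); the dictionary keys and variable names are illustrative only, and with just four rows this is a toy comparison rather than a real evaluation.

```python
import numpy as np

# Salaries in the row order of the example table:
# Customer Service, Manager, Assistant Manager, Director
salary = np.array([44000, 75000, 65000, 90000])

# The three candidate codings, listed in the same row order
codings = {
    "option 1 (current order)": np.array([0, 1, 2, 3]),
    "option 2 (alphabetical)": np.array([1, 3, 0, 2]),
    "option 3 (salary order)": np.array([0, 2, 1, 3]),
}

# For an OLS fit with one feature and an intercept,
# R^2 equals the squared Pearson correlation between feature and target
r_squared = {
    name: float(np.corrcoef(codes, salary)[0, 1] ** 2)
    for name, codes in codings.items()
}

for name, r2 in r_squared.items():
    print(f"{name}: R^2 = {r2:.3f}")
```

On this toy data the salary-ordered coding fits best (R^2 of roughly 0.98), followed by the current order (roughly 0.73) and the alphabetical order (roughly 0.26).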
Advantages and disadvantages of label encoding
It is important to understand the benefits and drawbacks of label encoding and also consider other available encoding techniques.
Advantages:
- It is easy to implement and interpret
- It is visually user friendly
- It works best with a small number of unique categorical values
Disadvantages:
- It can skew the estimation results if an algorithm is very sensitive to feature magnitude (like SVM). In such cases you may consider standardizing or normalizing the values after encoding.
- It can skew the estimation results if there is a large number of unique categorical values. In our case there were 4, but with 10 or more you should keep this in mind. In such cases you should look into other encoding techniques, for example one-hot encoding.
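For comparison, one-hot encoding of the same “Position” feature can be sketched with pandas’ get_dummies. This is only a brief illustration of the alternative mentioned above, not a full treatment; the prefix value and the variable name one_hot are arbitrary choices.

```python
import pandas as pd

df = pd.DataFrame({
    'Position': ['Customer Service', 'Manager', 'Assistant Manager', 'Director'],
    'Salary': [44000, 75000, 65000, 90000],
})

# One indicator column per category; no ordering is implied between positions
one_hot = pd.get_dummies(df['Position'], prefix='Position')

print(one_hot.columns.tolist())
print(one_hot.shape)
```

Each row has exactly one active indicator, so the model sees four separate features instead of a single ordered number. The cost is one extra column per unique category.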
Label Encoding in Python
In this part we will cover a few different ways to do label encoding in Python.
Two of the most popular approaches:
- LabelEncoder() from scikit-learn library
- pandas.factorize() from pandas library
Once the libraries are downloaded and installed, we can proceed with Python code implementation.
Step 1: Create a dataframe with the required data
import pandas as pd
# One row per position with its annual salary
data = {
    'Position': ['Customer Service', 'Manager', 'Assistant Manager', 'Director'],
    'Salary': [44000, 75000, 65000, 90000],
}
df = pd.DataFrame(data)
First we import the pandas library, as it is required to create a pandas dataframe. Then we create a Python dictionary data and convert it to a dataframe df.
Let’s take a look at the result:
print(df)
Output:
Position Salary
0 Customer Service 44000
1 Manager 75000
2 Assistant Manager 65000
3 Director 90000
Step 2.1: Label encoding in Python using current order
df['code'] = pd.factorize(df['Position'])[0]
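We create a new feature “code” that holds the categorical feature “Position” in numerical format. pd.factorize() returns a pair (codes, unique values), so we take element [0] to get the codes.
By default, the numbers in “code” are assigned in order of first appearance, so they follow the row order of the original dataframe df: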
print(df)
Output:
Position Salary code
0 Customer Service 44000 0
1 Manager 75000 1
2 Assistant Manager 65000 2
3 Director 90000 3
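The second element returned by pd.factorize() is worth a look as well, since it lets us map codes back to labels. The sketch below uses a standalone Series with one extra “Manager” row, added here purely for illustration, to show that repeated values share a code.

```python
import pandas as pd

# The extra 'Manager' at the end is hypothetical, added to show a repeated value
positions = pd.Series(['Customer Service', 'Manager', 'Assistant Manager',
                       'Director', 'Manager'])

# factorize returns (codes, uniques); codes[i] is the index of positions[i] in uniques
codes, uniques = pd.factorize(positions)

# Indexing uniques with the codes recovers the original labels
decoded = uniques[codes]

print(codes.tolist())
print(list(uniques))
print(list(decoded))
```

Keeping uniques around is what makes the encoding reversible later, for example when reporting predictions by position name.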
Step 2.2: Label encoding in Python using alphabetical order
This case is a little more interesting, as we can achieve the same result using either of the methods mentioned earlier.
scikit-learn method
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['code'] = le.fit_transform(df['Position'])
We first import LabelEncoder from the scikit-learn library and define le as its instance. Then we apply it to the “Position” feature to convert it to numerical format and store the result as a new feature “code”.
What’s interesting about this method is that by default LabelEncoder() orders values in alphabetical order without us having to specify anything.
Let’s take a look at what we arrived at:
print(df)
Output:
Position Salary code
0 Customer Service 44000 1
1 Manager 75000 3
2 Assistant Manager 65000 0
3 Director 90000 2
LabelEncoder() sorted the values of the “Position” feature and assigned the codes 0 to 3 in the following sequence: Assistant Manager, Customer Service, Director, Manager.
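To see exactly which code went to which label, we can inspect the fitted encoder. The sketch below uses the classes_ attribute and the inverse_transform() method of LabelEncoder; the names mapping and decoded are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

positions = ['Customer Service', 'Manager', 'Assistant Manager', 'Director']

le = LabelEncoder()
codes = le.fit_transform(positions)

# classes_ holds the sorted labels; a label's code is its index in this array
mapping = {label: code for code, label in enumerate(le.classes_.tolist())}
print(mapping)

# inverse_transform maps codes back to the original labels
decoded = le.inverse_transform(codes).tolist()
print(decoded)
```

Because the fitted le object stores the mapping, the same instance can later be reused with le.transform() on new data that contains the same labels.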
pandas method
df['code'] = pd.factorize(df['Position'], sort=True)[0]
The only difference from Step 2.1, where we kept the original order, is the added parameter “sort=True”. It tells pd.factorize() to sort the unique values of “Position” (alphabetically, for strings) before assigning the codes.
Let’s take a look at the result:
print(df)
Output:
Position Salary code
0 Customer Service 44000 1
1 Manager 75000 3
2 Assistant Manager 65000 0
3 Director 90000 2
We can see that the scikit-learn method and the pandas method generate the same result.
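Rather than comparing the printed tables by eye, the agreement can also be checked in code. This is a small sketch on the example data only; it does not show that the two methods agree on every possible input.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

positions = pd.Series(['Customer Service', 'Manager', 'Assistant Manager', 'Director'])

sklearn_codes = LabelEncoder().fit_transform(positions)
pandas_codes = pd.factorize(positions, sort=True)[0]

# True when both arrays hold the same codes in the same order
same = np.array_equal(sklearn_codes, pandas_codes)
print(same)
```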
Step 2.3: Label encoding in Python using “Salary” feature order
As we discussed in the “Label encoding explained” section, this will most likely be the most algorithm-friendly way to convert a categorical feature to numeric format.
In general, the majority of algorithms benefit from some logic behind the numerical value assignment, be it a sequence, a hierarchy, or something else. It also makes your results easier to interpret.
# Sort rows by salary so that first appearance follows the salary order
df = df.sort_values(by=['Salary'])
# factorize numbers the categories in order of first appearance
df['code'] = pd.factorize(df['Position'])[0]
Since we already know that the sequence of numbers in “code” by default follows the row order of the dataframe (Step 2.1), we first sort df by the “Salary” feature values, and then convert the “Position” feature to numerical format and store it as “code”. Note that the rows keep their original index labels, so the index in the output is no longer in ascending order.
Let’s take a look at the result:
print(df)
Output:
Position Salary code
0 Customer Service 44000 0
2 Assistant Manager 65000 1
1 Manager 75000 2
3 Director 90000 3
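If you would rather keep the rows in their original order, one alternative is to build the category order explicitly and use an ordered pandas Categorical. This is a sketch under the assumption that ranking positions by their average salary is the ordering you want; the variable name order is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Position': ['Customer Service', 'Manager', 'Assistant Manager', 'Director'],
    'Salary': [44000, 75000, 65000, 90000],
})

# Derive the category order from the average salary of each position (lowest first)
order = df.groupby('Position')['Salary'].mean().sort_values().index.tolist()

# An ordered Categorical assigns codes by position in `order`, leaving rows where they are
df['code'] = pd.Categorical(df['Position'], categories=order, ordered=True).codes

print(df)
```

The codes match the ones from the sorting approach (0 for Customer Service, 1 for Assistant Manager, 2 for Manager, 3 for Director), but the dataframe itself is not reordered. Averaging by position also handles datasets where the same position appears in several rows.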
Conclusion
In this tutorial we discussed how to perform label encoding in Python.