Label Encoding in Python

In this tutorial we will discuss label encoding in Python.

Table of Contents

Introduction
Label encoding explained
Advantages and disadvantages of label encoding
Label encoding in Python
Conclusion

Introduction

In data science, we often work with datasets that contain categorical variables whose values are stored as strings. For example, in a dataset used to estimate salaries from different sets of features, job titles are typically entered as words: Manager, Director, Vice-President, President, and so on. The complication is that most machine learning algorithms can use categorical features only after they have been converted to numeric form.

There are multiple ways to solve this problem, and the right choice depends largely on the algorithm you will be working with and on how sensitive it is to the ranges and distributions of numerical features.

Two of the most common approaches are:

Label Encoding
One-Hot Encoding

Both techniques convert categorical/text data to numeric format. These are valid solutions with their own benefits and costs. In this article we will focus on label encoding and its variations. We will also outline the cases in which it should and shouldn’t be applied.

To continue following this tutorial we will need the following two Python libraries: scikit-learn (imported as sklearn) and pandas. If you don’t have them installed, open a terminal (“Command Prompt” on Windows) and install them using the following commands:


pip install scikit-learn
pip install pandas

Label encoding explained

To get a sense of how label encoding works, let’s take a look at the following dataset:

$$
\begin{matrix}
\begin{array}{c|c}
\text{Position} & \text{Salary} \\
\hline
\text{Customer Service} & 44,000 \\
\text{Manager} & 75,000 \\
\text{Assistant Manager} & 65,000 \\
\text{Director} & 90,000 \\
\end{array}
\end{matrix}
$$

Assume it is the data that we would like to feed into some machine learning algorithm. Every row represents a position that an individual holds and the corresponding annual salary.

The “Position” feature is all text and it is what we will need to convert into model-friendly numeric format.

The question is: how do we assign numeric values to categorical text data?

Note: for the purposes of this article, assume the numbers we can assign range from 0 to $+\infty$, with 0 being the smallest.

Here are a few possible ways we can assign numeric values to the “Position” feature:

Option 1: Using current order

$$
\begin{matrix}
\begin{array}{c|c|c}
\text{Position} & \text{Salary} & \text{code} \\
\hline
\text{Customer Service} & 44,000 & 0\\
\text{Manager} & 75,000 & 1 \\
\text{Assistant Manager} & 65,000 & 2 \\
\text{Director} & 90,000 & 3 \\
\end{array}
\end{matrix}
$$

Option 2: Using alphabetical order

$$
\begin{matrix}
\begin{array}{c|c|c}
\text{Position} & \text{Salary} & \text{code} \\
\hline
\text{Customer Service} & 44,000 & 1 \\
\text{Manager} & 75,000 & 3\\
\text{Assistant Manager} & 65,000 & 0 \\
\text{Director} & 90,000 & 2\\
\end{array}
\end{matrix}
$$

Option 3: Using “Salary” feature order

$$
\begin{matrix}
\begin{array}{c|c|c}
\text{Position} & \text{Salary} & \text{code} \\
\hline
\text{Customer Service} & 44,000 & 0 \\
\text{Manager} & 75,000 & 2 \\
\text{Assistant Manager} & 65,000 & 1 \\
\text{Director} & 90,000 & 3 \\
\end{array}
\end{matrix}
$$

Which one is correct to use?

There is no definitive answer. It depends on the algorithm you are going to feed these features to, and on your dataset.

For example, if you are going to use simple linear regression (OLS) to estimate an individual’s salary as a function of their position, you should only use option 3. Here is why: once you convert this feature to a numerical format, the algorithm has no knowledge of your hierarchy. It treats the codes as ordinary numbers and assumes the ordering 0 < 1 < 2 < 3.

If you follow option 1, the algorithm will treat “Assistant Manager” (code 2) as ranking above “Manager” (code 1).
If you follow option 2, the algorithm will treat “Customer Service” (code 1) as ranking above “Assistant Manager” (code 0).

This is not what you want. It can produce misleading estimates and false predictions (which may even appear statistically significant) only because the algorithm treats the numeric feature as ordered from smallest to largest.
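The effect of the three orderings can be checked numerically. The sketch below is only an illustration: it uses NumPy (an assumption, since the tutorial itself only requires pandas and scikit-learn, although NumPy is installed alongside both) to measure the linear correlation between each candidate code and salary. The names salary, options, and correlations are hypothetical.

```python
import numpy as np

# Salaries in the row order of the example table:
# Customer Service, Manager, Assistant Manager, Director
salary = np.array([44000, 75000, 65000, 90000])

# The three candidate encodings from the tables above
options = {
    "option_1_current_order": np.array([0, 1, 2, 3]),
    "option_2_alphabetical": np.array([1, 3, 0, 2]),
    "option_3_salary_order": np.array([0, 2, 1, 3]),
}

# Pearson correlation between each code and salary
correlations = {
    name: np.corrcoef(code, salary)[0, 1] for name, code in options.items()
}

for name, corr in correlations.items():
    print(f"{name}: {corr:.3f}")
```

On this four-row dataset the salary-ordered codes have the strongest linear relationship with salary, and the alphabetical codes have the weakest. That relationship is what a linear model relies on.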

Now, the interesting part comes when you decide to implement this in Python. There are a handful of ways to achieve the same result, and we will discuss a few of them below.

Advantages and disadvantages of label encoding

It is important to understand the benefits and drawbacks of label encoding and also consider other available encoding techniques.

Advantages:

It is easy to implement and interpret
It is visually user-friendly: the feature stays in a single column
It works best with a small number of unique categorical values

Disadvantages:

It can skew the estimation results if an algorithm is very sensitive to feature magnitude (like SVM). In such cases you may consider standardizing or normalizing the values after encoding.
It can skew the estimation results if there is a large number of unique categorical values. In our case there were 4, but with 10 or more you should keep this in mind and look into other encoding techniques, for example one-hot encoding.
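For comparison, here is a minimal sketch of the one-hot alternative mentioned above. It assumes pandas.get_dummies as the tool, which is one common choice and not something this tutorial otherwise uses. The variable name one_hot is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Position": ["Customer Service", "Manager", "Assistant Manager", "Director"],
    "Salary": [44000, 75000, 65000, 90000],
})

# One indicator column per unique position; no ordering is implied
one_hot = pd.get_dummies(df["Position"])

print(one_hot.columns.tolist())
print(one_hot.shape)
```

Each row has exactly one active indicator, so no position is treated as larger than another. The cost is one extra column per unique value.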

Label encoding in Python

In this part we will cover a few different ways to do label encoding in Python.

Two of the most popular approaches:

LabelEncoder() from scikit-learn library
pandas.factorize() from pandas library

Once the libraries are installed, we can proceed with the Python implementation.

Step 1: Create a dataframe with the required data


import pandas as pd

# Build the sample data as a dictionary, then convert it to a dataframe
df = {'Position': ['Customer Service', 'Manager', 'Assistant Manager', 'Director'],
      'Salary': [44000, 75000, 65000, 90000]}
df = pd.DataFrame(df)

First we import the pandas library, which is required to create a pandas dataframe. Then we create a Python dictionary df and convert it to a dataframe.

Let’s take a look at the result:


print(df)


Output:

            Position  Salary
0   Customer Service   44000
1            Manager   75000
2  Assistant Manager   65000
3           Director   90000

Step 2.1: Label encoding in Python using current order


df['code'] = pd.factorize(df['Position'])[0]

We create a new feature “code” and assign to it the categorical feature “Position” in numerical format.

By default, the numbers in “code” follow the order in which each value first appears in the dataframe df:


print(df)


Output:

            Position  Salary  code
0   Customer Service   44000     0
1            Manager   75000     1
2  Assistant Manager   65000     2
3           Director   90000     3
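The [0] index in the call above is needed because pd.factorize() returns two objects: the integer codes and the unique values they refer to. The following sketch unpacks both and shows one way to map the codes back to the labels. The names positions, codes, uniques, and decoded are hypothetical.

```python
import pandas as pd

positions = pd.Series(
    ["Customer Service", "Manager", "Assistant Manager", "Director"]
)

# factorize returns the integer codes and the unique values they refer to
codes, uniques = pd.factorize(positions)

# Indexing the unique values with the codes recovers the original labels
decoded = uniques[codes]

print(codes)
print(list(decoded))
```

Keeping uniques around is useful if you later need to translate model inputs or outputs back into readable position names.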

Step 2.2: Label encoding in Python using alphabetical order

This case is a little more interesting as we can achieve the same result using both of the methods mentioned earlier.

scikit-learn method


from sklearn.preprocessing import LabelEncoder

# LabelEncoder assigns codes based on the sorted unique values
le = LabelEncoder()

df['code'] = le.fit_transform(df['Position'])

We first import LabelEncoder() from the scikit-learn library and define le as its instance. Then we apply it to the “Position” feature to convert it to numerical format and store the result as a new feature “code”.

What’s interesting about this method is that by default LabelEncoder() sorts the values alphabetically without us having to specify anything.

Let’s take a look at what we arrived at:


print(df)


Output:

            Position  Salary  code
0   Customer Service   44000     1
1            Manager   75000     3
2  Assistant Manager   65000     0
3           Director   90000     2

LabelEncoder() sorted the values in the “Position” feature alphabetically and assigned the codes 0 to 3 in the following sequence: Assistant Manager, Customer Service, Director, Manager.
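To inspect the mapping that LabelEncoder() learned, you can look at its classes_ attribute and reverse the encoding with inverse_transform(). The sketch below is a small illustration on the same four positions. The variable names positions, codes, and restored are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

positions = ["Customer Service", "Manager", "Assistant Manager", "Director"]

le = LabelEncoder()
codes = le.fit_transform(positions)

# classes_ holds the sorted unique labels; the index of each label is its code
print(list(le.classes_))

# inverse_transform maps codes back to the original labels
restored = le.inverse_transform(codes)
print(list(restored))
```

Note that the scikit-learn documentation describes LabelEncoder as intended for encoding target values rather than input features, so treat its use on a feature column here as a convenience.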

pandas method


df['code'] = pd.factorize(df['Position'], sort=True)[0]

The only difference from Step 2.1, where we kept the original order, is the “sort=True” parameter, which tells pandas to sort the unique values of “Position” (alphabetically for strings) before assigning the codes.

Let’s take a look at the result:


print(df)


Output:

            Position  Salary  code
0   Customer Service   44000     1
1            Manager   75000     3
2  Assistant Manager   65000     0
3           Director   90000     2

We can see that the scikit-learn method and the pandas method produce the same result.

Step 2.3: Label encoding in Python using “Salary” feature order

As we discussed in the “Label encoding explained” section, this will most likely be the most algorithm-friendly way to convert the categorical feature to numeric format.

In general, many algorithms benefit when there is some logic behind the numerical value assignment, such as a sequence or a hierarchy. It also makes your results easier to interpret.


# Sort rows by salary so that codes are assigned in salary order
df = df.sort_values(by=['Salary'])

df['code'] = pd.factorize(df['Position'])[0]

Since we already know that the numbers in “code” by default follow the order of the dataframe df (Step 2.1), we first sort df by the “Salary” feature values, then convert the “Position” feature to numerical format and store it as “code”.

Let’s take a look at the result:


print(df)


Output:

            Position  Salary  code
0   Customer Service   44000     0
2  Assistant Manager   65000     1
1            Manager   75000     2
3           Director   90000     3
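If you would rather keep the original row order of the dataframe, the same salary-based codes can be produced without sorting df itself. The sketch below is one possible approach, not the only one: it assumes an ordered pandas Categorical whose category order is taken from the salaries, and it assumes each position appears only once. The name ordered_positions is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Position": ["Customer Service", "Manager", "Assistant Manager", "Director"],
    "Salary": [44000, 75000, 65000, 90000],
})

# Positions listed from lowest to highest salary
ordered_positions = df.sort_values("Salary")["Position"].tolist()

# The code of each position is its index in ordered_positions
df["code"] = pd.Categorical(
    df["Position"], categories=ordered_positions, ordered=True
).codes

print(df)
```

The rows stay in their original order, while “code” still increases with salary.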

Conclusion

In this tutorial we discussed how to perform label encoding in Python.

