In this article we will explore how to normalize data in Python.
Table of Contents
Introduction
One of the first steps in feature engineering for many machine learning models is ensuring that the data is scaled properly.
Some models, such as linear regression, KNN, and SVM, for example, are heavily affected by features with different scales.
While others, such as decision trees, bagging, and boosting algorithms generally do not require any data scaling.
The level of effect of features’ scales on mentioned models is high, and features with larger ranges of values will play a bigger role in the decision making of the algorithm since impacts they produce have larger effect on the outputs.
In such cases, we turn to feature scaling to help us find common level for all these features to be evaluated equally when training the model.
Two most popular feature scaling techniques are:
- Z-Score Standardization
- Min-Max Normalization
In this article, we will discuss how to perform min-max normalization of data using Python.
To continue following this tutorial we will need the following two Python libraries: pandas and sklearn.
If you don’t have them installed, please open “Command Prompt” (on Windows) and install them using the following code:
pip install pandas
pip install sklearn
What is normalization
In statistics and machine learning, min-max normalization of data is a process of converting original range of data to the range between 0 and 1.
The resulting normalized values represent the original data on 0-1 scale.
This will allow us to compare multiple features together and get more relevant information since now all the data will be on the same scale.
In min-max normalization, for every feature, its minimum value gets transformed into 0 and its maximum value gets transformed into 1. All values in-between get scaled to be within 0-1 range based on the original value relative to minimum and maximum values of the feature.
Suppose you have an array of numbers \(A = [v_1, v_2, …, v_i]\).
We will first find the minimum and maximum values of the array: \(min_A\) and \(max_A\).
Then, using the min and max values we will transform each original value \(v_i\) into a min-max normalized value \(v’_i\) using the follwoing formula:
$$v’_i = \frac{v_i – min_A}{max_A – min_A}$$
Normalization example
In this section we will take a look at a simple example of data normalization.
Consider the following dataset with prices of different apples:
Weight in g | Price in $ |
300 | 3 |
250 | 2 |
800 | 5 |
And plotting this dataset should look like this:
Here we see a much larger variation of the weight compared to price, but it appears to looks like this because of different scales of the data.
The prices range is between $2 and $5, whereas the weight range is between 250g and 800g.
Let’s normalize this data!
Start with the weight feature:
Observation | \(v_i\) | \(v’_i = \frac{v_i – min_W}{max_W – min_W}\) |
1 | 300 | \(\frac{300-250}{800-250} = 0.09\) |
2 | 250 | \(\frac{250-250}{800-250} = 0\) |
3 | 800 | \(\frac{800-250}{800-250} = 1\) |
\(min_W\) | 250 | |
\(max_W\) | 800 |
And do the same for the price feature:
Observation | \(v_i\) | \(v’_i = \frac{v_i – min_P}{max_P – min_P}\) |
1 | 3 | \(\frac{3-2}{5-2} = 0.33\) |
2 | 2 | \(\frac{2-2}{5-2} = 0\) |
3 | 5 | \(\frac{5-2}{5-2} = 1\) |
\(min_P\) | 2 | |
\(max_P\) | 5 |
And combine the two features into one dataset:
Weight (normalized) | Price (normalized) |
0.09 | 0.33 |
0 | 0 |
1 | 1 |
We can now see that the scale of the features in the dataset is very similar, and when visualizing the data, the spread between the points will be smaller:
The graph looks almost identical with the only difference being the scale of the each axis.
Now let’s see how we can recreate this example using Python!
How to normalize data in Python
Let’s start by creating a dataframe that we used in the example above:
import pandas as pd
data = {'weight':[300, 250, 800],
'price':[3, 2, 5]}
df = pd.DataFrame(data)
print(df)
And you should get:
weight price 0 300 3 1 250 2 2 800 5
Once we have the data ready, we can use the MinMaxScaler() class and its methods (from sklearn library) to normalize the data:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(df)
print(normalized_data)
And you should get:
[[0.09090909 0.33333333] [0. 0. ] [1. 1. ]]
As you can see, the above code returned an array, so the last step would be to convert it to dataframe:
normalized_df = pd.DataFrame(normalized_data, columns=df.columns)
print(normalized_df)
And you should get:
weight price 0 0.090909 0.333333 1 0.000000 0.000000 2 1.000000 1.000000
which is identical to the result in the example which we calculated manually.
Conclusion
In this tutorial we discussed how to normalize data in Python.
Feel free to leave comments below if you have any questions or have suggestions for some edits and check out more of my Machine Learning articles.