In this article we will explore how to perform feature engineering with VectorAssembler in PySpark.
Table of contents:
- Introduction
- Create SparkSession with PySpark
- Create Spark DataFrame with PySpark
- Create a single vector column using VectorAssembler in PySpark
- Conclusion
Introduction
In Python, especially when working with sklearn, most models accept raw DataFrames as training input. In a distributed environment it is a little more complicated, as Spark ML expects the training features to be assembled into a single vector column first.
VectorAssembler from the Spark ML library is a feature transformer that combines numerical features into a single vector used by machine learning models.
As an overview, it takes a list of columns (features) and combines them into a single vector column (a feature vector), which is then used as the input to machine learning models in Spark ML.
To follow along with this tutorial you will need Spark and Java installed on your machine, as well as the following Python library: pyspark.
pip install pyspark
Create SparkSession with PySpark
The first step and the main entry point to all Spark functionality is the SparkSession class:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('mysession').getOrCreate()
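If you are running Spark locally, you can also set the master explicitly on the builder. A minimal variation of the same session (the local[*] master setting is an optional assumption, not required for the rest of the tutorial):
# Equivalent session with an explicit local master;
# local[*] uses all available CPU cores
spark = (
    SparkSession.builder
    .appName("mysession")
    .master("local[*]")
    .getOrCreate()
)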
Create Spark DataFrame with PySpark
As the next step we will create a simple Spark DataFrame with three features (“Age”, “Experience”, “Education”) and a target variable (“Salary”):
df = spark.createDataFrame(
[
(20, 1, 2, 22000),
(25, 2, 3, 30000),
(36, 12, 6, 70000),
],
["Age", "Experience", "Education", "Salary"]
)
Let’s take a look:
df.show()
+---+----------+---------+------+
|Age|Experience|Education|Salary|
+---+----------+---------+------+
| 20| 1| 2| 22000|
| 25| 2| 3| 30000|
| 36| 12| 6| 70000|
+---+----------+---------+------+
For this example the DataFrame is simple, with all columns being numerical. When working with other datasets you should always identify and convert the data types correctly, check for null values, and perform the required data transformations.
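For instance, you can quickly inspect the schema and count the nulls per column before assembling. A minimal sanity-check sketch (not strictly needed for this toy dataset):
from pyspark.sql import functions as F

# Inspect the inferred column types
df.printSchema()

# Count null values in each column
df.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in df.columns]
).show()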
Create a single vector column using VectorAssembler in PySpark
Our goal in this step is to combine the three numerical features (“Age”, “Experience”, “Education”) into a single vector column (let’s call it “features”).
VectorAssembler takes two main parameters:
- inputCols – list of features to combine into a single vector column
- outputCol – the new column that will contain the transformed vector
Let’s create our assembler:
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
inputCols=["Age", "Experience", "Education"],
outputCol="features")
Now, using this assembler, we can transform the original DataFrame and take a look at the result:
output = assembler.transform(df)
output.show()
+---+----------+---------+------+---------------+
|Age|Experience|Education|Salary| features|
+---+----------+---------+------+---------------+
| 20| 1| 2| 22000| [20.0,1.0,2.0]|
| 25| 2| 3| 30000| [25.0,2.0,3.0]|
| 36| 12| 6| 70000|[36.0,12.0,6.0]|
+---+----------+---------+------+---------------+
Perfect! This DataFrame can now be used for training models available in Spark ML by passing the “features” vector column as your input variable and “Salary” as your target variable.
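As a quick illustration of that last point, here is a minimal sketch (not part of the original example) that fits a linear regression on the assembled output:
from pyspark.ml.regression import LinearRegression

# Train a simple linear regression on the assembled features
lr = LinearRegression(featuresCol="features", labelCol="Salary")
model = lr.fit(output)

# Inspect predictions on the training data
model.transform(output).select("Salary", "prediction").show()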
Conclusion
In this article we explored how to use VectorAssembler for feature engineering in PySpark.
I also encourage you to check out my other posts on Feature Engineering.
Feel free to leave comments below if you have any questions or suggestions for edits.