Transforming Arrays and Maps in PySpark
This tutorial explains three higher-order functions in PySpark for manipulating array columns (their map counterparts are noted briefly at the end):
transform()
filter()
zip_with()
All three are available as DataFrame API functions from PySpark 3.1 onward.
Sample Data Setup
from pyspark.sql import SparkSession
from pyspark.sql.functions import transform, filter, zip_with
spark = SparkSession.builder.appName("TransformFilterZipWith").getOrCreate()
# Each row holds a name and an array of integers
data = [("John", [1, 2, 3, 4]), ("Lisa", [10, 20, 30]), ("Aamir", [5, 15, 25])]
df = spark.createDataFrame(data, ["name", "numbers"])
df.show()
Output:
+-----+------------+
| name|     numbers|
+-----+------------+
| John|[1, 2, 3, 4]|
| Lisa|[10, 20, 30]|
|Aamir| [5, 15, 25]|
+-----+------------+
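These higher-order functions operate on array columns; printSchema() can be used to confirm that numbers was inferred as an array of longs:
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- numbers: array (nullable = true)
#  |    |-- element: long (containsNull = true)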
1️⃣ transform()
Definition: Applies a function to every element of an array and returns a new array containing the results.
df.select("name", transform("numbers", lambda x: x * 2).alias("transformed")).show()
Output:
+-----+------------+
| name| transformed|
+-----+------------+
| John|[2, 4, 6, 8]|
| Lisa|[20, 40, 60]|
|Aamir|[10, 30, 50]|
+-----+------------+
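transform() also accepts a two-argument lambda, in which case the second argument is the element's 0-based index. A minimal sketch (the alias scaled_by_index is only illustrative):
# Multiply each element by its position in the array
df.select("name", transform("numbers", lambda x, i: x * i).alias("scaled_by_index")).show()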
2️⃣ filter()
Definition: Keeps only the array elements for which the given condition returns true. (Note that importing filter from pyspark.sql.functions shadows Python's built-in filter in this script.)
df.select("name", filter("numbers", lambda x: x > 10).alias("filtered")).show()
Output:
+-----+--------+
| name|filtered|
+-----+--------+
| John|      []|
| Lisa|[20, 30]|
|Aamir|[15, 25]|
+-----+--------+
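For reference, the same filter can also be written as a SQL lambda via expr(); this form is equivalent to the DataFrame API call above and is the way to use these higher-order functions on Spark versions before 3.1, where they exist only as SQL expressions:
from pyspark.sql.functions import expr
# Equivalent SQL lambda syntax for the same filter
df.select("name", expr("filter(numbers, x -> x > 10)").alias("filtered")).show()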
3️⃣ zip_with()
Definition: Combines two arrays element-wise using a binary function. If the arrays have different lengths, the shorter one is padded with null before the function is applied.
from pyspark.sql.functions import lit, array
# Add a constant four-element array to combine with each row's numbers
df2 = df.withColumn("multiplier", array(lit(10), lit(20), lit(30), lit(40)))
df2.select("name", zip_with("numbers", "multiplier", lambda x, y: x + y).alias("zipped")).show()
Output:
+-----+------------------+
| name|            zipped|
+-----+------------------+
| John|  [11, 22, 33, 44]|
| Lisa|[20, 40, 60, null]|
|Aamir|[15, 35, 55, null]|
+-----+------------------+
John's four-element array lines up with the multiplier exactly, while Lisa's and Aamir's three-element arrays are padded with null in the fourth position, so x + y yields null there.
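If the null padding is not wanted, the lambda can substitute a default value; a minimal sketch, assuming 0 is an acceptable fill:
from pyspark.sql.functions import coalesce, lit
# Treat missing elements of the shorter array as 0 instead of null
df2.select("name", zip_with("numbers", "multiplier", lambda x, y: coalesce(x, lit(0)) + coalesce(y, lit(0))).alias("zipped")).show()
The map counterparts follow the same higher-order pattern: transform_keys(), transform_values(), and map_filter() each take a lambda over the key and value. A minimal sketch, assuming a small map column named scores that is not part of the sample data above:
from pyspark.sql.functions import transform_values, map_filter
# Hypothetical DataFrame with a map column: subject -> score
map_df = spark.createDataFrame([("John", {"math": 80, "science": 95})], ["name", "scores"])
# Add 5 bonus points to every value
map_df.select("name", transform_values("scores", lambda k, v: v + 5).alias("boosted")).show(truncate=False)
# Keep only entries with a score above 90
map_df.select("name", map_filter("scores", lambda k, v: v > 90).alias("top")).show(truncate=False)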