🚀 Speed Up PySpark with pandas_udf() – Easy Tutorial
Want faster performance from your PySpark jobs? This tutorial shows how to use pandas_udf(),
which transfers data via Apache Arrow and processes it in batches as pandas Series, avoiding the per-row serialization overhead that makes regular Python UDFs slow.
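The core difference is that a regular UDF is called once per row, while a pandas_udf receives a whole batch as a pandas Series and can use vectorized operations on it. A minimal pandas-only sketch of the two styles (function names here are illustrative):

```python
import pandas as pd

# Regular-UDF style: one Python call per value.
def row_at_a_time(name: str) -> int:
    return len(name)

# pandas_udf style: one call per batch, vectorized inside.
def batched(series: pd.Series) -> pd.Series:
    return series.str.len()

s = pd.Series(["apple", "banana", "kiwi"])
print([row_at_a_time(x) for x in s])  # [5, 6, 4]
print(batched(s).tolist())            # [5, 6, 4]
```

Both produce the same values, but the batched version does its work in optimized pandas/NumPy code instead of a Python-level loop.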
📘 Sample DataFrame
data = [("apple",), ("banana",), ("kiwi",)]
df = spark.createDataFrame(data, ["fruit"])
df.show()
Output:
+--------+
| fruit |
+--------+
| apple |
| banana |
| kiwi |
+--------+
⚡ Step 1: Define a pandas_udf to Get Length of Fruit Name
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import IntegerType
import pandas as pd
@pandas_udf(IntegerType())
def fruit_length(series: pd.Series) -> pd.Series:
    return series.str.len()

df = df.withColumn("length", fruit_length(df["fruit"]))
df.select("fruit", "length").show()
Output:
+--------+------+
| fruit |length|
+--------+------+
| apple | 5|
| banana | 6|
| kiwi | 4|
+--------+------+
🎯 Step 2: Classify Fruit Based on Length
from pyspark.sql.types import StringType
@pandas_udf(StringType())
def classify_fruit(series: pd.Series) -> pd.Series:
    return series.apply(lambda name: "long name" if len(name) > 5 else "short name")

df = df.withColumn("length_category", classify_fruit(df["fruit"]))
df.select("fruit", "length_category").show()
Output:
+--------+----------------+
| fruit  | length_category|
+--------+----------------+
| apple  | short name     |
| banana | long name      |
| kiwi   | short name     |
+--------+----------------+
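One caveat: `series.apply` still invokes the lambda once per row in Python, so it gives up much of the batching advantage. Since the UDF already receives the whole batch as a Series, the classification can stay fully vectorized. A hedged sketch using NumPy (the `classify_vectorized` name is illustrative; the body could replace the one in `classify_fruit` above):

```python
import numpy as np
import pandas as pd

def classify_vectorized(series: pd.Series) -> pd.Series:
    # np.where evaluates the whole batch at once instead of
    # calling a Python lambda for every row.
    return pd.Series(
        np.where(series.str.len() > 5, "long name", "short name"),
        index=series.index,
    )
```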