How to Use call_function() in PySpark
Want to call Spark SQL functions dynamically without hardcoding them? In this tutorial, we explore the call_function() utility in PySpark (available since Spark 3.5), which lets you apply any built-in SQL function by name at runtime, making it perfect for reusable pipelines and configurable logic.
📘 Sample DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    ("Aamir", 3000.456),
    ("Bob", 4999.899),
    ("Charlie", 2500.5)
]
df = spark.createDataFrame(data, ["name", "salary"])
df.show()
Output:
+-------+--------+
|   name|  salary|
+-------+--------+
|  Aamir|3000.456|
|    Bob|4999.899|
|Charlie|  2500.5|
+-------+--------+
🔄 Use call_function to Dynamically Call SQL Functions
from pyspark.sql.functions import col, lit, call_function
# Uppercase name
df = df.withColumn("name_upper", call_function("upper", col("name")))
df.select("name", "name_upper").show()
Output:
+-------+----------+
|   name|name_upper|
+-------+----------+
|  Aamir|     AAMIR|
|    Bob|       BOB|
|Charlie|   CHARLIE|
+-------+----------+
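For a fixed function name, this is equivalent to calling the built-in directly or using a SQL expression. A quick sketch of the three spellings, using the DataFrame above:

from pyspark.sql.functions import upper, expr

# All three produce the same Column: upper(name)
df.select(
    call_function("upper", col("name")).alias("via_call_function"),
    upper(col("name")).alias("via_builtin"),
    expr("upper(name)").alias("via_expr")
).show()

The advantage of call_function() is that the function name can live in a variable, which the later examples take advantage of.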
💲 Format Number with call_function
df = df.withColumn("salary_formatted", call_function("format_number", col("salary"), lit(2)))
df.select("salary", "salary_formatted").show()
Output:
+--------+----------------+
|  salary|salary_formatted|
+--------+----------------+
|3000.456|        3,000.46|
|4999.899|        4,999.90|
|  2500.5|        2,500.50|
+--------+----------------+
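Note that each argument after the function name is treated as a column, so plain Python literals like the decimal-place count are wrapped with lit(). A minimal sketch of the same call driven by a configuration value (decimals is a hypothetical variable, not part of any API):

decimals = 2  # hypothetical: could come from a config file or job parameter
df = df.withColumn(
    "salary_formatted",
    call_function("format_number", col("salary"), lit(decimals))
)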
📏 Dynamically Apply Any SQL Function
func_name = "length"
df = df.withColumn("name_length", call_function(func_name, col("name")))
df.select("name", "name_length").show()
Output:
+-------+-----------+
|   name|name_length|
+-------+-----------+
|  Aamir|          5|
|    Bob|          3|
|Charlie|          7|
+-------+-----------+
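Because the function name is just a string, an entire set of derived columns can be driven from configuration. A minimal sketch, assuming a hypothetical derived_columns mapping of output column to (SQL function name, input column):

# Hypothetical config: output column -> (SQL function name, input column)
derived_columns = {
    "name_upper": ("upper", "name"),
    "name_length": ("length", "name"),
    "salary_rounded": ("round", "salary"),
}

for out_col, (func_name, in_col) in derived_columns.items():
    df = df.withColumn(out_col, call_function(func_name, col(in_col)))

df.show()

Swapping the mapping changes the pipeline without touching the transformation code, which is the "configurable logic" promised in the introduction.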