Top PySpark Built-in DataFrame Functions Explained
In this tutorial, we walk through some of the most frequently used built-in PySpark DataFrame functions: col(), lit(), when(), expr(), and rand().
1️⃣ Set Up the Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, expr, when, rand
spark = SparkSession.builder.appName("BuiltinFunctionsDemo").getOrCreate()
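If you are running this outside a cluster, you can also pin the master explicitly. This is just a minimal variation; the local[*] setting is an assumption about a local setup, not something the rest of the tutorial depends on.
# Optional: run in local mode, using all available cores
spark = SparkSession.builder \
    .appName("BuiltinFunctionsDemo") \
    .master("local[*]") \
    .getOrCreate()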
2️⃣ Create Sample DataFrame
data = [
    ("Alice", 34),
    ("Bob", 45),
    ("Cathy", None)
]
df = spark.createDataFrame(data, ["name", "age"])
df.show()
Output:
+-----+----+
| name| age|
+-----+----+
|Alice|  34|
|  Bob|  45|
|Cathy|null|
+-----+----+
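Because Cathy's age is None, Spark infers the age column as a nullable long. If you prefer an explicit type, you can pass a schema instead of a list of column names. Here is a minimal sketch; the df_typed name is just illustrative.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema: both columns are nullable, so None is allowed for age
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()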
3️⃣ Using col() and lit()
df.select(col("name"), col("age"), lit(100).alias("lit_col")).show()
Output:
+-----+----+-------+
| name| age|lit_col|
+-----+----+-------+
|Alice|  34|    100|
|  Bob|  45|    100|
|Cathy|null|    100|
+-----+----+-------+
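The same lit() value can also be attached as a new column with withColumn() instead of select(); a short sketch (the column name lit_col is reused here only for illustration):
# Add a constant column to the existing DataFrame
df.withColumn("lit_col", lit(100)).show()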
4️⃣ Conditional Logic using when()
df.select("name", "age",
when(col("age") > 40, "Above 40")
.otherwise("Below 40").alias("category")
).show()
Output:
+-----+----+---------+
| name| age| category|
+-----+----+---------+
|Alice|  34| Below 40|
|  Bob|  45| Above 40|
|Cathy|null| Below 40|
+-----+----+---------+
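Note that Cathy's null age does not satisfy the condition, so it falls through to otherwise() and gets labeled "Below 40". If you want missing values handled separately, you can add an explicit isNull() branch; a minimal sketch, where the "Unknown" label is just an example:
df.select(
    "name", "age",
    when(col("age").isNull(), "Unknown")        # handle missing ages first
        .when(col("age") > 40, "Above 40")
        .otherwise("Below 40")
        .alias("category")
).show()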
5️⃣ Expression Evaluation using expr()
df.select(expr("age + 5 as age_plus_5")).show()
Output:
+----------+
|age_plus_5|
+----------+
|        39|
|        50|
|      null|
+----------+
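Because expr() accepts any SQL expression, the when() logic from the previous step can also be written as a CASE WHEN string; a small sketch:
df.select(
    "name",
    expr("CASE WHEN age > 40 THEN 'Above 40' ELSE 'Below 40' END").alias("category")
).show()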
6️⃣ Generate Random Numbers with rand()
df.select("name", rand().alias("random_val")).show()
Output:
+-----+------------------+
| name| random_val|
+-----+------------------+
|Alice|0.6348754580941226|
| Bob|0.2984509329806971|
|Cathy|0.8883241025348764|
+-----+------------------+
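rand() returns a new value on every run, so your numbers will differ from the ones shown above. Passing a seed makes the column reproducible across runs; for example:
# A fixed seed produces the same pseudo-random values on every run
df.select("name", rand(seed=42).alias("random_val")).show()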