Top PySpark Built-in DataFrame Functions Explained
In this tutorial, we walk through some of the most frequently used PySpark built-in functions, such as col(), lit(), when(), expr(), rand(), and more.
1️⃣ Setup Spark Session
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, expr, when, rand
spark = SparkSession.builder.appName("BuiltinFunctionsDemo").getOrCreate()
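If you are running the example locally rather than on a cluster, you may want to pin the master URL explicitly and stop the session when you are done; a minimal sketch, assuming a local standalone run:
from pyspark.sql import SparkSession

# Local run using all available cores (assumed setup; adjust for your cluster)
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("BuiltinFunctionsDemo")
    .getOrCreate()
)
# ... run the examples below ...
# spark.stop()  # release resources when finished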
2️⃣ Create Sample DataFrame
data = [
    ("Alice", 34),
    ("Bob", 45),
    ("Cathy", None),
]
df = spark.createDataFrame(data, ["name", "age"])
df.show()
Output:
+-----+----+
| name| age|
+-----+----+
|Alice| 34|
| Bob| 45|
|Cathy|null|
+-----+----+
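If you prefer not to rely on schema inference, you can pass an explicit schema instead of a list of column names; a minimal sketch using pyspark.sql.types:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema: column names, types, and nullability are spelled out instead of inferred
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])
df = spark.createDataFrame(data, schema)
df.printSchema()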
3️⃣ Using col() and lit()
df.select(col("name"), col("age"), lit(100).alias("lit_col")).show()
Output:
+-----+----+--------+
| name| age|lit_col |
+-----+----+--------+
|Alice| 34| 100|
| Bob| 45| 100|
|Cathy|null| 100|
+-----+----+--------+
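Because col() returns an ordinary column expression, it composes with arithmetic and with lit() constants. A minimal sketch adding a derived column (the column name age_plus_bonus is just illustrative):
# Combine a column reference with a literal constant; Cathy's null age stays null
df.withColumn("age_plus_bonus", col("age") + lit(5)).show()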
4️⃣ Conditional Logic using when()
df.select("name", "age",
when(col("age") > 40, "Above 40")
.otherwise("Below 40").alias("category")
).show()
Output:
+-----+----+---------+
| name| age| category|
+-----+----+---------+
|Alice| 34| Below 40|
| Bob| 45| Above 40|
|Cathy|null| Below 40|
+-----+----+---------+
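Note that Cathy's null age does not satisfy the comparison, so it falls through to otherwise() and is labelled "Below 40". If you want missing values handled separately, when() calls can be chained and isNull() checked first; a minimal sketch (the labels "Unknown" and "40 or below" are just illustrative):
# Chained branches are evaluated in order; nulls get their own label here
df.select(
    "name",
    when(col("age").isNull(), "Unknown")
    .when(col("age") > 40, "Above 40")
    .otherwise("40 or below")
    .alias("category"),
).show()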
5️⃣ Expression Evaluation using expr()
df.select(expr("age + 5 as age_plus_5")).show()
Output:
+----------+
|age_plus_5|
+----------+
|        39|
|        50|
|      null|
+----------+
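expr() accepts any SQL expression string, so it is not limited to arithmetic; built-in SQL functions and predicates work as well. A minimal sketch (the alias name_upper is just illustrative):
# A SQL function call inside select(), and a SQL predicate inside filter()
df.select("name", expr("upper(name) AS name_upper")).show()
df.filter(expr("age IS NOT NULL AND age > 40")).show()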
6️⃣ Generate Random Numbers with rand()
df.select("name", rand().alias("random_val")).show()
Output (your values will differ on each run):
+-----+------------------+
| name| random_val|
+-----+------------------+
|Alice|0.6348754580941226|
| Bob|0.2984509329806971|
|Cathy|0.8883241025348764|
+-----+------------------+
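rand() is non-deterministic, so the column is recomputed with different values each time unless you pass a seed; a minimal sketch with a fixed seed (42 chosen arbitrarily):
# A seeded rand() produces reproducible pseudo-random values for the same plan
df.select("name", rand(seed=42).alias("random_val")).show()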