How to Use select(), selectExpr(), col(), expr(), when(), and lit() in PySpark
In this guide, you will learn how to use these PySpark functions to select, manipulate, and transform DataFrame columns efficiently in your data engineering projects.
Topics Covered
- select() - Retrieve specific columns.
- selectExpr() - Use SQL expressions.
- col() - Reference columns.
- expr() - Evaluate SQL expressions as columns.
- when() - Conditional logic.
- lit() - Add constant columns.
1. Sample DataFrame Creation
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, when, lit
# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("SelectExamples").getOrCreate()
# Sample Data
data = [
(1, "Alice", 5000, "IT", 25),
(2, "Bob", 6000, "HR", 30),
(3, "Charlie", 7000, "Finance", 35),
(4, "David", 8000, "IT", 40),
(5, "Eve", 9000, "HR", 45)
]
# Creating DataFrame
df = spark.createDataFrame(data, ["id", "name", "salary", "department", "age"])
# Show DataFrame
df.show()
2. Selecting Specific Columns
df.select("name", "salary").show()
3. Using col() Function
df.select(col("name"), col("department")).show()
4. Renaming Columns Using alias()
df.select(col("name").alias("Employee_Name"), col("salary").alias("Employee_Salary")).show()
5. Using Expressions in select()
df.select("name", "salary", expr("salary * 1.10 AS increased_salary")).show()
6. Using Conditional Expressions with when()
df.select(
"name",
"salary",
when(col("salary") > 7000, "High").otherwise("Low").alias("Salary_Category")
).show()
7. Using selectExpr() for SQL-like Expressions
df.selectExpr("name", "salary * 2 as double_salary").show()
8. Adding Constant Columns Using lit()
df.select("name", "department", lit("Active").alias("status")).show()
9. Selecting Columns Dynamically
columns_to_select = ["name", "salary", "department"]
df.select(*columns_to_select).show()
10. Selecting All Columns Except One
df.select([column for column in df.columns if column != "age"]).show()