How to Use select(), selectExpr(), col(), expr(), when(), and lit() in PySpark | PySpark Tutorial

This guide shows how to use six PySpark functions to select, rename, and transform DataFrame columns in data engineering work.

Topics Covered

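  • select() - Retrieve specific columns from a DataFrame.
  • selectExpr() - Select columns using SQL expression strings.
  • col() - Reference a column by name as a Column object.
  • expr() - Turn a SQL expression string into a Column.
  • when() - Apply conditional logic, similar to SQL CASE WHEN.
  • lit() - Add a column holding a constant (literal) value.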

1. Sample DataFrame Creation

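from pyspark.sql import SparkSession
from pyspark.sql.functions import col, expr, when, lit

# Create (or reuse) a SparkSession. Notebooks and the pyspark shell usually
# provide `spark` already; the app name here is arbitrary.
spark = SparkSession.builder.appName("SelectFunctionsDemo").getOrCreate()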

# Sample Data
data = [
    (1, "Alice", 5000, "IT", 25),
    (2, "Bob", 6000, "HR", 30),
    (3, "Charlie", 7000, "Finance", 35),
    (4, "David", 8000, "IT", 40),
    (5, "Eve", 9000, "HR", 45)
]

# Creating DataFrame
df = spark.createDataFrame(data, ["id", "name", "salary", "department", "age"])

# Show DataFrame
df.show()

2. Selecting Specific Columns

df.select("name", "salary").show()

3. Using col() Function

df.select(col("name"), col("department")).show()

4. Renaming Columns Using alias()

df.select(col("name").alias("Employee_Name"), col("salary").alias("Employee_Salary")).show()

5. Using Expressions in select()

df.select("name", "salary", expr("salary * 1.10 AS increased_salary")).show()

6. Using Conditional Expressions with when()

df.select(
    "name",
    "salary",
    when(col("salary") > 7000, "High").otherwise("Low").alias("Salary_Category")
).show()

7. Using selectExpr() for SQL-like Expressions

df.selectExpr("name", "salary * 2 as double_salary").show()

8. Adding Constant Columns Using lit()

df.select("name", "department", lit("Active").alias("status")).show()

9. Selecting Columns Dynamically

columns_to_select = ["name", "salary", "department"]
df.select(*columns_to_select).show()

10. Selecting All Columns Except One

df.select([column for column in df.columns if column != "age"]).show()

Author: Aamir Shahzad
