How to Use explain() Function in PySpark | Logical vs Physical Plan

What is explain() in PySpark?

The explain() function in PySpark shows how Spark plans to execute a query before it runs. It exposes two kinds of plans:

  • Logical Plan – what you asked for (a declarative description of the transformations)
  • Physical Plan – how Spark will actually execute it (scans, filters, exchanges)

By default, explain() prints only the physical plan; pass a mode such as "extended" to see the logical plans as well. This is useful for debugging and performance tuning.

Step 1: Create Sample DataFrame

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name is illustrative
spark = SparkSession.builder.appName("ExplainDemo").getOrCreate()

data = [
    ("Aamir Shahzad", "Engineering", 100000),
    ("Ali Raza", "HR", 70000),
    ("Bob", "Engineering", 80000),
    ("Lisa", "Marketing", 65000),
    ("Aamir Shahzad", "Engineering", 100000)
]

columns = ["name", "department", "salary"]
df = spark.createDataFrame(data, columns)
df.show()

Step 2: Run explain() on a Filter Operation

# Keep only rows with salary above 75,000
filtered_df = df.filter(df.salary > 75000)

print("📌 Physical Plan (default explain):")
filtered_df.explain()
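
To see why the physical plan matters for tuning, compare the filter above with an aggregation. A groupBy forces a shuffle, which appears in the physical plan as an Exchange operator. The aggregation below is a minimal sketch, not part of the original example:

# Aggregations repartition data by key; look for
# "Exchange hashpartitioning" in the physical plan output
avg_salary_df = df.groupBy("department").avg("salary")
avg_salary_df.explain()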

Step 3: Use explain('extended') to view all plan stages

print("📌 Full Plan (explain with 'extended'):")
filtered_df.explain("extended")
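
In Spark 3.0 and later, explain() also accepts other modes, such as "formatted" (a numbered operator tree with per-operator details), "cost" (plan statistics where available), and "codegen" (the generated Java code). A quick sketch using "formatted":

# Spark 3.0+: prints a numbered operator tree plus per-operator details
print("📌 Formatted Plan (explain with 'formatted'):")
filtered_df.explain("formatted")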

Summary

  • Use explain() to debug and optimize your PySpark queries.
  • It reveals how Spark parses, optimizes, and physically plans your DataFrame code – see the sketch below for the optimizer in action.
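
As a final illustration of the optimizer at work, chain two filters and compare the Analyzed and Optimized logical plans in the "extended" output: Catalyst merges them into a single predicate. This example is a sketch added for illustration:

# Catalyst combines adjacent filters into one predicate;
# compare the Analyzed vs Optimized logical plans in the output
chained_df = df.filter(df.salary > 60000).filter(df.department == "Engineering")
chained_df.explain("extended")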
