What is explain() in PySpark?
The explain() function in PySpark is used to understand how Spark plans to execute a query. It shows:
- Logical Plan – what you asked Spark to do (before and after optimization)
- Physical Plan – how Spark will actually execute it on the cluster
This is useful for debugging and performance tuning.
Step 1: Create Sample DataFrame
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; in a notebook or the pyspark shell, `spark` already exists
spark = SparkSession.builder.appName("ExplainDemo").getOrCreate()

data = [
    ("Aamir Shahzad", "Engineering", 100000),
    ("Ali Raza", "HR", 70000),
    ("Bob", "Engineering", 80000),
    ("Lisa", "Marketing", 65000),
    ("Aamir Shahzad", "Engineering", 100000)
]
columns = ["name", "department", "salary"]
df = spark.createDataFrame(data, columns)
df.show()
Step 2: Run explain() on a Filter Operation
filtered_df = df.filter(df.salary > 75000)

print("📌 Physical Plan (default explain):")
filtered_df.explain()  # with no arguments, explain() prints only the physical plan
Step 3: Use explain('extended') to view all plan stages
print("📌 Full Plan (explain with 'extended'):")
filtered_df.explain("extended")
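Since Spark 3.0, explain() also accepts a mode argument with a few additional output formats. A quick illustrative sketch, reusing the same filtered_df:

print("📌 Formatted Plan:")
filtered_df.explain(mode="formatted")  # compact physical plan followed by per-node details
filtered_df.explain(mode="cost")       # logical plans annotated with size/row statistics where available
filtered_df.explain(mode="codegen")    # generated code from whole-stage code generation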
Summary
- Use explain() to debug and optimize your PySpark queries.
- It reveals how Spark interprets and plans your DataFrame code.
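As a small follow-up sketch (not one of the original steps above), running explain() on an aggregation makes the planning even more visible: the physical plan shows a shuffle (Exchange) between the partial and final aggregation steps.

agg_df = df.groupBy("department").avg("salary")
agg_df.explain("extended")  # look for HashAggregate and Exchange nodes in the physical plan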