How to Use toDF() in PySpark – Rename All DataFrame Columns Fast

PySpark’s toDF() method lets you rename all of a DataFrame’s columns in a single call by passing the new names in order. This is especially helpful when working with raw datasets or when you want cleaner, more readable column names for downstream processing.

Step 1: Create a Sample DataFrame

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("toDF_demo").getOrCreate()

# Sample rows: (id, name)
data = [
    (1, "Aamir Shahzad"),
    (2, "Ali Raza"),
    (3, "Bob"),
    (4, "Lisa")
]
original_df = spark.createDataFrame(data, ["id", "name"])
print("📌 Original DataFrame with Default Column Names:")
original_df.show()

Step 2: Rename All Columns Using toDF()

renamed_df = original_df.toDF("user_id", "FirstName")
print("📌 DataFrame After Renaming Columns Using toDF():")
renamed_df.show()
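
If the new names come from a Python list (for example, built from a config file), you can unpack the list into toDF() with *. A minimal sketch, where new_names is a hypothetical list:

new_names = ["user_id", "FirstName"]   # hypothetical list of replacement names
# Must contain exactly one name per existing column, in order
renamed_from_list_df = original_df.toDF(*new_names)
renamed_from_list_df.show()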

Step 3: Compare Schema Before and After

print("📌 Schema Before:")
original_df.printSchema()

print("📌 Schema After:")
renamed_df.printSchema()
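
With the sample data above, the output should look roughly like this; only the column names change, while the inferred types (long for the ids, string for the names) stay the same:

# Approximate output
# Schema Before:
# root
#  |-- id: long (nullable = true)
#  |-- name: string (nullable = true)
#
# Schema After:
# root
#  |-- user_id: long (nullable = true)
#  |-- FirstName: string (nullable = true)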

Step 4: Example with Mismatched Column Count (Error Case)

try:
    # Only one name is supplied for a two-column DataFrame, so Spark rejects the call
    error_df = original_df.toDF("only_one_column")
except Exception as e:
    print("❌ Error: ", e)

Summary

  • toDF() is a quick way to rename all columns at once.
  • The number of new column names must exactly match the number of columns.
  • Helpful for schema clean-up when ingesting raw data (see the sketch below).
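
A common clean-up pattern is to normalize raw column names (lowercase, spaces replaced with underscores) in one pass. A minimal sketch, where raw_df and its messy column names are hypothetical:

# Hypothetical raw DataFrame with inconsistent column names
raw_df = spark.createDataFrame([(1, "Aamir Shahzad")], ["Customer ID", "Full Name"])

# Lowercase every name and replace spaces with underscores
clean_names = [c.strip().lower().replace(" ", "_") for c in raw_df.columns]
clean_df = raw_df.toDF(*clean_names)
clean_df.printSchema()   # columns become customer_id and full_name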
