How to Use toDF() in PySpark – Rename All DataFrame Columns Fast
PySpark’s toDF() function lets you rename all columns in a DataFrame in one go. This is especially helpful when dealing with raw datasets or when you want cleaner, more readable column names for downstream processing.
Step 1: Create a Sample DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (1, "Aamir Shahzad"),
    (2, "Ali Raza"),
    (3, "Bob"),
    (4, "Lisa")
]

original_df = spark.createDataFrame(data, ["id", "name"])
print("📌 Original DataFrame with Default Column Names:")
original_df.show()
Step 2: Rename All Columns Using toDF()
renamed_df = original_df.toDF("user_id", "FirstName")
print("📌 DataFrame After Renaming Columns Using toDF():")
renamed_df.show()
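If the DataFrame has many columns, you can also build the new names programmatically and unpack them into toDF(). The snippet below is an illustrative sketch rather than part of the original example; new_names and cleaned_df are hypothetical names, and the lowercase/underscore rule is just one possible cleanup convention.

# Illustrative sketch: derive cleaned names from the existing ones,
# then pass them all to toDF() using * unpacking.
new_names = [c.strip().lower().replace(" ", "_") for c in original_df.columns]
cleaned_df = original_df.toDF(*new_names)
cleaned_df.show()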
Step 3: Compare Schema Before and After
print("📌 Schema Before:")
original_df.printSchema()
print("📌 Schema After:")
renamed_df.printSchema()
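For comparison, the same rename can be done column by column with withColumnRenamed(); toDF() simply replaces the whole set of names in a single call. A minimal sketch using the column names from above (renamed_step_by_step is a hypothetical variable name):

# One withColumnRenamed() call per column; chaining grows with the column count.
renamed_step_by_step = (
    original_df
    .withColumnRenamed("id", "user_id")
    .withColumnRenamed("name", "FirstName")
)
renamed_step_by_step.printSchema()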
Step 4: Example with Mismatched Column Count (Error Case)
# toDF() requires exactly as many names as there are columns; passing a single
# name for a two-column DataFrame raises an error.
try:
    error_df = original_df.toDF("only_one_column")
except Exception as e:
    print("❌ Error:", e)
Summary
- toDF() is a quick way to rename all columns at once.
- The number of new column names must exactly match the number of columns.
- Helpful for schema clean-up and transforming raw ingestion data.