PySpark Tutorial: How to Use the withColumnsRenamed() Function
This tutorial covers how to rename multiple columns in a PySpark DataFrame using the withColumnsRenamed() function. It’s a cleaner, reusable alternative to chaining multiple withColumnRenamed() calls.
What is withColumnsRenamed() in PySpark?
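PySpark introduced the DataFrame method withColumnsRenamed() in version 3.4.0 to allow renaming multiple columns in a single call. You provide a dictionary mapping old column names to new ones, and the method returns a new DataFrame; the original is left unchanged. Names in the dictionary that do not exist in the DataFrame are ignored rather than raising an error.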
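For clusters that may still run a Spark version older than 3.4.0, the mapping-based call can be approximated with the single-column method. The sketch below is one possible fallback, not part of PySpark: rename_columns is a hypothetical helper name, and it assumes no new name in the mapping is also an old name in the same mapping, since chained renames are applied one after another.

```python
from functools import reduce


def rename_columns(df, mapping):
    """Rename columns using `mapping` (old name -> new name).

    Hypothetical helper, not a PySpark API. `df` is expected to be a
    PySpark DataFrame.
    """
    # Spark 3.4.0+ exposes withColumnsRenamed(); use it when present.
    if hasattr(df, "withColumnsRenamed"):
        return df.withColumnsRenamed(mapping)

    # Older versions: apply withColumnRenamed() once per mapping entry.
    return reduce(
        lambda acc, names: acc.withColumnRenamed(names[0], names[1]),
        mapping.items(),
        df,
    )
```

Called as rename_columns(df, {"FullName": "Name"}), it should behave like the built-in method for simple, non-overlapping mappings on either side of the 3.4.0 boundary.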
Step 1: Create Sample Data
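from pyspark.sql import SparkSession

# Reuse the active session (for example in a notebook) or start a new one
spark = SparkSession.builder.appName("withColumnsRenamedExample").getOrCreate()

# Sample rows: (full name, country, age in years)
data = [
    ("Aamir Shahzad", "Pakistan", 25),
    ("Ali Raza", "USA", 30),
    ("Bob", "UK", 45),
    ("Lisa", "Canada", 35)
]
df = spark.createDataFrame(data, ["FullName", "Country", "AgeYears"])
print("Original DataFrame:")
df.show()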
+-------------+--------+--------+
|     FullName| Country|AgeYears|
+-------------+--------+--------+
|Aamir Shahzad|Pakistan|      25|
|     Ali Raza|     USA|      30|
|          Bob|      UK|      45|
|         Lisa|  Canada|      35|
+-------------+--------+--------+
Step 2: Rename Multiple Columns using withColumnsRenamed()
renamed_df = df.withColumnsRenamed({
    "FullName": "Name",
    "Country": "Nationality",
    "AgeYears": "Age"
})
print("DataFrame with Renamed Columns:")
renamed_df.show()
+-------------+-----------+---+
|         Name|Nationality|Age|
+-------------+-----------+---+
|Aamir Shahzad|   Pakistan| 25|
|     Ali Raza|        USA| 30|
|          Bob|         UK| 45|
|         Lisa|     Canada| 35|
+-------------+-----------+---+
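Because the argument is an ordinary dictionary, the mapping does not have to be typed by hand; it can be built from the column names themselves. The sketch below converts the sample column names to snake_case. The to_snake_case helper is a hypothetical name introduced for illustration, and the column list is hard-coded here so the snippet runs without a Spark session; in practice it would likely come from df.columns.

```python
import re


def to_snake_case(name):
    """Convert a CamelCase name such as 'FullName' to 'full_name'."""
    # Insert "_" before every capital letter except the first character.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


# In a real job this list would come from df.columns.
columns = ["FullName", "Country", "AgeYears"]

# Old name -> new name, in the shape withColumnsRenamed() expects.
mapping = {old: to_snake_case(old) for old in columns}
print(mapping)
# {'FullName': 'full_name', 'Country': 'country', 'AgeYears': 'age_years'}

# With a DataFrame in scope: snake_df = df.withColumnsRenamed(mapping)
```

The same mapping can then be passed to any DataFrame that shares these column names.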
Why use withColumnsRenamed()?
- Cleaner syntax for renaming multiple columns.
- More readable and concise than chaining multiple withColumnRenamed() calls.
- Reduces the chance of human error when renaming several columns.
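One source of human error remains: withColumnsRenamed() ignores dictionary keys that match no column, so a misspelled old name goes unnoticed. A small check before the call can catch this. The sketch below is one possible approach; check_rename_mapping is a hypothetical helper, and it compares names exactly, which may be stricter than Spark's own matching if the session is configured to be case-insensitive.

```python
def check_rename_mapping(columns, mapping):
    """Raise ValueError if `mapping` would not rename cleanly.

    Hypothetical helper: `columns` is the list from df.columns and
    `mapping` is the dict intended for withColumnsRenamed().
    """
    # Keys that match no column would be skipped without any warning.
    unknown = [old for old in mapping if old not in columns]
    if unknown:
        raise ValueError(f"Columns not found: {unknown}")

    # Column names as they would look after the rename.
    renamed = [mapping.get(col, col) for col in columns]
    duplicates = sorted({name for name in renamed if renamed.count(name) > 1})
    if duplicates:
        raise ValueError(f"Rename would create duplicate names: {duplicates}")
```

A typical use would be to call check_rename_mapping(df.columns, mapping) immediately before df.withColumnsRenamed(mapping), so that a typo such as "Fullname" fails loudly instead of leaving the column unrenamed.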