🧩 PySpark DataFrame.to() Function
Schema Reconciliation and Column Reordering Made Easy
Learn how to use the DataFrame.to() function introduced in PySpark 3.4.0 to reorder columns and reconcile schema effortlessly.
📘 Introduction
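PySpark’s DataFrame.to() function reconciles a DataFrame against a target schema in one step: it reorders columns by name, drops columns the schema does not list, and casts columns to the target types when the types are compatible (for example, numeric to string). Incompatible casts such as string to int, and columns that the schema requires but the DataFrame lacks, cause the call to fail. This is especially useful when you're writing to tables, preparing data for joins, or ensuring schema compliance.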
🔧 PySpark Code Example
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, BooleanType
# Create SparkSession
spark = SparkSession.builder.appName("DataFrame.to Example").getOrCreate()
# Sample data
data = [
    ("Aamir Shahzad", "Math", 85, True),
    ("Ali Raza", "Science", 78, False),
    ("Bob", "History", 92, True),
    ("Lisa", "Math", 80, False)
]
columns = ["Name", "Subject", "Score", "Passed"]
df = spark.createDataFrame(data, columns)
# Print original schema and data
df.show(truncate=False)
df.printSchema()
# Define new schema with reordered and altered types
schema = StructType([
    StructField("Passed", BooleanType(), True),
    StructField("Score", StringType(), True),  # Cast from long (the inferred type) to string
    StructField("Name", StringType(), True),
    StructField("Subject", StringType(), True)
])
# Apply .to() transformation
df2 = df.to(schema)
# Show transformed DataFrame
df2.show(truncate=False)
df2.printSchema()
📊 Original DataFrame Output
+-------------+-------+-----+------+
|Name         |Subject|Score|Passed|
+-------------+-------+-----+------+
|Aamir Shahzad|Math   |85   |true  |
|Ali Raza     |Science|78   |false |
|Bob          |History|92   |true  |
|Lisa         |Math   |80   |false |
+-------------+-------+-----+------+
root
 |-- Name: string (nullable = true)
 |-- Subject: string (nullable = true)
 |-- Score: long (nullable = true)
 |-- Passed: boolean (nullable = true)
✅ New DataFrame Output (After .to())
+------+-----+-------------+-------+
|Passed|Score|Name         |Subject|
+------+-----+-------------+-------+
|true  |85   |Aamir Shahzad|Math   |
|false |78   |Ali Raza     |Science|
|true  |92   |Bob          |History|
|false |80   |Lisa         |Math   |
+------+-----+-------------+-------+
root
 |-- Passed: boolean (nullable = true)
 |-- Score: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Subject: string (nullable = true)
💡 Key Takeaways
- DataFrame.to() allows column reordering and type conversion between compatible types.
- It is especially useful when writing to pre-defined tables or aligning multiple datasets.
- Available in PySpark 3.4.0 and later.


