PySpark DataFrame.to() Function | Schema Reconciliation and Column Reordering Made Easy

🧩 PySpark DataFrame.to() Function

Schema Reconciliation and Column Reordering Made Easy

Learn how to use the DataFrame.to() function, introduced in PySpark 3.4.0, to reorder columns and reconcile schemas in a single call.

📘 Introduction

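PySpark’s DataFrame.to() function returns a new DataFrame whose columns match a target schema. In one step it reorders columns by name, casts them to the target types where the conversion is compatible (for example, numeric to string), and drops columns the target schema does not list. This is especially useful when you're writing to tables, preparing data for joins, or ensuring schema compliance.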

🔧 PySpark Code Example

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, BooleanType

# Create SparkSession
spark = SparkSession.builder.appName("DataFrame.to Example").getOrCreate()

# Sample data
data = [
    ("Aamir Shahzad", "Math", 85, True),
    ("Ali Raza", "Science", 78, False),
    ("Bob", "History", 92, True),
    ("Lisa", "Math", 80, False)
]

columns = ["Name", "Subject", "Score", "Passed"]
df = spark.createDataFrame(data, columns)

# Print original schema and data
df.show(truncate=False)
df.printSchema()

# Define the target schema: columns reordered, Score changed to string
schema = StructType([
    StructField("Passed", BooleanType(), True),
    StructField("Score", StringType(), True),  # Cast from long (inferred from Python int) to string
    StructField("Name", StringType(), True),
    StructField("Subject", StringType(), True)
])

# Apply .to() transformation
df2 = df.to(schema)

# Show transformed DataFrame
df2.show(truncate=False)
df2.printSchema()

📊 Original DataFrame Output

+-------------+-------+-----+------+
|Name         |Subject|Score|Passed|
+-------------+-------+-----+------+
|Aamir Shahzad|Math   |85   |true  |
|Ali Raza     |Science|78   |false |
|Bob          |History|92   |true  |
|Lisa         |Math   |80   |false |
+-------------+-------+-----+------+

root
 |-- Name: string (nullable = true)
 |-- Subject: string (nullable = true)
 |-- Score: long (nullable = true)
 |-- Passed: boolean (nullable = true)

✅ New DataFrame Output (After `.to()`)

+------+-----+-------------+-------+
|Passed|Score|Name         |Subject|
+------+-----+-------------+-------+
|true  |85   |Aamir Shahzad|Math   |
|false |78   |Ali Raza     |Science|
|true  |92   |Bob          |History|
|false |80   |Lisa         |Math   |
+------+-----+-------------+-------+

root
 |-- Passed: boolean (nullable = true)
 |-- Score: string (nullable = true)
 |-- Name: string (nullable = true)
 |-- Subject: string (nullable = true)

💡 Key Takeaways

DataFrame.to() reorders columns and converts their types to match a target schema in a single call.
It is especially useful when writing to pre-defined tables or aligning multiple datasets.
Available in PySpark 3.4.0 and later.

