How to Use unionByName() to Join DataFrames by Column Names | PySpark Tutorial Learn by Doing it

PySpark Tutorial: unionByName() Function for Joining DataFrames

This tutorial demonstrates how to use the unionByName() function in PySpark to append the rows of one DataFrame to another, matching columns by name rather than by position.

What is unionByName() in PySpark?

The unionByName() function combines two DataFrames by aligning columns with the same name, regardless of their order.

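  • Column names must match by default (Spark 3.1 and later also accept allowMissingColumns=True, which fills columns missing from either side with nulls)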
  • Useful when schemas are the same but column orders differ
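As a rough mental model of this name-based alignment, the plain-Python sketch below reorders each row of the second table to the first table's column order before appending it. The union_by_name helper is hypothetical and written only for illustration; it does not use Spark and is not how Spark implements the operation.

```python
# Conceptual model only: plain Python, not Spark's actual implementation.
# `union_by_name` is a hypothetical helper introduced for this illustration.
def union_by_name(cols_a, rows_a, cols_b, rows_b):
    """Append rows_b to rows_a, reordering each row of b to match cols_a."""
    if set(cols_a) != set(cols_b):
        raise ValueError("column names must match")
    # For every column of the first table, find its position in the second.
    positions = [cols_b.index(name) for name in cols_a]
    reordered_b = [tuple(row[i] for i in positions) for row in rows_b]
    # The result keeps the first table's column order.
    return cols_a, rows_a + reordered_b


cols, rows = union_by_name(
    ["Name", "Country", "Age"], [("Ali Raza", "USA", 30)],
    ["Name", "Age", "Country"], [("Bob", 45, "UK")],
)
print(cols)  # ['Name', 'Country', 'Age']
print(rows)  # [('Ali Raza', 'USA', 30), ('Bob', 'UK', 45)]
```

The second table's row is rearranged from (Name, Age, Country) to (Name, Country, Age), so the combined rows all follow the first table's layout.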

Step 1: Create a Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PySpark unionByName Example").getOrCreate()

Step 2: Create Sample DataFrames

# DataFrame 1
data1 = [("Aamir Shahzad", "Pakistan", 25),
         ("Ali Raza", "USA", 30)]
df1 = spark.createDataFrame(data1, ["Name", "Country", "Age"])

# DataFrame 2 (Different column order)
data2 = [("Bob", 45, "UK"),
         ("Lisa", 35, "Canada")]
df2 = spark.createDataFrame(data2, ["Name", "Age", "Country"])

# Display both DataFrames
print("DataFrame 1:")
df1.show()
print("DataFrame 2:")
df2.show()

DataFrame 1:
+-------------+--------+---+
|         Name| Country|Age|
+-------------+--------+---+
|Aamir Shahzad|Pakistan| 25|
|     Ali Raza|     USA| 30|
+-------------+--------+---+

DataFrame 2:
+----+---+-------+
|Name|Age|Country|
+----+---+-------+
| Bob| 45|     UK|
|Lisa| 35| Canada|
+----+---+-------+

Step 3: Use unionByName() to Combine DataFrames

union_df = df1.unionByName(df2)

Step 4: Show Result

print("Union Result:")
union_df.show()

Union Result:
+-------------+--------+---+
|         Name| Country|Age|
+-------------+--------+---+
|Aamir Shahzad|Pakistan| 25|
|     Ali Raza|     USA| 30|
|          Bob|      UK| 45|
|         Lisa|  Canada| 35|
+-------------+--------+---+

Why Use unionByName()?

  • Safer alternative to union(), which matches columns by position, when column orders might differ
  • Prevents data from being mismatched due to incorrect column alignment
  • Great for combining data from different sources with consistent column names

Author: Aamir Shahzad

© 2024 PySpark Tutorials. All rights reserved.
