How to Use Joins in PySpark: Inner, Left, Right & Outer Explained
In this tutorial, we’ll explore how different types of joins work in PySpark, including Inner Join, Left Join, Right Join, and Outer Join.
🔸 Step 1: Create Sample DataFrames
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType
spark = SparkSession.builder.appName("PySparkJoinExample").getOrCreate()
data_people = [
("1", "Aamir Shahzad"),
("2", "Ali Raza"),
("3", "Bob"),
("4", "Lisa"),
("5", "John"),
("6", None)
]
schema_people = StructType([
StructField("id", StringType(), True),
StructField("name", StringType(), True)
])
df_people = spark.createDataFrame(data_people, schema_people)
df_people.show()
Output:
+---+-------------+
| id| name|
+---+-------------+
| 1| Aamir Shahzad|
| 2| Ali Raza|
| 3| Bob|
| 4| Lisa|
| 5| John|
| 6| null|
+---+-------------+
🔸 Step 2: Create Department DataFrame
data_dept = [
("1", "Engineering"),
("2", "Marketing"),
("4", "HR"),
("6", "Sales")
]
schema_dept = StructType([
StructField("id", StringType(), True),
StructField("department", StringType(), True)
])
df_dept = spark.createDataFrame(data_dept, schema_dept)
df_dept.show()
Output:
+---+-----------+
| id| department|
+---+-----------+
| 1|Engineering|
| 2| Marketing |
| 4| HR |
| 6| Sales |
+---+-----------+
🔸 Example 1: INNER JOIN
df_inner = df_people.join(df_dept, on="id", how="inner")
df_inner.show()
Output: Only matching records
+---+-------------+-----------+
| id| name| department|
+---+-------------+-----------+
| 1| Aamir Shahzad|Engineering|
| 2| Ali Raza| Marketing |
| 4| Lisa| HR |
+---+-------------+-----------+
🔸 Example 2: LEFT JOIN
df_left = df_people.join(df_dept, df_people.id == df_dept.id, how="left")
df_left.show()
Output: All people with departments if available
🔸 Example 3: RIGHT JOIN
df_right = df_people.join(df_dept, on="id", how="right")
df_right.show()
Output: All departments with people if available
🔸 Example 4: FULL OUTER JOIN
df_outer = df_people.join(df_dept, on="id", how="outer")
df_outer.show()
Output: All records from both DataFrames
+---+-------------+-----------+
| id| name| department|
+---+-------------+-----------+
| 1| Aamir Shahzad|Engineering|
| 2| Ali Raza| Marketing |
| 3| Bob| null|
| 4| Lisa| HR |
| 5| John| null|
| 6| null| Sales |
+---+-------------+-----------+
No comments:
Post a Comment