PySpark Joins Explained | Inner, Left, Right Join with Examples in PySpark DataFrames #pyspark | PySpark Tutorial

How to Use Joins in PySpark | Inner, Left, Right, and Outer Explained

How to Use Joins in PySpark: Inner, Left, Right & Outer Explained

In this tutorial, we’ll explore how different types of joins work in PySpark, including Inner Join, Left Join, Right Join, and Outer Join.

🔸 Step 1: Create Sample DataFrames


from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("PySparkJoinExample").getOrCreate()

data_people = [
    ("1", "Aamir Shahzad"),
    ("2", "Ali Raza"),
    ("3", "Bob"),
    ("4", "Lisa"),
    ("5", "John"),
    ("6", None)
]

schema_people = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True)
])

df_people = spark.createDataFrame(data_people, schema_people)
df_people.show()

Output:


+---+-------------+
| id|         name|
+---+-------------+
|  1| Aamir Shahzad|
|  2|      Ali Raza|
|  3|           Bob|
|  4|          Lisa|
|  5|          John|
|  6|          null|
+---+-------------+

🔸 Step 2: Create Department DataFrame


data_dept = [
    ("1", "Engineering"),
    ("2", "Marketing"),
    ("4", "HR"),
    ("6", "Sales")
]

schema_dept = StructType([
    StructField("id", StringType(), True),
    StructField("department", StringType(), True)
])

df_dept = spark.createDataFrame(data_dept, schema_dept)
df_dept.show()

Output:


+---+-----------+
| id| department|
+---+-----------+
|  1|Engineering|
|  2| Marketing |
|  4|        HR |
|  6|     Sales |
+---+-----------+

🔸 Example 1: INNER JOIN


df_inner = df_people.join(df_dept, on="id", how="inner")
df_inner.show()

Output: Only matching records


+---+-------------+-----------+
| id|         name| department|
+---+-------------+-----------+
|  1| Aamir Shahzad|Engineering|
|  2|      Ali Raza| Marketing |
|  4|          Lisa|        HR |
+---+-------------+-----------+

🔸 Example 2: LEFT JOIN


df_left = df_people.join(df_dept, df_people.id == df_dept.id, how="left")
df_left.show()

Output: All people with departments if available

🔸 Example 3: RIGHT JOIN


df_right = df_people.join(df_dept, on="id", how="right")
df_right.show()

Output: All departments with people if available

🔸 Example 4: FULL OUTER JOIN


df_outer = df_people.join(df_dept, on="id", how="outer")
df_outer.show()

Output: All records from both DataFrames


+---+-------------+-----------+
| id|         name| department|
+---+-------------+-----------+
|  1| Aamir Shahzad|Engineering|
|  2|      Ali Raza| Marketing |
|  3|           Bob|       null|
|  4|          Lisa|        HR |
|  5|          John|       null|
|  6|          null|     Sales |
+---+-------------+-----------+

📺 Watch Full Tutorial

No comments:

Post a Comment