How to Use crossJoin() Function for Cartesian Product | PySpark Tutorial

PySpark crossJoin() Function Tutorial

Cartesian Product of DataFrames in PySpark

In this tutorial, you will learn how to use the crossJoin() method in PySpark to generate the Cartesian product of two DataFrames. This operation pairs every row of the first DataFrame with every row of the second, without any join condition, so the result has (rows in the first) x (rows in the second) rows.
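
Before turning to Spark, the idea of a Cartesian product can be illustrated with a minimal plain-Python sketch. This is not how Spark implements crossJoin(); it only shows the pairing logic on small in-memory lists using itertools.product from the standard library. The names people, hobbies, and pairs are chosen here for illustration only.

```python
from itertools import product

# Two small "tables" represented as lists of tuples (illustrative data only)
people = [("Aamir Shahzad", "Pakistan"), ("Ali Raza", "USA")]
hobbies = [("Reading",), ("Traveling",), ("Cricket",)]

# Pair every person row with every hobby row and concatenate the tuples
pairs = [person + hobby for person, hobby in product(people, hobbies)]

for row in pairs:
    print(row)

# 2 people x 3 hobbies = 6 combined rows
print(len(pairs))
```

The row count is always the product of the two input sizes, which is the same rule that applies to the DataFrames in the steps below.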

Step 1: Import and Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark crossJoin Example") \
    .getOrCreate()

Step 2: Create Sample Data

# DataFrame 1: People
data_people = [
    ("Aamir Shahzad", "Pakistan"),
    ("Ali Raza", "USA"),
    ("Bob", "UK"),
    ("Lisa", "Canada")
]
df_people = spark.createDataFrame(data_people, ["Name", "Country"])

# DataFrame 2: Hobbies
data_hobbies = [
    ("Reading",),
    ("Traveling",),
    ("Cricket",)
]
df_hobbies = spark.createDataFrame(data_hobbies, ["Hobby"])

People DataFrame Output:

Name | Country
------------------------
Aamir Shahzad | Pakistan
Ali Raza | USA
Bob | UK
Lisa | Canada

Hobbies DataFrame Output:

Hobby
------
Reading
Traveling
Cricket

Step 3: Perform crossJoin()

# Perform Cartesian join: each of the 4 people is paired with each of the 3 hobbies (12 rows)
cross_join_result = df_people.crossJoin(df_hobbies)

# Show result
cross_join_result.show(truncate=False)

Output:

Name | Country | Hobby
------------------------------------
Aamir Shahzad | Pakistan | Reading
Aamir Shahzad | Pakistan | Traveling
Aamir Shahzad | Pakistan | Cricket
Ali Raza | USA | Reading
Ali Raza | USA | Traveling
Ali Raza | USA | Cricket
Bob | UK | Reading
Bob | UK | Traveling
Bob | UK | Cricket
Lisa | Canada | Reading
Lisa | Canada | Traveling
Lisa | Canada | Cricket

Author: Aamir Shahzad

© 2025 PySpark Tutorials. All rights reserved.