PySpark Tutorial: How to Use crosstab() to Analyze Relationships Between Columns #databricks


This tutorial shows how to use the crosstab() function in PySpark to build frequency tables and examine the relationship between two categorical columns.

1. What is crosstab() in PySpark?

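The crosstab() function in PySpark generates a contingency table (cross-tabulation) between two columns. It counts how many rows contain each combination of values from the two columns. In the result, the first column is named after both inputs (for example, Name_Country) and holds the distinct values of the first column, each remaining column is named after a distinct value of the second column, and combinations that never occur get a count of 0.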
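To make the idea concrete, here is a minimal pure-Python sketch of what a contingency table counts, with no Spark involved. The helper name contingency_table and the sample rows are hypothetical and exist only for illustration; Spark's own implementation is distributed and differs in detail.

```python
from collections import Counter

def contingency_table(rows, i, j):
    """Count how often each pair of values at positions i and j occurs together."""
    pair_counts = Counter((row[i], row[j]) for row in rows)
    row_values = sorted({r for r, _ in pair_counts})
    col_values = sorted({c for _, c in pair_counts})
    # Combinations that never occur get an explicit 0, as in crosstab() output.
    return {
        r: {c: pair_counts.get((r, c), 0) for c in col_values}
        for r in row_values
    }

# Hypothetical sample rows: (name, country)
rows = [
    ("Bob", "USA"),
    ("Lisa", "Canada"),
    ("Bob", "USA"),
    ("Ali Raza", "Pakistan"),
]

table = contingency_table(rows, 0, 1)
for name, counts in table.items():
    print(name, counts)
```

Each outer key plays the role of a row in the crosstab() result, and each inner key plays the role of one of its count columns.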

2. Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Crosstab Example") \
    .getOrCreate()

3. Create Sample DataFrame

data = [
    (1, "Aamir Shahzad", "Pakistan"),
    (2, "Ali Raza", "Pakistan"),
    (3, "Bob", "USA"),
    (4, "Lisa", "Canada"),
    (5, "Aamir Shahzad", "Pakistan"),
    (6, "Ali Raza", "Pakistan"),
    (7, "Bob", "USA"),
    (8, "Lisa", "Canada"),
    (9, "Aamir Shahzad", "Pakistan"),
    (10, "Ali Raza", "USA"),
    (11, "Bob", "USA"),
    (12, "Lisa", "Canada")
]

columns = ["ID", "Name", "Country"]

df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show()
+---+-------------+--------+
| ID|         Name| Country|
+---+-------------+--------+
|  1|Aamir Shahzad|Pakistan|
|  2|     Ali Raza|Pakistan|
|  3|          Bob|     USA|
|  4|         Lisa|  Canada|
|  5|Aamir Shahzad|Pakistan|
|  6|     Ali Raza|Pakistan|
|  7|          Bob|     USA|
|  8|         Lisa|  Canada|
|  9|Aamir Shahzad|Pakistan|
| 10|     Ali Raza|     USA|
| 11|          Bob|     USA|
| 12|         Lisa|  Canada|
+---+-------------+--------+

4. Apply crosstab() Between Name and Country

crosstab_df = df.crosstab("Name", "Country")

print("Crosstab between Name and Country:")
crosstab_df.show(truncate=False)
+-------------+------+--------+---+
|Name_Country |Canada|Pakistan|USA|
+-------------+------+--------+---+
|Aamir Shahzad|0     |3       |0  |
|Ali Raza     |0     |2       |1  |
|Bob          |0     |0       |3  |
|Lisa         |3     |0       |0  |
+-------------+------+--------+---+


Author: Aamir Shahzad

© 2024 PySpark Tutorials. All rights reserved.
