PySpark Tutorial: How to Use crosstab() to Analyze Relationships Between Columns
This tutorial will show you how to use the crosstab() function in PySpark to create frequency tables and understand the relationship between two categorical columns.
1. What is crosstab() in PySpark?
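The crosstab() function in PySpark generates a contingency table (cross-tabulation) between two columns. It counts how often each combination of values from the two columns occurs: the distinct values of the first column become the rows, and the distinct values of the second column become the output columns. The first column of the result is named after both inputs (here, Name_Country).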
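To make the idea of a contingency table concrete, here is a minimal plain-Python sketch of the same counting logic. It does not use Spark, and the `pairs` data is hypothetical and chosen only for illustration; it is not how PySpark implements crosstab() internally.

```python
from collections import Counter

# Hypothetical (name, country) pairs, used only for illustration
pairs = [
    ("Bob", "USA"),
    ("Lisa", "Canada"),
    ("Bob", "USA"),
    ("Lisa", "USA"),
]

# Count how often each (row value, column value) combination occurs
counts = Counter(pairs)

# Distinct values of the first field become rows, of the second become columns
row_values = sorted({name for name, _ in pairs})
col_values = sorted({country for _, country in pairs})

# Combinations that never occur get a count of 0
table = {
    name: {country: counts[(name, country)] for country in col_values}
    for name in row_values
}

for name, row in table.items():
    print(name, row)
```

Each row of `table` corresponds to one row of a crosstab result, with one count per distinct value of the second field.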
2. Create Spark Session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Crosstab Example") \
    .getOrCreate()
3. Create Sample DataFrame
data = [
    (1, "Aamir Shahzad", "Pakistan"),
    (2, "Ali Raza", "Pakistan"),
    (3, "Bob", "USA"),
    (4, "Lisa", "Canada"),
    (5, "Aamir Shahzad", "Pakistan"),
    (6, "Ali Raza", "Pakistan"),
    (7, "Bob", "USA"),
    (8, "Lisa", "Canada"),
    (9, "Aamir Shahzad", "Pakistan"),
    (10, "Ali Raza", "USA"),
    (11, "Bob", "USA"),
    (12, "Lisa", "Canada")
]
columns = ["ID", "Name", "Country"]
df = spark.createDataFrame(data, columns)
print("Original DataFrame:")
df.show()
+---+--------------+--------+
| ID|          Name| Country|
+---+--------------+--------+
|  1| Aamir Shahzad|Pakistan|
|  2|      Ali Raza|Pakistan|
|  3|           Bob|     USA|
|  4|          Lisa|  Canada|
|  5| Aamir Shahzad|Pakistan|
|  6|      Ali Raza|Pakistan|
|  7|           Bob|     USA|
|  8|          Lisa|  Canada|
|  9| Aamir Shahzad|Pakistan|
| 10|      Ali Raza|     USA|
| 11|           Bob|     USA|
| 12|          Lisa|  Canada|
+---+--------------+--------+
4. Apply crosstab() Between Name and Country
crosstab_df = df.crosstab("Name", "Country")
print("Crosstab between Name and Country:")
crosstab_df.show(truncate=False)
+----------------+------+--------+----+
|Name_Country    |Canada|Pakistan|USA |
+----------------+------+--------+----+
|Aamir Shahzad   |0     |3       |0   |
|Ali Raza        |0     |2       |1   |
|Bob             |0     |0       |3   |
|Lisa            |3     |0       |0   |
+----------------+------+--------+----+
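The same counts can usually be produced with a group-by and pivot, which is sometimes convenient when further aggregation is needed. The sketch below is an alternative to crosstab(), not what the tutorial above uses, and the helper name `crosstab_via_pivot` is hypothetical. It assumes the DataFrame has no nulls in either column, since the two approaches may treat nulls differently.

```python
def crosstab_via_pivot(df, row_col, col_col):
    """Approximate df.crosstab(row_col, col_col) using groupBy/pivot.

    Hypothetical helper for illustration. `df` is expected to be a
    PySpark DataFrame; only standard DataFrame methods are called.
    """
    return (
        df.groupBy(row_col)   # one output row per distinct row_col value
          .pivot(col_col)     # one output column per distinct col_col value
          .count()            # number of input rows for each combination
          .na.fill(0)         # combinations that never occur come back as null
    )
```

With the DataFrame from step 3, calling crosstab_via_pivot(df, "Name", "Country").show(truncate=False) should give the same counts as the table above, except that the first column is named Name rather than Name_Country.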