PySpark Tutorial: PySpark stat Function Tutorial – Perform Statistical Analysis on DataFrames Easily

PySpark stat() Function Tutorial – Perform Statistical Analysis

Introduction

In PySpark, stat is a property of a DataFrame (accessed as df.stat, without parentheses) that returns a DataFrameStatFunctions object. That object exposes statistical methods such as crosstab, freqItems, cov, and corr. This tutorial shows how to use them to perform basic statistical analysis directly on Spark DataFrames.

1. Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("StatFunctionDemo") \
    .getOrCreate()

2. Sample DataFrame

data = [
    ("Aamir", "Pakistan", 85),
    ("Ali", "USA", 78),
    ("Bob", "UK", 85),
    ("Lisa", "Canada", 92),
    ("Ali", "USA", 65)
]

columns = ["Name", "Country", "Score"]
df = spark.createDataFrame(data, columns)
df.show()

Output:

+-----+--------+-----+
| Name| Country|Score|
+-----+--------+-----+
|Aamir|Pakistan|   85|
|  Ali|     USA|   78|
|  Bob|      UK|   85|
| Lisa|  Canada|   92|
|  Ali|     USA|   65|
+-----+--------+-----+

3. Crosstab – Name vs Country

df.stat.crosstab("Name", "Country").show()

Output (row order may vary between runs):

+------------+------+--------+---+---+
|Name_Country|Canada|Pakistan| UK|USA|
+------------+------+--------+---+---+
|        Lisa|     1|       0|  0|  0|
|         Ali|     0|       0|  0|  2|
|       Aamir|     0|       1|  0|  0|
|         Bob|     0|       0|  1|  0|
+------------+------+--------+---+---+
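
As a cross-check, the same contingency table can be built in plain Python without Spark. This is a minimal sketch that assumes the five sample rows above; the names pair_counts and table are illustrative and are not part of the PySpark API.

```python
from collections import Counter

# The same five sample rows used in the tutorial.
rows = [
    ("Aamir", "Pakistan", 85),
    ("Ali", "USA", 78),
    ("Bob", "UK", 85),
    ("Lisa", "Canada", 92),
    ("Ali", "USA", 65),
]

# Count how many rows contain each (Name, Country) pair.
pair_counts = Counter((name, country) for name, country, _ in rows)

names = sorted({name for name, _, _ in rows})
countries = sorted({country for _, country, _ in rows})

# One table row per name, one column per country; absent pairs count as 0.
table = {
    name: [pair_counts.get((name, country), 0) for country in countries]
    for name in names
}

print("Name_Country", countries)
for name, counts in table.items():
    print(name, counts)
```

Each cell is simply the number of rows in which that name and that country occur together, which is why Ali/USA is 2 and every other populated cell is 1.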

4. Frequent Items in Name and Country

df.stat.freqItems(["Name", "Country"], support=0.3).show(truncate=False)

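Output (freqItems uses an approximate algorithm, so the result may contain false positives and can vary with how the data is partitioned):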

+------------------+------------------+
|Name_freqItems    |Country_freqItems |
+------------------+------------------+
|[Ali]             |[USA]             |
+------------------+------------------+
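
To see which values truly meet the 0.3 support threshold, the exact relative frequencies can be computed in plain Python. This sketch is an exact count, not the approximate algorithm Spark uses, and exact_frequent_items is a hypothetical helper name introduced only for illustration.

```python
from collections import Counter

# Column values taken from the sample DataFrame.
names = ["Aamir", "Ali", "Bob", "Lisa", "Ali"]
countries = ["Pakistan", "USA", "UK", "Canada", "USA"]
support = 0.3


def exact_frequent_items(values, support):
    """Return the values whose relative frequency is at least `support`."""
    counts = Counter(values)
    total = len(values)
    return sorted(v for v, c in counts.items() if c / total >= support)


frequent_names = exact_frequent_items(names, support)
frequent_countries = exact_frequent_items(countries, support)

print(frequent_names)      # ['Ali']  (2 of 5 rows = 0.4)
print(frequent_countries)  # ['USA']  (2 of 5 rows = 0.4)
```

Every other value appears in 1 of 5 rows (0.2), which is below the threshold. An exact count like this is only practical on small data; on large DataFrames the approximate method trades some accuracy for a single pass with bounded memory.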

5. Covariance – Score & Bonus

# Add bonus column for demonstration
df2 = df.withColumn("Bonus", (df["Score"] * 0.1))

# Covariance between Score and Bonus
cov_val = df2.stat.cov("Score", "Bonus")
print(f"Covariance between Score and Bonus: {cov_val}")

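Output (cov returns the sample covariance, which for this data is approximately 10.45; the trailing digits may differ slightly because of floating-point rounding):

Covariance between Score and Bonus: 10.45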
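
Spark's cov computes the sample covariance, which divides by n - 1 rather than n. The plain-Python sketch below applies that formula to the five sample scores as a sanity check; sample_covariance is an illustrative helper, not a PySpark function.

```python
# Scores from the sample DataFrame and the derived Bonus column (Score * 0.1).
scores = [85, 78, 85, 92, 65]
bonuses = [s * 0.1 for s in scores]


def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


cov_check = sample_covariance(scores, bonuses)
print(round(cov_check, 6))  # 10.45
```

The mean score is 81, the squared deviations sum to 418, and the sample variance of Score is therefore 418 / 4 = 104.5. Because Bonus is 0.1 times Score, the covariance is 0.1 * 104.5 = 10.45.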

6. Correlation – Score & Bonus

corr_val = df2.stat.corr("Score", "Bonus")
print(f"Correlation between Score and Bonus: {corr_val}")

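Output (Bonus is an exact linear multiple of Score, so the Pearson correlation is 1 up to floating-point rounding):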

Correlation between Score and Bonus: 0.9999999999999998
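
The Pearson correlation is the covariance divided by the product of the two standard deviations. A minimal plain-Python sketch of that formula on the same data follows; pearson is an illustrative helper name and not part of PySpark.

```python
import math

# Scores from the sample DataFrame and the derived Bonus column (Score * 0.1).
scores = [85, 78, 85, 92, 65]
bonuses = [s * 0.1 for s in scores]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # The n - 1 factors cancel, so they are omitted here.
    return cov / math.sqrt(var_x * var_y)


corr_check = pearson(scores, bonuses)
print(round(corr_check, 6))  # 1.0
```

A perfect correlation here says nothing interesting about the data: it only reflects that Bonus was constructed from Score. On real columns, values closer to 0 indicate a weaker linear relationship.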

© 2025 Aamir Shahzad. All rights reserved.
