PySpark stat() Function Tutorial – Perform Statistical Analysis

Introduction

In PySpark, `stat` is a DataFrame property (accessed as `df.stat`, without parentheses) that returns a `DataFrameStatFunctions` object. It exposes statistical methods such as `crosstab`, `freqItems`, `cov`, and `corr`. This tutorial shows how to use them for basic statistical analysis directly on Spark DataFrames.

1. Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("StatFunctionDemo") \
    .getOrCreate()

2. Sample DataFrame

data = [
    ("Aamir", "Pakistan", 85),
    ("Ali", "USA", 78),
    ("Bob", "UK", 85),
    ("Lisa", "Canada", 92),
    ("Ali", "USA", 65)
]

columns = ["Name", "Country", "Score"]
df = spark.createDataFrame(data, columns)
df.show()

Output:

+-----+--------+-----+
| Name| Country|Score|
+-----+--------+-----+
|Aamir|Pakistan|   85|
|  Ali|     USA|   78|
|  Bob|      UK|   85|
| Lisa|  Canada|   92|
|  Ali|     USA|   65|
+-----+--------+-----+

3. Crosstab – Name vs Country

df.stat.crosstab("Name", "Country").show()

Output:

+------------+------+--------+---+---+
|Name_Country|Canada|Pakistan| UK|USA|
+------------+------+--------+---+---+
|        Lisa|     1|       0|  0|  0|
|         Ali|     0|       0|  0|  2|
|       Aamir|     0|       1|  0|  0|
|         Bob|     0|       0|  1|  0|
+------------+------+--------+---+---+
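
To see what the cross-tabulation counts, the same table can be reproduced in plain Python without Spark. This is only a sketch for checking the numbers by hand: the `pair_counts` name is introduced here for illustration and is not part of PySpark.

```python
from collections import Counter

# Same rows as the sample DataFrame: (Name, Country, Score)
data = [
    ("Aamir", "Pakistan", 85),
    ("Ali", "USA", 78),
    ("Bob", "UK", 85),
    ("Lisa", "Canada", 92),
    ("Ali", "USA", 65),
]

# Count how often each (Name, Country) pair occurs
pair_counts = Counter((name, country) for name, country, _ in data)

names = sorted({name for name, _, _ in data})
countries = sorted({country for _, country, _ in data})

# One row per name, one column per country; pairs that never occur count as 0
print("Name_Country", *countries)
for name in names:
    print(name, *(pair_counts[(name, country)] for country in countries))
```

Each cell is simply the number of rows that share that Name and Country, which is why Ali/USA shows 2.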

4. Frequent Items in Name and Country

df.stat.freqItems(["Name", "Country"], support=0.3).show(truncate=False)

Output:

+------------------+------------------+
|Name_freqItems    |Country_freqItems |
+------------------+------------------+
|[Ali]             |[USA]             |
+------------------+------------------+
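
`freqItems` relies on an approximate counting algorithm, so its result can contain false positives. On a small dataset the exact frequencies are easy to check in plain Python. The sketch below assumes "frequent" means an item appears in more than a `support` fraction of the rows; the helper `exact_frequent_items` is hypothetical and not part of PySpark.

```python
from collections import Counter

def exact_frequent_items(values, support):
    """Return items whose share of rows is greater than `support` (exact count)."""
    counts = Counter(values)
    total = len(values)
    return sorted(item for item, n in counts.items() if n / total > support)

# Name and Country columns of the sample DataFrame
names = ["Aamir", "Ali", "Bob", "Lisa", "Ali"]
countries = ["Pakistan", "USA", "UK", "Canada", "USA"]

frequent_names = exact_frequent_items(names, support=0.3)
frequent_countries = exact_frequent_items(countries, support=0.3)
print(frequent_names, frequent_countries)
```

Ali and USA each appear in 2 of 5 rows (40%), which is above the 30% threshold; every other value appears in only 20% of the rows.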

5. Covariance – Score & Bonus

# Add bonus column for demonstration
df2 = df.withColumn("Bonus", (df["Score"] * 0.1))

# Covariance between Score and Bonus
cov_val = df2.stat.cov("Score", "Bonus")
print(f"Covariance between Score and Bonus: {cov_val}")

Output:

Covariance between Score and Bonus: 10.45

The exact digits printed may differ slightly because of floating-point rounding.
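
The covariance can be checked by hand in plain Python. The sketch below assumes the sample covariance formula with an n - 1 denominator, which is what `cov` computes; `sample_cov` and `cov_check` are names written for this illustration only. Because Bonus is Score * 0.1, the result is 0.1 times the sample variance of Score (104.5), or approximately 10.45.

```python
# Score column of the sample DataFrame, and Bonus derived the same way as df2
scores = [85, 78, 85, 92, 65]
bonus = [s * 0.1 for s in scores]

def sample_cov(xs, ys):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

cov_check = sample_cov(scores, bonus)
print(round(cov_check, 6))
```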

6. Correlation – Score & Bonus

corr_val = df2.stat.corr("Score", "Bonus")
print(f"Correlation between Score and Bonus: {corr_val}")

Output:

Correlation between Score and Bonus: 0.9999999999999998
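
The correlation is effectively 1 because Bonus is an exact linear function of Score; the tiny gap from 1.0 is floating-point rounding. A plain-Python sketch of the Pearson coefficient (the method `corr` uses) confirms this. The helper `pearson_corr` and the variable `corr_check` are introduced here for illustration and are not PySpark APIs.

```python
import math

# Score column of the sample DataFrame, and Bonus derived the same way as df2
scores = [85, 78, 85, 92, 65]
bonus = [s * 0.1 for s in scores]

def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator

corr_check = pearson_corr(scores, bonus)
print(round(corr_check, 6))
```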
