PySpark stat() Function Tutorial – Perform Statistical Analysis

Introduction

In PySpark, stat is a property of a DataFrame rather than a method, so it is written df.stat without parentheses. It returns a DataFrameStatFunctions object that provides statistical methods including crosstab, freqItems, cov, and corr. This tutorial shows how to use these methods to perform basic statistical analysis directly on Spark DataFrames.

1. Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("StatFunctionDemo") \
    .getOrCreate()

2. Sample DataFrame

data = [
    ("Aamir", "Pakistan", 85),
    ("Ali", "USA", 78),
    ("Bob", "UK", 85),
    ("Lisa", "Canada", 92),
    ("Ali", "USA", 65)
]

columns = ["Name", "Country", "Score"]
df = spark.createDataFrame(data, columns)
df.show()

Output:

+-----+--------+-----+
| Name| Country|Score|
+-----+--------+-----+
|Aamir|Pakistan|   85|
|  Ali|     USA|   78|
|  Bob|      UK|   85|
| Lisa|  Canada|   92|
|  Ali|     USA|   65|
+-----+--------+-----+

3. Crosstab – Name vs Country

# Count how many rows contain each (Name, Country) combination
df.stat.crosstab("Name", "Country").show()

Output:

+------------+------+--------+---+---+
|Name_Country|Canada|Pakistan| UK|USA|
+------------+------+--------+---+---+
|        Lisa|     1|       0|  0|  0|
|         Ali|     0|       0|  0|  2|
|       Aamir|     0|       1|  0|  0|
|         Bob|     0|       0|  1|  0|
+------------+------+--------+---+---+
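
To make the counts in the table concrete, here is a minimal plain-Python sketch of the same contingency count, with no Spark involved. The names rows, pair_counts, and table are illustrative only. The real crosstab runs on the cluster and returns a DataFrame, and its row order is not guaranteed.

```python
from collections import Counter

# Same (Name, Country) pairs as the sample DataFrame.
rows = [
    ("Aamir", "Pakistan"),
    ("Ali", "USA"),
    ("Bob", "UK"),
    ("Lisa", "Canada"),
    ("Ali", "USA"),
]

# Count each (Name, Country) pair; a Counter returns 0 for unseen pairs.
pair_counts = Counter(rows)

names = sorted({name for name, _ in rows})
countries = sorted({country for _, country in rows})

# One entry per name, holding one count per country (in sorted order).
table = {
    name: [pair_counts[(name, country)] for country in countries]
    for name in names
}

print(countries)
for name, counts in table.items():
    print(name, counts)
```

Each row holds a single non-zero entry because every name in the sample belongs to exactly one country.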

4. Frequent Items in Name and Country

# support=0.3: look for items that appear in at least 30% of the rows
df.stat.freqItems(["Name", "Country"], support=0.3).show(truncate=False)

Output:

+------------------+------------------+
|Name_freqItems    |Country_freqItems |
+------------------+------------------+
|[Ali]             |[USA]             |
+------------------+------------------+
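
Note that freqItems uses an approximate algorithm and, per the Spark documentation, may return false positives, so the arrays on your machine can contain extra items. As a cross-check, the following plain-Python sketch computes the exact frequent items for the same data. The helper exact_frequent_items is a hypothetical name for illustration, not part of PySpark.

```python
from collections import Counter

def exact_frequent_items(values, support):
    """Return the values whose relative frequency is at least `support`."""
    counts = Counter(values)
    threshold = support * len(values)  # minimum number of occurrences
    return sorted(v for v, c in counts.items() if c >= threshold)

names = ["Aamir", "Ali", "Bob", "Lisa", "Ali"]
countries = ["Pakistan", "USA", "UK", "Canada", "USA"]

frequent_names = exact_frequent_items(names, 0.3)
frequent_countries = exact_frequent_items(countries, 0.3)

print(frequent_names)      # ['Ali']  (2 of 5 rows = 40%)
print(frequent_countries)  # ['USA']  (2 of 5 rows = 40%)
```

Every other value appears in only 1 of 5 rows (20%), which is below the 30% support threshold.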

5. Covariance – Score & Bonus

# Add bonus column for demonstration
df2 = df.withColumn("Bonus", (df["Score"] * 0.1))

# Covariance between Score and Bonus
cov_val = df2.stat.cov("Score", "Bonus")
print(f"Covariance between Score and Bonus: {cov_val}")

Output:

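Covariance between Score and Bonus: 10.45

The value is approximate; the trailing floating-point digits may vary. cov returns the sample covariance. Because Bonus is 0.1 × Score and the sample variance of Score is 104.5, the result is 0.1 × 104.5 = 10.45.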
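
Since cov computes the sample covariance (dividing by n - 1), the result can be checked by hand. The sketch below does this in plain Python for the five scores. The function sample_covariance is an illustrative helper, not a PySpark API.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)

scores = [85, 78, 85, 92, 65]
bonuses = [s * 0.1 for s in scores]  # same rule as the Bonus column

cov_check = sample_covariance(scores, bonuses)
print(f"Hand-computed covariance: {cov_check}")
```

Comparing this number with the value Spark prints is a quick way to confirm that the DataFrame holds the data you expect.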

6. Correlation – Score & Bonus

corr_val = df2.stat.corr("Score", "Bonus")
print(f"Correlation between Score and Bonus: {corr_val}")

Output:

Correlation between Score and Bonus: 0.9999999999999998
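
The correlation is 1 up to floating-point rounding because Bonus is an exact positive multiple of Score. The plain-Python sketch below computes the Pearson coefficient by hand to show this (Pearson is the method corr supports). The function pearson is an illustrative helper, not a PySpark API.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    numerator = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    denominator = math.sqrt(sum(dx * dx for dx in dev_x)) * math.sqrt(
        sum(dy * dy for dy in dev_y)
    )
    return numerator / denominator

scores = [85, 78, 85, 92, 65]
bonuses = [s * 0.1 for s in scores]  # perfectly linear in Score

corr_check = pearson(scores, bonuses)
print(f"Hand-computed correlation: {corr_check}")
```

A value this close to 1 carries no new information here; on real data, pick two columns that are not derived from one another.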
