PySpark DataFrame.stat Tutorial – Perform Statistical Analysis
Introduction
In PySpark, the stat property of a DataFrame (df.stat) provides access to a range of statistical methods, including crosstab, freqItems, cov, and corr. This tutorial shows how to use these tools to perform basic statistical analysis directly on Spark DataFrames.
1. Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("StatFunctionDemo") \
    .getOrCreate()
2. Sample DataFrame
data = [
    ("Aamir", "Pakistan", 85),
    ("Ali", "USA", 78),
    ("Bob", "UK", 85),
    ("Lisa", "Canada", 92),
    ("Ali", "USA", 65)
]
columns = ["Name", "Country", "Score"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-----+--------+-----+
| Name| Country|Score|
+-----+--------+-----+
|Aamir|Pakistan|   85|
|  Ali|     USA|   78|
|  Bob|      UK|   85|
| Lisa|  Canada|   92|
|  Ali|     USA|   65|
+-----+--------+-----+
3. Crosstab – Name vs Country
df.stat.crosstab("Name", "Country").show()
Output:
+------------+------+--------+---+---+
|Name_Country|Canada|Pakistan| UK|USA|
+------------+------+--------+---+---+
|        Lisa|     1|       0|  0|  0|
|         Ali|     0|       0|  0|  2|
|       Aamir|     0|       1|  0|  0|
|         Bob|     0|       0|  1|  0|
+------------+------+--------+---+---+
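For intuition, crosstab builds a contingency table: it counts how often each (Name, Country) pair occurs. The plain-Python sketch below reproduces those counts from the same sample rows without Spark. The variable names are illustrative only, and this is not a description of how Spark computes the result internally.

```python
from collections import Counter

# Same sample rows as the DataFrame above
data = [
    ("Aamir", "Pakistan", 85),
    ("Ali", "USA", 78),
    ("Bob", "UK", 85),
    ("Lisa", "Canada", 92),
    ("Ali", "USA", 65),
]

# Count each (Name, Country) pair; a Counter returns 0 for pairs that never occur
pair_counts = Counter((name, country) for name, country, _ in data)

names = sorted({name for name, _, _ in data})
countries = sorted({country for _, country, _ in data})

# Print one row per name and one column per country
print("Name_Country", *countries)
for name in names:
    print(name, *(pair_counts[(name, c)] for c in countries))
```

Each distinct value of the second column becomes its own output column, so crosstab is best suited to columns with a modest number of distinct values.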
4. Frequent Items in Name and Country
df.stat.freqItems(["Name", "Country"], support=0.3).show(truncate=False)
Output:
+------------------+------------------+
|Name_freqItems |Country_freqItems |
+------------------+------------------+
|[Ali] |[USA] |
+------------------+------------------+
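freqItems relies on an approximate counting algorithm, so according to the PySpark documentation its result can contain false positives, that is, items that occur less often than the support threshold. On a dataset this small an exact check is cheap. The sketch below, in plain Python with a hypothetical helper named exact_frequent_items, lists the values whose share of the rows is at least 0.3:

```python
from collections import Counter

# Column values copied from the sample DataFrame
names = ["Aamir", "Ali", "Bob", "Lisa", "Ali"]
countries = ["Pakistan", "USA", "UK", "Canada", "USA"]

def exact_frequent_items(values, support):
    """Return the values whose share of all rows is at least `support`."""
    counts = Counter(values)
    total = len(values)
    return sorted(v for v, n in counts.items() if n / total >= support)

print(exact_frequent_items(names, 0.3))      # Ali appears in 2 of 5 rows (0.4)
print(exact_frequent_items(countries, 0.3))  # USA appears in 2 of 5 rows (0.4)
```

If the Spark result lists extra items, comparing it against an exact count like this shows which ones are false positives.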
5. Covariance – Score & Bonus
# Add bonus column for demonstration
df2 = df.withColumn("Bonus", (df["Score"] * 0.1))
# Covariance between Score and Bonus
cov_val = df2.stat.cov("Score", "Bonus")
print(f"Covariance between Score and Bonus: {cov_val}")
Output (the trailing digits may differ slightly because of floating-point rounding):
Covariance between Score and Bonus: 10.45
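cov returns the sample covariance, which divides by n - 1 rather than n. As a cross-check that does not depend on Spark, the same quantity can be computed by hand for the five scores. The helper name sample_covariance is an illustrative assumption, not part of PySpark.

```python
scores = [85, 78, 85, 92, 65]
bonuses = [s * 0.1 for s in scores]  # same rule as the Bonus column

def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

cov_check = sample_covariance(scores, bonuses)
print(round(cov_check, 4))  # about 10.45
```

Because Bonus is 0.1 times Score, the covariance equals 0.1 times the sample variance of Score, which is 104.5 for these five values.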
6. Correlation – Score & Bonus
corr_val = df2.stat.corr("Score", "Bonus")
print(f"Correlation between Score and Bonus: {corr_val}")
Output:
Correlation between Score and Bonus: 0.9999999999999998
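A correlation this close to 1 is expected, since Bonus is an exact linear function of Score; the small gap from 1.0 is floating-point rounding. corr computes the Pearson coefficient, which can be recomputed in plain Python as a sanity check. The helper name pearson below is hypothetical, not a PySpark function.

```python
import math

scores = [85, 78, 85, 92, 65]
bonuses = [s * 0.1 for s in scores]  # same rule as the Bonus column

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

r = pearson(scores, bonuses)
print(math.isclose(r, 1.0))  # True
```

For a more informative result, correlate Score with a column that is not derived from it.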