PySpark DataFrame.stat Tutorial – Perform Statistical Analysis
Introduction
In PySpark, the stat attribute of a DataFrame (written df.stat, without parentheses) returns a DataFrameStatFunctions object that exposes statistical methods including crosstab, freqItems, cov, and corr. This tutorial shows how to use these methods to perform basic statistical analysis directly on Spark DataFrames.
1. Create Spark Session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("StatFunctionDemo") \
    .getOrCreate()
2. Sample DataFrame
data = [
    ("Aamir", "Pakistan", 85),
    ("Ali", "USA", 78),
    ("Bob", "UK", 85),
    ("Lisa", "Canada", 92),
    ("Ali", "USA", 65),
]
columns = ["Name", "Country", "Score"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-----+--------+-----+
| Name| Country|Score|
+-----+--------+-----+
|Aamir|Pakistan|   85|
|  Ali|     USA|   78|
|  Bob|      UK|   85|
| Lisa|  Canada|   92|
|  Ali|     USA|   65|
+-----+--------+-----+
3. Crosstab – Name vs Country
df.stat.crosstab("Name", "Country").show()
Output:
+------------+------+--------+---+---+
|Name_Country|Canada|Pakistan| UK|USA|
+------------+------+--------+---+---+
|        Lisa|     1|       0|  0|  0|
|         Ali|     0|       0|  0|  2|
|       Aamir|     0|       1|  0|  0|
|         Bob|     0|       0|  1|  0|
+------------+------+--------+---+---+
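Each cell in the crosstab is the number of rows that have that (Name, Country) pair. The same counting can be sketched in plain Python. This is only an illustration of what the table means, not how Spark computes it, and the names rows, pair_counts, and table are made up for this example:

```python
from collections import Counter

# Same (Name, Country) pairs as the sample DataFrame above.
rows = [
    ("Aamir", "Pakistan"),
    ("Ali", "USA"),
    ("Bob", "UK"),
    ("Lisa", "Canada"),
    ("Ali", "USA"),
]

# Count how often each (Name, Country) pair occurs.
pair_counts = Counter(rows)

names = sorted({name for name, _ in rows})
countries = sorted({country for _, country in rows})

# Build the contingency table as {name: {country: count}}.
# Pairs that never occur get 0, as in the crosstab output.
table = {
    name: {country: pair_counts[(name, country)] for country in countries}
    for name in names
}

for name in names:
    print(name, table[name])
```

Ali appears twice with USA, which is why that cell is 2 while every other non-zero cell is 1.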
4. Frequent Items in Name and Country
df.stat.freqItems(["Name", "Country"], support=0.3).show(truncate=False)
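Output (freqItems uses an approximate algorithm, so the arrays can also contain items that fall below the support threshold):
+--------------+-----------------+
|Name_freqItems|Country_freqItems|
+--------------+-----------------+
|[Ali]         |[USA]            |
+--------------+-----------------+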
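Because freqItems is approximate and may return false positives, it can help to compare against an exact count on a small sample. The sketch below does that in plain Python for the same five rows. The helper exact_frequent_items is hypothetical (it is not part of PySpark), and treating support as a strict lower bound on relative frequency is an assumption made for this example:

```python
from collections import Counter

# Same Name and Country values as the sample DataFrame above.
names = ["Aamir", "Ali", "Bob", "Lisa", "Ali"]
countries = ["Pakistan", "USA", "UK", "Canada", "USA"]
support = 0.3  # same threshold as the freqItems call above


def exact_frequent_items(values, min_support):
    """Return the values whose relative frequency exceeds min_support."""
    counts = Counter(values)
    total = len(values)
    return sorted(v for v, c in counts.items() if c / total > min_support)


frequent_names = exact_frequent_items(names, support)
frequent_countries = exact_frequent_items(countries, support)
print(frequent_names, frequent_countries)
```

Ali and USA each appear in 2 of 5 rows (40%), while every other value appears in 1 of 5 (20%), so only those two clear the 30% threshold in the exact count.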
5. Covariance – Score & Bonus
# Add bonus column for demonstration
df2 = df.withColumn("Bonus", (df["Score"] * 0.1))
# Covariance between Score and Bonus
cov_val = df2.stat.cov("Score", "Bonus")
print(f"Covariance between Score and Bonus: {cov_val}")
Output (trailing digits may differ slightly because of floating-point rounding):
Covariance between Score and Bonus: 10.45
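cov returns the sample covariance, which divides by n - 1 rather than n. For these five scores the mean is 81, the squared deviations sum to 418, the sample variance of Score is 418 / 4 = 104.5, and since Bonus is 0.1 times Score the covariance is 0.1 × 104.5 = 10.45. A minimal plain-Python check of that arithmetic follows; sample_covariance and cov_check are names invented for this sketch, not PySpark APIs:

```python
scores = [85, 78, 85, 92, 65]
bonuses = [s * 0.1 for s in scores]  # same Bonus definition as above


def sample_covariance(xs, ys):
    """Sum of products of deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


cov_check = sample_covariance(scores, bonuses)
print(cov_check)  # approximately 10.45, up to floating-point rounding
```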
6. Correlation – Score & Bonus
corr_val = df2.stat.corr("Score", "Bonus")
print(f"Correlation between Score and Bonus: {corr_val}")
Output:
Correlation between Score and Bonus: 0.9999999999999998
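A value this close to 1 is expected: corr computes the Pearson correlation coefficient, and Bonus is an exact positive multiple of Score, so the two columns are perfectly linearly related. The small gap from 1.0 is floating-point rounding. The plain-Python sketch below recomputes the coefficient for the same data; pearson_corr and corr_check are names chosen for this illustration only:

```python
import math

scores = [85, 78, 85, 92, 65]
bonuses = [s * 0.1 for s in scores]  # same Bonus definition as above


def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


corr_check = pearson_corr(scores, bonuses)
print(corr_check)  # very close to 1.0
```

Because a perfectly linear column like Bonus always gives a correlation of 1, a more informative test would use two independently measured columns.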


