PySpark stat() Function Tutorial – Perform Statistical Analysis
Introduction
In PySpark, a DataFrame's stat attribute (accessed as df.stat, without parentheses) provides access to a range of statistical methods, including crosstab, freqItems, cov, and corr. This tutorial shows how to use these tools to perform basic statistical analysis directly on Spark DataFrames.
1. Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("StatFunctionDemo") \
    .getOrCreate()
2. Sample DataFrame
data = [
    ("Aamir", "Pakistan", 85),
    ("Ali", "USA", 78),
    ("Bob", "UK", 85),
    ("Lisa", "Canada", 92),
    ("Ali", "USA", 65)
]
columns = ["Name", "Country", "Score"]
df = spark.createDataFrame(data, columns)
df.show()
Output:
+-----+--------+-----+
| Name| Country|Score|
+-----+--------+-----+
|Aamir|Pakistan|   85|
|  Ali|     USA|   78|
|  Bob|      UK|   85|
| Lisa|  Canada|   92|
|  Ali|     USA|   65|
+-----+--------+-----+
3. Crosstab – Name vs Country
df.stat.crosstab("Name", "Country").show()
Output:
+------------+------+--------+---+---+
|Name_Country|Canada|Pakistan| UK|USA|
+------------+------+--------+---+---+
|        Lisa|     1|       0|  0|  0|
|         Ali|     0|       0|  0|  2|
|       Aamir|     0|       1|  0|  0|
|         Bob|     0|       0|  1|  0|
+------------+------+--------+---+---+
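To make clear what the crosstab counts represent, here is a minimal plain-Python sketch that builds the same contingency table from the sample rows. It does not use Spark, and the names pair_counts and table are illustrative only, not part of the PySpark API.

```python
from collections import Counter

# Same rows as the tutorial's sample DataFrame.
data = [
    ("Aamir", "Pakistan", 85),
    ("Ali", "USA", 78),
    ("Bob", "UK", 85),
    ("Lisa", "Canada", 92),
    ("Ali", "USA", 65),
]

# Count how often each (Name, Country) pair occurs.
pair_counts = Counter((name, country) for name, country, _ in data)

names = sorted({name for name, _, _ in data})
countries = sorted({country for _, country, _ in data})

# table[name][country] is the pair count, or 0 if the pair never occurs.
table = {n: {c: pair_counts[(n, c)] for c in countries} for n in names}

for n in names:
    print(n, [table[n][c] for c in countries])
```

Each cell is simply the number of rows in which that Name and Country appear together, which is why Ali and USA share a count of 2.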
4. Frequent Items in Name and Country
df.stat.freqItems(["Name", "Country"], support=0.3).show(truncate=False)
Output:
+--------------+-----------------+
|Name_freqItems|Country_freqItems|
+--------------+-----------------+
|[Ali]         |[USA]            |
+--------------+-----------------+
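Note that freqItems relies on an approximate algorithm and may return false positives, so the arrays Spark prints can contain extra items. As a cross-check, the following plain-Python sketch computes the exact set of values whose relative frequency reaches the support threshold. The helper exact_frequent_items is hypothetical and not part of PySpark, and treating the threshold as inclusive (>=) is an assumption.

```python
from collections import Counter

# Name and Country values from the sample DataFrame.
names = ["Aamir", "Ali", "Bob", "Lisa", "Ali"]
countries = ["Pakistan", "USA", "UK", "Canada", "USA"]

def exact_frequent_items(values, support):
    """Return values whose share of rows is at least `support` (exact count)."""
    counts = Counter(values)
    threshold = support * len(values)
    return sorted(v for v, c in counts.items() if c >= threshold)

print(exact_frequent_items(names, 0.3))
print(exact_frequent_items(countries, 0.3))
```

With five rows and support=0.3, a value must appear at least twice to qualify. Only Ali and USA do (2 of 5 rows, or 40%).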
5. Covariance – Score & Bonus
# Add bonus column for demonstration
df2 = df.withColumn("Bonus", (df["Score"] * 0.1))
# Covariance between Score and Bonus
cov_val = df2.stat.cov("Score", "Bonus")
print(f"Covariance between Score and Bonus: {cov_val}")
Output (sample covariance; trailing floating-point digits may vary):
Covariance between Score and Bonus: 10.45
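cov returns the sample covariance, which divides by n - 1. The plain-Python sketch below recomputes it by hand for the sample data. The helper sample_covariance is illustrative only and not a PySpark function. Because Bonus is 0.1 × Score, the result equals 0.1 times the sample variance of Score, that is 0.1 × 104.5 = 10.45.

```python
# Score values from the sample DataFrame, and Bonus = Score * 0.1.
scores = [85, 78, 85, 92, 65]
bonuses = [s * 0.1 for s in scores]

def sample_covariance(xs, ys):
    """Sample covariance: sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

cov_check = sample_covariance(scores, bonuses)
print(round(cov_check, 6))
```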
6. Correlation – Score & Bonus
corr_val = df2.stat.corr("Score", "Bonus")
print(f"Correlation between Score and Bonus: {corr_val}")
Output:
Correlation between Score and Bonus: 0.9999999999999998
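The correlation is 1 up to floating-point rounding because Bonus is an exact positive multiple of Score. corr computes the Pearson coefficient, currently the only method it supports. The plain-Python sketch below reproduces the calculation, and the helper pearson is illustrative only.

```python
import math

# Score values from the sample DataFrame, and Bonus = Score * 0.1.
scores = [85, 78, 85, 92, 65]
bonuses = [s * 0.1 for s in scores]

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dev_x = [x - mean_x for x in xs]
    dev_y = [y - mean_y for y in ys]
    numerator = sum(dx * dy for dx, dy in zip(dev_x, dev_y))
    denominator = math.sqrt(sum(dx * dx for dx in dev_x)) * math.sqrt(
        sum(dy * dy for dy in dev_y)
    )
    return numerator / denominator

corr_check = pearson(scores, bonuses)
print(corr_check)
```

Any column that is a positive linear function of another will give a correlation of 1, so this pairing is useful as a sanity check rather than as an insight into the data.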