Statistical Aggregations in PySpark
Learn how to use avg(), mean(), median(), and mode() in PySpark with practical examples.
📊 What You'll Learn
- How to compute the average (mean) using avg() and mean()
- How to calculate the median value in a DataFrame
- How to determine the mode (most frequent value)
📦 Sample Data
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# start or reuse a SparkSession (the app name here is arbitrary)
spark = SparkSession.builder.appName("StatisticalAggregations").getOrCreate()

data = [("Aamir", 500), ("Sara", 300), ("John", 700),
        ("Lina", 200), ("Aamir", 550), ("Sara", 650),
        ("John", 700), ("Lina", 250), ("John", 700)]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("sales", IntegerType(), True)
])

df = spark.createDataFrame(data, schema)
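For a quick sanity check, df.show() prints the nine sample rows:
df.show()
Output:
+-----+-----+
| name|sales|
+-----+-----+
|Aamir|  500|
| Sara|  300|
| John|  700|
| Lina|  200|
|Aamir|  550|
| Sara|  650|
| John|  700|
| Lina|  250|
| John|  700|
+-----+-----+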
📈 Example: avg() & mean()
from pyspark.sql.functions import avg, mean

df_avg = df.agg(avg("sales").alias("average_sales"))
df_avg.show()
Output:
+------------------+
|     average_sales|
+------------------+
|505.55555555555554|
+------------------+
The nine sales values sum to 4550, so the average is 4550 / 9 ≈ 505.56.
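In PySpark, mean() is simply an alias for avg(), so swapping it in returns the same result:
df_mean = df.agg(mean("sales").alias("average_sales"))
df_mean.show()  # identical output to avg()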
📌 Example: median()
# approxQuantile with a relative error of 0.0 computes the exact quantile;
# 0.5 requests the 50th percentile, i.e. the median
median = df.approxQuantile("sales", [0.5], 0.0)[0]
print("Median:", median)
Output:
Median: 550.0
Sorted, the sales values are 200, 250, 300, 500, 550, 650, 700, 700, 700, so the middle (5th of 9) value is 550.
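If you are on Spark 3.4 or later, pyspark.sql.functions also ships a built-in median() aggregate, so the approxQuantile workaround is no longer necessary (a minimal sketch, assuming a 3.4+ runtime):
from pyspark.sql.functions import median

# median() is an aggregate function introduced in Spark 3.4
df.agg(median("sales").alias("median_sales")).show()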
📌 Example: mode()
# count how often each sales value occurs, most frequent first
mode_df = df.groupBy("sales").count().orderBy("count", ascending=False)
# the top row holds the most frequent value (700 appears three times);
# note that with ties, whichever tied value sorts first is returned
mode = mode_df.first()["sales"]
print("Mode:", mode)
Output:
Mode: 700
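As with median(), Spark 3.4+ provides a built-in mode() aggregate that replaces the groupBy/count pattern above (again assuming a 3.4+ runtime):
from pyspark.sql.functions import mode

# mode() returns the most frequent value in the column (Spark 3.4+)
df.agg(mode("sales").alias("mode_sales")).show()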