Statistical Aggregations in PySpark
Learn how to use avg(), mean(), median(), and mode() in PySpark with practical examples.
📊 What You'll Learn
- How to compute the average (mean) using avg() and mean()
- How to calculate the median value in a DataFrame
- How to determine the mode (most frequent value)
📦 Sample Data
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# start or reuse a SparkSession (the app name here is arbitrary)
spark = SparkSession.builder.appName("StatisticalAggregations").getOrCreate()

data = [("Aamir", 500), ("Sara", 300), ("John", 700),
        ("Lina", 200), ("Aamir", 550), ("Sara", 650),
        ("John", 700), ("Lina", 250), ("John", 700)]

schema = StructType([
    StructField("name", StringType(), True),
    StructField("sales", IntegerType(), True)
])

df = spark.createDataFrame(data, schema)
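For a quick sanity check, df.show() prints the nine sample rows:
df.show()
Output:
+-----+-----+
| name|sales|
+-----+-----+
|Aamir|  500|
| Sara|  300|
| John|  700|
| Lina|  200|
|Aamir|  550|
| Sara|  650|
| John|  700|
| Lina|  250|
| John|  700|
+-----+-----+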
📈 Example: avg() & mean()
from pyspark.sql.functions import avg, mean

df_avg = df.agg(avg("sales").alias("average_sales"))
df_avg.show()
Output:
+------------------+
|     average_sales|
+------------------+
|505.55555555555554|
+------------------+
The nine sales values sum to 4550, so the average is 4550 / 9 ≈ 505.56.
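In PySpark, mean() is simply an alias for avg(), so swapping it in returns the same result:
df_mean = df.agg(mean("sales").alias("average_sales"))
df_mean.show()  # identical output to avg()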
📌 Example: median()
# approxQuantile with a relative error of 0.0 computes the exact quantile;
# 0.5 requests the 50th percentile, i.e. the median
median = df.approxQuantile("sales", [0.5], 0.0)[0]
print("Median:", median)
Output:
Median: 550.0
Sorted, the sales values are 200, 250, 300, 500, 550, 650, 700, 700, 700, so the middle (5th of 9) value is 550.
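If you are on Spark 3.4 or later, pyspark.sql.functions also ships a built-in median() aggregate, so the approxQuantile workaround is no longer necessary (a minimal sketch, assuming a 3.4+ runtime):
from pyspark.sql.functions import median

# median() is an aggregate function introduced in Spark 3.4
df.agg(median("sales").alias("median_sales")).show()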
📌 Example: mode()
# count how often each sales value occurs, most frequent first
mode_df = df.groupBy("sales").count().orderBy("count", ascending=False)
# the top row holds the most frequent value (700 appears three times);
# note that with ties, whichever tied value sorts first is returned
mode = mode_df.first()["sales"]
print("Mode:", mode)
Output:
Mode: 700
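As with median(), Spark 3.4+ provides a built-in mode() aggregate that replaces the groupBy/count pattern above (again assuming a 3.4+ runtime):
from pyspark.sql.functions import mode

# mode() returns the most frequent value in the column (Spark 3.4+)
df.agg(mode("sales").alias("mode_sales")).show()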