Statistical Aggregations in PySpark | avg(), mean(), median(), mode()

Learn how to use avg(), mean(), median(), and mode() in PySpark with practical examples.

📊 What You'll Learn

  • How to compute the average (mean) using avg() and mean()
  • How to calculate the median value in a DataFrame
  • How to determine the mode (most frequent value)

📦 Sample Data


data = [("Aamir", 500), ("Sara", 300), ("John", 700),
        ("Lina", 200), ("Aamir", 550), ("Sara", 650),
        ("John", 700), ("Lina", 250), ("John", 700)]
schema = StructType([
    StructField("name", StringType(), True),
    StructField("sales", IntegerType(), True)
])
df = spark.createDataFrame(data, schema)
    
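A quick df.show() confirms the nine rows that the examples below will aggregate:

df.show()

+-----+-----+
| name|sales|
+-----+-----+
|Aamir|  500|
| Sara|  300|
| John|  700|
| Lina|  200|
|Aamir|  550|
| Sara|  650|
| John|  700|
| Lina|  250|
| John|  700|
+-----+-----+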

📈 Example: avg() & mean()


from pyspark.sql.functions import avg

# Aggregate the whole DataFrame down to a single average value
df_avg = df.agg(avg("sales").alias("average_sales"))
df_avg.show()
Output:

+------------------+
|     average_sales|
+------------------+
|505.55555555555554|
+------------------+
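In pyspark.sql.functions, mean() is an alias for avg(), so either name gives the same result. The sketch below also shows a per-group average, which is how these aggregates are most often used:

from pyspark.sql.functions import avg, mean

# mean() is an alias for avg() and returns the identical result
df.agg(mean("sales").alias("mean_sales")).show()

# The same aggregate works per group
df.groupBy("name").agg(avg("sales").alias("avg_sales")).show()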

📌 Example: median()


# With relativeError=0.0, approxQuantile computes the exact quantile,
# so asking for the 0.5 quantile returns the exact median
median_value = df.approxQuantile("sales", [0.5], 0.0)[0]
print("Median:", median_value)
Output:

Median: 550.0
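On Spark 3.4 and later there is also a built-in median() aggregate function, so the same result can be computed without approxQuantile():

from pyspark.sql.functions import median  # requires Spark 3.4+

df.agg(median("sales").alias("median_sales")).show()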

📌 Example: mode()


# Count how often each value occurs, then take the most frequent one.
# Note: if several values tie for the highest count, first() returns
# one of them arbitrarily.
mode_df = df.groupBy("sales").count().orderBy("count", ascending=False)
mode_value = mode_df.first()["sales"]
print("Mode:", mode_value)
Output:

Mode: 700
    
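On Spark 3.4 and later, a built-in mode() aggregate function is available as well:

from pyspark.sql.functions import mode  # requires Spark 3.4+

df.agg(mode("sales").alias("mode_sales")).show()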

