How to Use approxQuantile() in PySpark

Quick Guide to Percentiles & Median

The approxQuantile() method in PySpark estimates percentiles and medians quickly and efficiently. It is especially useful on large datasets, where computing exact quantiles requires an expensive sort; the approximate version trades a controllable amount of accuracy for speed.
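
For reference, the method takes three arguments and returns plain Python floats on the driver. A minimal summary of the signature, as documented in the PySpark DataFrame API:

# df.approxQuantile(col, probabilities, relativeError)
#   col           - column name (str) or list of column names
#   probabilities - list of floats in [0, 1], e.g. [0.25, 0.5, 0.75]
#   relativeError - non-negative float; 0 computes exact quantiles
# Returns a list of floats, or a list of lists when col is a list.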

1. Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("approxQuantile Example") \
    .getOrCreate()

2. Create Sample DataFrame

data = [
    (1, "Aamir Shahzad", 35),
    (2, "Ali Raza", 30),
    (3, "Bob", 25),
    (4, "Lisa", 28),
    (5, "John", 40),
    (6, "Sara", 50)
]

columns = ["id", "name", "age"]

df = spark.createDataFrame(data, columns)
df.show()
+---+--------------+---+
| id|          name|age|
+---+--------------+---+
|  1| Aamir Shahzad| 35|
|  2|      Ali Raza| 30|
|  3|           Bob| 25|
|  4|          Lisa| 28|
|  5|          John| 40|
|  6|          Sara| 50|
+---+--------------+---+

3. Use approxQuantile()

Example 1: Median (50th percentile)

median_age = df.approxQuantile("age", [0.5], 0.01)
print("Median Age:", median_age)
Median Age: [30.0]
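
If you need the median as a column inside a query rather than a Python list on the driver, Spark also offers the SQL function percentile_approx (available since Spark 3.1). A minimal sketch, assuming the same df:

from pyspark.sql import functions as F

# percentile_approx returns the result as a Column expression
df.select(F.percentile_approx("age", 0.5).alias("median_age")).show()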

Example 2: 25th, 50th, and 75th Percentiles

quantiles = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.01)
print("25th, 50th, and 75th Percentiles:", quantiles)
25th, 50th, and 75th Percentiles: [28.0, 30.0, 40.0]

Example 3: Min, Median, Max

min_median_max = df.approxQuantile("age", [0.0, 0.5, 1.0], 0.01)
print("Min, Median, and Max Age:", min_median_max)
Min, Median, and Max Age: [25.0, 30.0, 50.0]
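
Example 4: Multiple Columns at Once

approxQuantile() also accepts a list of column names and returns one list of quantiles per column. A short sketch against the numeric columns of the sample DataFrame (the commented result is indicative, not verified output):

quantiles_multi = df.approxQuantile(["id", "age"], [0.5], 0.01)
print("Medians for id and age:", quantiles_multi)
# Expected shape: one list per column, e.g. [[3.0], [30.0]]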

4. Control Accuracy with relativeError

# Lower relativeError = more accurate but slower
# Higher relativeError = less accurate but faster

# Example: Set relativeError to 0.1 (faster but less accurate)
quantiles_fast = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.1)
print("Quantiles with higher relative error:", quantiles_fast)
Quantiles with higher relative error: [28.0, 30.0, 40.0]
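
Setting relativeError to 0 asks Spark to compute exact quantiles; this is precise but can be expensive on large datasets. A short sketch:

# relativeError = 0.0 computes exact quantiles (costly at scale)
quantiles_exact = df.approxQuantile("age", [0.25, 0.5, 0.75], 0.0)
print("Exact quantiles:", quantiles_exact)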

📺 Watch the full tutorial video on YouTube

Author: Aamir Shahzad

© 2024 PySpark Tutorials. All rights reserved.
