Optimize Your Data with repartitionByRange() in PySpark

The repartitionByRange() function in PySpark repartitions a DataFrame by ranges of the specified column values. It samples the data to estimate range boundaries, then assigns each record to a partition so that rows with nearby values land together. This makes it especially useful before range filters, sorts, and sorted writes.

Step 1: Create Sample DataFrame

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("RepartitionByRangeTutorial").getOrCreate()

data = [
  (1, "Aamir Shahzad"),
  (2, "Ali Raza"),
  (3, "Bob"),
  (4, "Lisa"),
  (5, "Ali Raza"),
  (6, "Aamir Shahzad"),
  (7, "Lisa"),
  (8, "Bob"),
  (9, "Aamir Shahzad"),
  (10, "Ali Raza")
]

df = spark.createDataFrame(data, ["id", "name"])

print("📌 Original DataFrame:")
df.show()

Step 2: Check Original Number of Partitions

original_partitions = df.rdd.getNumPartitions()
print(f"📊 Original Number of Partitions: {original_partitions}")

Step 3: Repartition by Range on 'id'

df_repartitioned = df.repartitionByRange(3, "id")

Step 4: Check Number of Partitions After Repartitioning

new_partitions = df_repartitioned.rdd.getNumPartitions()
print(f"📊 Number of Partitions after repartitionByRange: {new_partitions}")

Step 5: Add Partition Index Column to Inspect Distribution

from pyspark.sql.functions import spark_partition_id

df_with_partition_info = df_repartitioned.withColumn("partition_id", spark_partition_id())

print("📌 Partitioned Data Preview (Range Partitioned on id):")
df_with_partition_info.orderBy("id").show(truncate=False)

Summary

  • repartitionByRange() is ideal for range-based partitioning
  • Helps optimize performance for sorting, joins, and writes
  • Use sampling to estimate partition boundaries when needed
  • Useful in scenarios where hash repartitioning isn't efficient

📺 Watch the Full Tutorial
