Optimize Your Data with repartitionByRange() in PySpark

The repartitionByRange() function in PySpark repartitions a DataFrame into range-based partitions: rows are assigned to partitions according to the values of the specified columns, with partition boundaries estimated by sampling the data. This tends to produce a more even, ordered distribution than hash partitioning, which can improve performance for downstream sorts, joins, and writes.

Step 1: Create Sample DataFrame

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession so the DataFrame below can be built
spark = SparkSession.builder.getOrCreate()

data = [
  (1, "Aamir Shahzad"),
  (2, "Ali Raza"),
  (3, "Bob"),
  (4, "Lisa"),
  (5, "Ali Raza"),
  (6, "Aamir Shahzad"),
  (7, "Lisa"),
  (8, "Bob"),
  (9, "Aamir Shahzad"),
  (10, "Ali Raza")
]

df = spark.createDataFrame(data, ["id", "name"])

print("📌 Original DataFrame:")
df.show()

Step 2: Check Original Number of Partitions

original_partitions = df.rdd.getNumPartitions()
print(f"📊 Original Number of Partitions: {original_partitions}")

Step 3: Repartition by Range on 'id'

df_repartitioned = df.repartitionByRange(3, "id")
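repartitionByRange() also accepts more than one column, and the partition count can be omitted, in which case Spark falls back to the spark.sql.shuffle.partitions setting. A minimal sketch of both variants (the column combinations here are just illustrative):

# Omit the partition count — Spark uses spark.sql.shuffle.partitions
df_default = df.repartitionByRange("id")

# Range-partition on multiple columns
df_multi = df.repartitionByRange(3, "name", "id")

print(df_default.rdd.getNumPartitions())
print(df_multi.rdd.getNumPartitions())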

Step 4: Check Number of Partitions After Repartitioning

new_partitions = df_repartitioned.rdd.getNumPartitions()
print(f"📊 Number of Partitions after repartitionByRange: {new_partitions}")

Step 5: Add Partition Index Column to Inspect Distribution

from pyspark.sql.functions import spark_partition_id

df_with_partition_info = df_repartitioned.withColumn("partition_id", spark_partition_id())

print("📌 Partitioned Data Preview (Range Partitioned on id):")
df_with_partition_info.orderBy("id").show(truncate=False)
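To see how many rows landed in each partition, rather than scanning individual rows, one simple check is to aggregate on the partition_id column; a minimal sketch:

# Count rows per partition to verify the range distribution
df_with_partition_info.groupBy("partition_id").count().orderBy("partition_id").show()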

Summary

  • repartitionByRange() is ideal for range-based partitioning
  • Helps optimize performance for sorting, joins, and writes
  • Spark samples the data to estimate partition boundaries, so partitions may not be perfectly even
  • Useful in scenarios where hash-based repartition() isn't efficient; see the comparison sketch below
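For comparison, here is a hedged sketch contrasting hash-based repartition() with repartitionByRange() on the same column; the partition count of 3 simply mirrors the example above:

from pyspark.sql.functions import spark_partition_id

# Hash-based repartitioning: rows with the same id hash land together,
# but ids are not grouped into contiguous ranges
df_hash = df.repartition(3, "id").withColumn("partition_id", spark_partition_id())

# Range-based repartitioning: each partition holds a contiguous range of ids
df_range = df.repartitionByRange(3, "id").withColumn("partition_id", spark_partition_id())

df_hash.orderBy("id").show()
df_range.orderBy("id").show()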
