Optimize Your Data with repartitionByRange() in PySpark
The repartitionByRange() function in PySpark performs range-based repartitioning of a DataFrame. It splits records into partitions by value ranges of the specified columns, so rows with similar values land in the same partition, and Spark samples the data to make those ranges roughly even in size. This can improve performance for sorting, range joins, and ordered writes.
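The snippets below assume a running SparkSession named spark, as you would have in a Databricks or Jupyter notebook. If you run them as a standalone script, you can create one first; the app name here is just an illustrative choice.

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the app name is arbitrary
spark = SparkSession.builder.appName("repartitionByRangeDemo").getOrCreate()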
Step 1: Create Sample DataFrame
data = [
(1, "Aamir Shahzad"),
(2, "Ali Raza"),
(3, "Bob"),
(4, "Lisa"),
(5, "Ali Raza"),
(6, "Aamir Shahzad"),
(7, "Lisa"),
(8, "Bob"),
(9, "Aamir Shahzad"),
(10, "Ali Raza")
]
df = spark.createDataFrame(data, ["id", "name"])
print("📌 Original DataFrame:")
df.show()
Step 2: Check Original Number of Partitions
original_partitions = df.rdd.getNumPartitions()
print(f"📊 Original Number of Partitions: {original_partitions}")
Step 3: Repartition by Range on 'id'
df_repartitioned = df.repartitionByRange(3, "id")
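repartitionByRange() also accepts multiple columns or Column expressions, which lets you control the range ordering. A couple of variations, shown purely as illustrations (these names are not used later in the walkthrough):

from pyspark.sql.functions import col

# Range-partition by name first, then id, into 3 partitions
df_by_name_id = df.repartitionByRange(3, "name", "id")

# Range-partition on descending id; ascending order is the default
df_desc = df.repartitionByRange(3, col("id").desc())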
Step 4: Check Number of Partitions After Repartitioning
new_partitions = df_repartitioned.rdd.getNumPartitions()
print(f"📊 Number of Partitions after repartitionByRange: {new_partitions}")
Step 5: Add Partition Index Column to Inspect Distribution
from pyspark.sql.functions import spark_partition_id
df_with_partition_info = df_repartitioned.withColumn("partition_id", spark_partition_id())
print("📌 Partitioned Data Preview (Range Partitioned on id):")
df_with_partition_info.orderBy("id").show(truncate=False)
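Staying in the DataFrame API, you can also summarize the distribution by grouping on the partition_id column added above:

# Row count per partition using the partition_id column
df_with_partition_info.groupBy("partition_id").count().orderBy("partition_id").show()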
Summary
- repartitionByRange() is ideal for range-based partitioning
- Helps optimize performance for sorting, joins, and writes
- Spark samples the data to estimate partition boundaries
- Useful in scenarios where hash repartitioning isn't efficient
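For comparison, a plain repartition() on the same column distributes rows by hash of id, so each partition contains scattered ids rather than a contiguous range. The sketch below contrasts the two and shows a typical write pattern after range partitioning; the output path is just a placeholder:

# Hash-based repartitioning: ids within a partition are not contiguous
df_hash = df.repartition(3, "id")

# Range-partition, sort within each partition, then write, so each output
# file covers a contiguous range of id values
(
    df.repartitionByRange(3, "id")
      .sortWithinPartitions("id")
      .write.mode("overwrite")
      .parquet("/tmp/repartition_by_range_demo")  # placeholder path
)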