PySpark coalesce() Function Tutorial - Optimize Partitioning for Faster Spark Jobs #pysparktutorial

This tutorial will help you understand how to use the coalesce() function in PySpark to reduce the number of partitions in your DataFrame and improve performance.

1. What is coalesce() in PySpark?

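  • coalesce(n) reduces the number of partitions in a DataFrame to at most n; it cannot increase the partition count.
  • It is preferred over repartition() when reducing partitions because it merges existing partitions and avoids a full data shuffle.
  • It is useful for avoiding many small output files, for example by reducing the partition count just before a write.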
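
To make the difference between merging partitions and shuffling rows concrete, here is a minimal pure-Python sketch. It is a simplified model, not Spark's actual implementation: the function names coalesce_model and repartition_model are hypothetical, and plain lists stand in for partitions.

```python
from typing import List, Tuple

Row = Tuple[int, str, int]


def coalesce_model(partitions: List[List[Row]], n: int) -> List[List[Row]]:
    """Merge whole neighbouring partitions into at most n groups.

    Simplified model (assumption): rows stay together with the rest of
    their original partition, so nothing is redistributed row by row.
    """
    n = max(1, min(n, len(partitions)))  # the count is never increased
    merged: List[List[Row]] = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Input partition i is appended to output group i * n // len(partitions).
        merged[i * n // len(partitions)].extend(part)
    return merged


def repartition_model(partitions: List[List[Row]], n: int) -> List[List[Row]]:
    """Redistribute every row individually (round-robin), like a full shuffle."""
    out: List[List[Row]] = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


# Four single-row partitions, using the same sample rows as this tutorial.
parts = [
    [(1, "Aamir Shahzad", 35)],
    [(2, "Ali Raza", 30)],
    [(3, "Bob", 25)],
    [(4, "Lisa", 28)],
]

print(len(coalesce_model(parts, 1)))      # 1
print(len(coalesce_model(parts, 10)))     # 4 (cannot grow)
print(len(repartition_model(parts, 10)))  # 10 (some partitions are empty)
```

In this model, asking coalesce_model for more partitions than exist leaves the count unchanged, while repartition_model can grow the count because every row is moved.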

2. Create Spark Session

from pyspark.sql import SparkSession

# A single trailing backslash continues the statement on the next line
spark = SparkSession.builder \
    .appName("PySpark coalesce() Example") \
    .getOrCreate()

3. Create Sample DataFrame

data = [
    (1, "Aamir Shahzad", 35),
    (2, "Ali Raza", 30),
    (3, "Bob", 25),
    (4, "Lisa", 28)
]

columns = ["id", "name", "age"]

df = spark.createDataFrame(data, columns)
df.show()
+---+--------------+---+
| id|          name|age|
+---+--------------+---+
|  1| Aamir Shahzad| 35|
|  2|      Ali Raza| 30|
|  3|           Bob| 25|
|  4|          Lisa| 28|
+---+--------------+---+

4. Check Number of Partitions Before coalesce()

print("Partitions before coalesce:", df.rdd.getNumPartitions())
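Partitions before coalesce: 4

Note: the initial count depends on your environment (for example, the number of cores available to a local session), so you may see a different number.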

5. Apply coalesce() to Reduce Partitions

df_coalesced = df.coalesce(1)

6. Check Number of Partitions After coalesce()

print("Partitions after coalesce:", df_coalesced.rdd.getNumPartitions())
Partitions after coalesce: 1

7. Show Transformed Data

df_coalesced.show()
+---+--------------+---+
| id|          name|age|
+---+--------------+---+
|  1| Aamir Shahzad| 35|
|  2|      Ali Raza| 30|
|  3|           Bob| 25|
|  4|          Lisa| 28|
+---+--------------+---+
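
coalesce() is often applied right before a write so that the job does not produce many small files. The sketch below shows one possible way to choose the target partition count from an estimated data size. The helper names target_partitions and write_compacted are hypothetical, the 128 MiB target file size is only a rule-of-thumb assumption, and the size estimate has to be supplied by the caller.

```python
def target_partitions(total_bytes: int,
                      target_file_bytes: int = 128 * 1024 * 1024) -> int:
    """Hypothetical helper: how many output files to aim for.

    Assumption: roughly one file per 128 MiB of data, and always at least one.
    """
    if total_bytes <= 0:
        return 1
    # Integer ceiling division, so any remainder gets its own file.
    return (total_bytes + target_file_bytes - 1) // target_file_bytes


def write_compacted(df, path: str, total_bytes: int) -> None:
    """Hypothetical helper: reduce partitions, then write Parquet files.

    df is expected to be a PySpark DataFrame. coalesce() never increases the
    partition count, so a target above the current count leaves df unchanged.
    """
    n = target_partitions(total_bytes)
    df.coalesce(n).write.mode("overwrite").parquet(path)


print(target_partitions(1024 ** 3))  # 8 for about 1 GiB of data
print(target_partitions(4096))       # 1 for a tiny dataset
```

For the four-row DataFrame in this tutorial the helper returns 1, which matches the coalesce(1) call used above. For large datasets, a single output partition means a single task writes all the data, so a higher target is usually the safer choice.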

Author: Aamir Shahzad | TechBrothersIT

© 2025 PySpark Tutorials. All rights reserved.
