How to Use the persist() Function in PySpark – Cache vs Persist Explained
The persist() function in PySpark caches or stores a DataFrame's intermediate results. It's especially useful when the same DataFrame is reused multiple times and you want to avoid recomputation for performance gains. Below you'll learn how it differs from cache() and how to use it effectively.
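For context, cache() on a DataFrame is simply persist() with a default storage level (MEMORY_AND_DISK for DataFrames; the exact level varies slightly by Spark version), while persist() accepts an explicit StorageLevel. A minimal sketch, assuming a DataFrame named df like the one built in Step 1:

from pyspark.storagelevel import StorageLevel

df.cache()                               # shorthand: uses the default storage level
# df.persist(StorageLevel.MEMORY_ONLY)  # persist() lets you choose the level explicitly
# Note: these are alternatives - a DataFrame's storage level cannot be changed
# once it has been assigned, so pick one or the other.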
Step 1: Create a Sample DataFrame
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [
("Aamir Shahzad", "Engineering", 100000),
("Ali Raza", "HR", 70000),
("Bob", "Engineering", 80000),
("Lisa", "Marketing", 65000)
]
columns = ["name", "department", "salary"]
df = spark.createDataFrame(data, columns)
print("📌 Sample DataFrame:")
df.show()
Step 2: Use persist() to Store an Intermediate Result
from pyspark.storagelevel import StorageLevel
df_cached = df.filter(df.salary > 70000).persist(StorageLevel.MEMORY_AND_DISK)
print("📌 Count of high earners (cached):", df_cached.count())
print("📌 Average salary (cached):", df_cached.groupBy().avg("salary").collect())
Step 3: Optionally Unpersist to Free Memory
df_cached.unpersist()
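unpersist() also accepts an optional blocking flag; passing blocking=True waits until all cached blocks are removed before returning (a small variant sketch):

df_cached.unpersist(blocking=True)                            # block until the data is freed
print("📌 Is cached after unpersist:", df_cached.is_cached)   # False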
Summary
- persist() allows finer control than cache() by letting you specify the storage level (memory, disk, or both); common levels are shown in the sketch after this list.
- Useful for long-running or reused DataFrames to reduce recomputation.
- Always call unpersist() after you're done to free up resources.
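The levels below are the ones most commonly passed to persist(); the right choice depends on how much memory you can spare versus how expensive recomputation is (an illustrative sketch, not an exhaustive list):

from pyspark.storagelevel import StorageLevel

df.persist(StorageLevel.MEMORY_ONLY)          # keep data in memory only
# df.persist(StorageLevel.MEMORY_AND_DISK)    # spill partitions that don't fit in memory to disk
# df.persist(StorageLevel.DISK_ONLY)          # store on disk only
# df.persist(StorageLevel.MEMORY_AND_DISK_2)  # MEMORY_AND_DISK, replicated on two nodes
# (alternatives are commented out because a storage level cannot be changed once assigned)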