PySpark unpersist() Explained – How to Free Memory in Spark
In this tutorial, you'll learn how to use the unpersist() method in PySpark to release the memory or disk used by persisted or cached DataFrames and RDDs. Releasing caches you no longer need is essential for managing memory and performance when working with large datasets in Spark.
Step 1: Create a Sample DataFrame
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for this tutorial
spark = SparkSession.builder.appName("UnpersistDemo").getOrCreate()

data = [("Aamir Shahzad",), ("Ali Raza",), ("Bob",), ("Lisa",)]
columns = ["name"]
df = spark.createDataFrame(data, columns)

print("📌 Original DataFrame:")
df.show()
Step 2: Persist the DataFrame
df.persist()  # default storage level keeps data in memory and spills to disk if needed
print("✅ DataFrame is now persisted (cached with the default storage level).")
print(df.is_cached)  # Output: True
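If the default storage level isn't what you need, persist() also accepts one explicitly. Here is a minimal sketch on a separate DataFrame, since a storage level can't be changed while one is already set; df2 is just an illustrative name:

from pyspark import StorageLevel

df2 = spark.createDataFrame(data, columns)
df2.persist(StorageLevel.DISK_ONLY)  # cache partitions to disk only, skipping memory
print(df2.storageLevel)              # shows the active storage level
df2.unpersist()                      # release it again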
Step 3: Unpersist the DataFrame
df_unpersisted = df.unpersist()  # returns the same DataFrame, no longer marked for caching
print("✅ DataFrame is now unpersisted.")
df_unpersisted.show()
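Note that unpersist() returns the same DataFrame rather than a copy, so the assignment above is optional. A quick check, continuing from the steps above:

print(df is df_unpersisted)  # True – unpersist() returns self
print(df.is_cached)          # False – the cache flag is now cleared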
Step 4: Unpersist with blocking=True
df.persist()                 # persist again so there is something to release
df.unpersist(blocking=True)  # blocks until all cached blocks are actually removed
print("✅ DataFrame is now unpersisted with blocking=True (waits for cleanup).")
Summary
- persist() stores a DataFrame in memory or on disk to speed up performance.
- unpersist() is used to clear cached/persisted data and release resources.
- Always use unpersist() after you're done with cached data to avoid memory issues.
- Use blocking=True if you want to wait for full cleanup before continuing.
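The same pattern applies to RDDs, which this tutorial mentions alongside DataFrames. A minimal sketch, reusing the spark session from Step 1:

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
rdd.persist()          # mark the RDD for caching
print(rdd.is_cached)   # True
rdd.unpersist()        # release the cached partitions
print(rdd.is_cached)   # False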