PySpark unpersist() Explained – How to Free Memory in Spark
In this tutorial, you'll learn how to use the unpersist() method in PySpark to release the memory or disk used by persisted or cached DataFrames and RDDs. Releasing caches you no longer need is essential for managing memory and performance when working with large datasets in Spark.
Step 1: Create a Sample DataFrame
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for this tutorial
spark = SparkSession.builder.appName("UnpersistDemo").getOrCreate()

data = [("Aamir Shahzad",), ("Ali Raza",), ("Bob",), ("Lisa",)]
columns = ["name"]
df = spark.createDataFrame(data, columns)

print("📌 Original DataFrame:")
df.show()
Step 2: Persist the DataFrame
df.persist()  # default storage level keeps data in memory and spills to disk if needed
print("✅ DataFrame is now persisted (cached with the default storage level).")
print(df.is_cached)  # Output: True
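If the default storage level isn't what you need, persist() also accepts one explicitly. Here is a minimal sketch on a separate DataFrame, since a storage level can't be changed while one is already set; df2 is just an illustrative name:

from pyspark import StorageLevel

df2 = spark.createDataFrame(data, columns)
df2.persist(StorageLevel.DISK_ONLY)  # cache partitions to disk only, skipping memory
print(df2.storageLevel)              # shows the active storage level
df2.unpersist()                      # release it again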
Step 3: Unpersist the DataFrame
df_unpersisted = df.unpersist()  # returns the same DataFrame, no longer marked for caching
print("✅ DataFrame is now unpersisted.")
df_unpersisted.show()
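Note that unpersist() returns the same DataFrame rather than a copy, so the assignment above is optional. A quick check, continuing from the steps above:

print(df is df_unpersisted)  # True – unpersist() returns self
print(df.is_cached)          # False – the cache flag is now cleared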
Step 4: Unpersist with blocking=True
df.persist()                 # persist again so there is something to release
df.unpersist(blocking=True)  # blocks until all cached blocks are actually removed
print("✅ DataFrame is now unpersisted with blocking=True (waits for cleanup).")
Summary
- persist() stores a DataFrame in memory or on disk to speed up performance.
- unpersist() is used to clear cached/persisted data and release resources.
- Always use unpersist() after you're done with cached data to avoid memory issues.
- Use blocking=True if you want to wait for full cleanup before continuing.
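The same pattern applies to RDDs, which this tutorial mentions alongside DataFrames. A minimal sketch, reusing the spark session from Step 1:

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
rdd.persist()          # mark the RDD for caching
print(rdd.is_cached)   # True
rdd.unpersist()        # release the cached partitions
print(rdd.is_cached)   # False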