How to Access the RDD from a DataFrame Using .rdd in PySpark
The .rdd property lets you access the underlying RDD (Resilient Distributed Dataset) of a DataFrame. This is useful when you need low-level RDD operations or more control than the DataFrame API provides.
Step 1: Create a Sample DataFrame
from pyspark.sql import SparkSession

# Create a SparkSession if one is not already active
spark = SparkSession.builder.appName("RDDFromDataFrame").getOrCreate()

data = [
    ("Aamir Shahzad", "Engineering", 100000),
    ("Ali Raza", "HR", 70000),
    ("Bob", "Engineering", 80000),
    ("Lisa", "Marketing", 65000)
]
columns = ["name", "department", "salary"]
df = spark.createDataFrame(data, columns)
print("📌 Original DataFrame:")
df.show()
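As a quick check, df.printSchema() shows the schema Spark inferred from the Python values (strings for name and department, a long for salary):

print("📌 Inferred schema:")
df.printSchema()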
Step 2: Convert DataFrame to RDD
rdd_from_df = df.rdd
print("📌 Type of rdd_from_df:")
print(type(rdd_from_df)) # Output: <class 'pyspark.rdd.RDD'>
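Each element of this RDD is a Row object, so you can convert back to a DataFrame at any point. A minimal sketch using toDF(), which rebuilds the schema from the Row fields:

# Round-trip: turn the RDD of Row objects back into a DataFrame
df_again = rdd_from_df.toDF()
df_again.show()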
Step 3: Print RDD Contents
print("📌 RDD Contents:")
for row in rdd_from_df.collect():
    print(row)
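Keep in mind that collect() brings every row to the driver, which can be expensive on large datasets. For a quick peek, take(n) fetches only the first n elements:

# Fetch just the first two rows instead of the whole dataset
for row in rdd_from_df.take(2):
    print(row)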
Step 4: Use RDD Transformations
# Extract just names from RDD
name_rdd = df.rdd.map(lambda row: row['name'])
print("📌 Names from RDD:")
print(name_rdd.collect())
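The same pattern extends to other RDD operations. As a sketch grounded in the sample data above, you can chain filter, map, and reduce to total the Engineering salaries:

# Keep Engineering rows, extract their salaries, and sum with reduce
eng_total = (
    df.rdd
    .filter(lambda row: row['department'] == 'Engineering')
    .map(lambda row: row['salary'])
    .reduce(lambda a, b: a + b)
)
print("📌 Total Engineering salary:")
print(eng_total)  # 100000 + 80000 = 180000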
Summary
✅ Use .rdd to convert a DataFrame to an RDD of Row objects.
✅ You can then apply RDD operations like map, filter, and reduce.
✅ Useful when you need more flexibility than the DataFrame API allows (see the sketch below).
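As one example of that extra flexibility, mapPartitions runs a function once per partition, something the DataFrame API does not expose directly. A minimal sketch that counts the rows in each partition (the helper name is illustrative):

# mapPartitions receives an iterator over one partition's rows;
# yield a single count per partition
def rows_per_partition(partition):
    yield sum(1 for _ in partition)

print("📌 Rows per partition:")
print(df.rdd.mapPartitions(rows_per_partition).collect())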