How to Access the RDD from a DataFrame Using .rdd in PySpark
The .rdd property lets you access the underlying RDD (Resilient Distributed Dataset) of a DataFrame. This is useful when you need low-level RDD operations or more control than the DataFrame API provides.
Step 1: Create a Sample DataFrame
from pyspark.sql import SparkSession

# Create a SparkSession if one is not already active
spark = SparkSession.builder.appName("RDDFromDataFrame").getOrCreate()

data = [
    ("Aamir Shahzad", "Engineering", 100000),
    ("Ali Raza", "HR", 70000),
    ("Bob", "Engineering", 80000),
    ("Lisa", "Marketing", 65000)
]
columns = ["name", "department", "salary"]
df = spark.createDataFrame(data, columns)
print("📌 Original DataFrame:")
df.show()
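As a quick check, df.printSchema() shows the schema Spark inferred from the Python values (strings for name and department, a long for salary):

print("📌 Inferred schema:")
df.printSchema()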
Step 2: Convert DataFrame to RDD
rdd_from_df = df.rdd
print("📌 Type of rdd_from_df:")
print(type(rdd_from_df)) # Output: <class 'pyspark.rdd.RDD'>
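Each element of this RDD is a Row object, so you can convert back to a DataFrame at any point. A minimal sketch using toDF(), which rebuilds the schema from the Row fields:

# Round-trip: turn the RDD of Row objects back into a DataFrame
df_again = rdd_from_df.toDF()
df_again.show()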
Step 3: Print RDD Contents
print("📌 RDD Contents:")
for row in rdd_from_df.collect():
    print(row)
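Keep in mind that collect() brings every row to the driver, which can be expensive on large datasets. For a quick peek, take(n) fetches only the first n elements:

# Fetch just the first two rows instead of the whole dataset
for row in rdd_from_df.take(2):
    print(row)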
Step 4: Use RDD Transformations
# Extract just names from RDD
name_rdd = df.rdd.map(lambda row: row['name'])
print("📌 Names from RDD:")
print(name_rdd.collect())
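The same pattern extends to other RDD operations. As a sketch grounded in the sample data above, you can chain filter, map, and reduce to total the Engineering salaries:

# Keep Engineering rows, extract their salaries, and sum with reduce
eng_total = (
    df.rdd
    .filter(lambda row: row['department'] == 'Engineering')
    .map(lambda row: row['salary'])
    .reduce(lambda a, b: a + b)
)
print("📌 Total Engineering salary:")
print(eng_total)  # 100000 + 80000 = 180000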
Summary
✅ Use .rdd to convert a DataFrame to an RDD of Row objects.
✅ You can then apply RDD operations like map, filter, and reduce.
✅ Useful when you need more flexibility than the DataFrame API allows (see the sketch below).
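As one example of that extra flexibility, mapPartitions runs a function once per partition, something the DataFrame API does not expose directly. A minimal sketch that counts the rows in each partition (the helper name is illustrative):

# mapPartitions receives an iterator over one partition's rows;
# yield a single count per partition
def rows_per_partition(partition):
    yield sum(1 for _ in partition)

print("📌 Rows per partition:")
print(df.rdd.mapPartitions(rows_per_partition).collect())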