How to Convert PySpark DataFrame to RDD Using .rdd

The .rdd property lets you access the underlying RDD (Resilient Distributed Dataset) of a DataFrame. This is useful when you need low-level RDD operations or more control than the DataFrame API provides.

Step 1: Create a Sample DataFrame

from pyspark.sql import SparkSession

# Start (or reuse) a SparkSession, the entry point for DataFrame work
spark = SparkSession.builder.appName("DataFrameToRDD").getOrCreate()

data = [
    ("Aamir Shahzad", "Engineering", 100000),
    ("Ali Raza", "HR", 70000),
    ("Bob", "Engineering", 80000),
    ("Lisa", "Marketing", 65000)
]

columns = ["name", "department", "salary"]
df = spark.createDataFrame(data, columns)

print("📌 Original DataFrame:")
df.show()
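For reference, df.show() renders the sample data as a table along these lines:

📌 Original DataFrame:
+-------------+-----------+------+
|         name| department|salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|100000|
|     Ali Raza|         HR| 70000|
|          Bob|Engineering| 80000|
|         Lisa|  Marketing| 65000|
+-------------+-----------+------+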

Step 2: Convert DataFrame to RDD

rdd_from_df = df.rdd

print("📌 Type of rdd_from_df:")
print(type(rdd_from_df))  # Output: <class 'pyspark.rdd.RDD'>
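Each element in this RDD is a pyspark.sql.Row object, so fields can be read by name. A quick sanity check (note that first() triggers a Spark job):

# Peek at the first element; it is a Row, not a plain tuple
first_row = rdd_from_df.first()
print(first_row)           # Row(name='Aamir Shahzad', department='Engineering', salary=100000)
print(first_row['name'])   # Aamir Shahzad
print(first_row.salary)    # 100000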

Step 3: Print RDD Contents

print("📌 RDD Contents:")
for row in rdd_from_df.collect():
    print(row)
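If plain Python dictionaries are easier to work with downstream, each Row can be converted with asDict(). A small illustrative snippet:

# Convert each Row into a plain Python dict
dict_rdd = rdd_from_df.map(lambda row: row.asDict())
print(dict_rdd.first())
# {'name': 'Aamir Shahzad', 'department': 'Engineering', 'salary': 100000}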

Step 4: Use RDD Transformations

# Extract just the names from the RDD
name_rdd = rdd_from_df.map(lambda row: row['name'])

print("📌 Names from RDD:")
print(name_rdd.collect())
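Other RDD transformations work the same way. For example, here is a minimal sketch (using the sample data above) that filters for Engineering rows and sums their salaries with reduce:

# Keep only Engineering rows, then pull out the salary column
eng_salaries = rdd_from_df.filter(lambda row: row['department'] == 'Engineering') \
                          .map(lambda row: row['salary'])

# reduce combines the remaining values pairwise into a single total
total = eng_salaries.reduce(lambda a, b: a + b)
print("Total Engineering salary:", total)  # 180000 for this sample data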

Summary

✅ Use .rdd to convert a DataFrame to an RDD of Row objects.
✅ You can then apply RDD operations like map, filter, and reduce.
✅ Useful when you need more flexibility than the DataFrame API allows.
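If you need to go the other way, an RDD of Row objects can be turned back into a DataFrame. A minimal sketch; toDF() infers the schema from the Row fields:

# Round trip: RDD of Row objects back to a DataFrame
df_again = rdd_from_df.toDF()
df_again.show()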

📺 Watch the Full Tutorial
