RDD vs DataFrame in PySpark | Key Differences with Examples
Overview
In this tutorial, we explore the fundamental differences between RDD (Resilient Distributed Dataset) and DataFrame in PySpark. You will learn how both are used, when to prefer one over the other, and how their performance and schema-handling differ in real data engineering scenarios.
1️⃣ What is an RDD?
RDD (Resilient Distributed Dataset) is Spark's low-level abstraction for distributed data processing. It is immutable, fault-tolerant, and supports functional transformations such as map(), flatMap(), and filter().
# Create an RDD of (name, age) tuples from a local Python list
rdd = spark.sparkContext.parallelize([("Alice", 30), ("Bob", 25)])
print(rdd.collect())  # [('Alice', 30), ('Bob', 25)]
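To illustrate the functional style mentioned above, here is a minimal sketch (assuming the same rdd and an active SparkSession named spark from the snippet above) that chains filter() and map():

# Keep people older than 26, then extract just the names in upper case
adults = rdd.filter(lambda person: person[1] > 26)
names = adults.map(lambda person: person[0].upper())
print(names.collect())  # ['ALICE']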
2️⃣ What is a DataFrame?
A DataFrame is a distributed collection of data organized into named columns, much like a table in a relational database. It is optimized for performance by the Catalyst optimizer and the Tungsten execution engine, and it supports SQL queries.
from pyspark.sql import Row

# Build a DataFrame from Row objects; column names and types are taken from the Rows
data = [Row(name="Alice", age=30), Row(name="Bob", age=25)]
df = spark.createDataFrame(data)
df.show()
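Because a DataFrame carries a schema, it can also be registered as a temporary view and queried with SQL. A minimal sketch, assuming the df created above and an active SparkSession named spark:

# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 26").show()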
3️⃣ RDD vs DataFrame Comparison Table
| Feature | RDD | DataFrame |
|---|---|---|
| Schema | Not enforced | Enforced (column names and types) |
| Performance | Slower (no query optimization) | Optimized (Catalyst, Tungsten) |
| Ease of Use | Harder (more code) | Easier (SQL and DataFrame APIs) |
| Use Case | Complex transformations, unstructured data | Structured data, analytics, ML pipelines |
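To see the schema difference from the table in practice, the sketch below (assuming the rdd and spark objects created earlier) converts the RDD into a DataFrame with named columns and prints the inferred schema:

# The RDD carries no schema; creating a DataFrame from it attaches column names and inferred types
df_from_rdd = spark.createDataFrame(rdd, ["name", "age"])
df_from_rdd.printSchema()  # prints the inferred column names and types
df_from_rdd.show()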