Using offset() in PySpark
The offset() method skips a given number of rows before returning results from a DataFrame. It is available in PySpark 3.4 and later, and is commonly combined with orderBy() and limit() for pagination or partial data fetching.
Step 1: Create Sample Data
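from pyspark.sql import SparkSession

# Reuse the active session if one exists (e.g. in a notebook), otherwise create one
spark = SparkSession.builder.appName("OffsetExample").getOrCreate()

# Each tuple is (name, department, salary)
data = [
    ("Aamir Shahzad", "Engineering", 100000),
    ("Ali Raza", "HR", 70000),
    ("Bob", "Engineering", 80000),
    ("Lisa", "Marketing", 65000),
    ("Aamir Shahzad", "Engineering", 95000),
    ("Ali Raza", "HR", 72000),
    ("Bob", "Engineering", 85000),
    ("Lisa", "Marketing", 66000),
]
columns = ["name", "department", "salary"]
df = spark.createDataFrame(data, columns)
df.show()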
Step 2: Use offset() with orderBy() and limit()
# Example: Skip first 2 rows after sorting by salary
paginated_df = df.orderBy("salary", ascending=True).offset(2).limit(3)
print("📊 Paginated Result (Skip 2, Take 3):")
paginated_df.show()
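As a sanity check on which rows "skip 2, take 3" should return, the same logic can be sketched in plain Python with list slicing. This only illustrates the semantics and is not how Spark executes the query; the names rows, skip, take, ordered, and page are introduced here for the example.

```python
# Same sample rows as in Step 1: (name, department, salary)
rows = [
    ("Aamir Shahzad", "Engineering", 100000),
    ("Ali Raza", "HR", 70000),
    ("Bob", "Engineering", 80000),
    ("Lisa", "Marketing", 65000),
    ("Aamir Shahzad", "Engineering", 95000),
    ("Ali Raza", "HR", 72000),
    ("Bob", "Engineering", 85000),
    ("Lisa", "Marketing", 66000),
]

skip, take = 2, 3

# Sort by salary ascending, like orderBy("salary", ascending=True)
ordered = sorted(rows, key=lambda r: r[2])

# Slicing [skip : skip + take] mirrors offset(skip).limit(take)
page = ordered[skip:skip + take]

for row in page:
    print(row)
# ('Ali Raza', 'HR', 70000)
# ('Ali Raza', 'HR', 72000)
# ('Bob', 'Engineering', 80000)
```

The two lowest salaries (65000 and 66000) are skipped, and the next three rows are returned.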
Use Cases of offset()
- Paginating large DataFrames
- Exploring partial results
- Implementing front-end-style pagination in reports
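For the pagination use cases above, the offset is usually derived from a page number and a page size. The following is a minimal sketch under that assumption; page_offset and fetch_page are hypothetical helper names, not part of PySpark, and pages are assumed to be numbered from 1.

```python
def page_offset(page_number, page_size):
    """Return how many rows to skip for a 1-based page number."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be >= 1")
    return (page_number - 1) * page_size


def fetch_page(df, sort_col, page_number, page_size):
    """Return one page of df, ordered by sort_col (hypothetical helper).

    Sorting first matters: without orderBy(), which rows are skipped
    is not guaranteed to be the same from one run to the next.
    """
    skip = page_offset(page_number, page_size)
    return df.orderBy(sort_col).offset(skip).limit(page_size)
```

With the sample DataFrame, fetch_page(df, "salary", 2, 3).show() would skip the first three rows by salary and display the next three. If the sort column can contain ties, adding a second column to the ordering as a tiebreaker is one way to keep pages stable between calls.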