PySpark offset() Function: How to Skip Rows in Spark DataFrame | PySpark Tutorial for Beginners

Using offset() in PySpark

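The offset() method skips a given number of rows from the start of a DataFrame and returns the remaining rows as a new DataFrame. It is available in PySpark 3.4 and later. It is usually chained after orderBy(), so that the skipped rows are well defined, and before limit(), which caps how many rows come back. Together the three calls support pagination or partial data fetching.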

Step 1: Create Sample Data

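from pyspark.sql import SparkSession

# Reuse the active session if one exists (e.g. in a notebook), otherwise create one
spark = SparkSession.builder.getOrCreate()

# Sample employee rows: (name, department, salary)
data = [
  ("Aamir Shahzad", "Engineering", 100000),
  ("Ali Raza", "HR", 70000),
  ("Bob", "Engineering", 80000),
  ("Lisa", "Marketing", 65000),
  ("Aamir Shahzad", "Engineering", 95000),
  ("Ali Raza", "HR", 72000),
  ("Bob", "Engineering", 85000),
  ("Lisa", "Marketing", 66000)
]

columns = ["name", "department", "salary"]
df = spark.createDataFrame(data, columns)
df.show()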

Step 2: Use offset() with orderBy() and limit()

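# Example: sort by salary (ascending), skip the 2 lowest-paid rows, then take the next 3.
# Sorting first matters: without orderBy(), the rows that get skipped are not guaranteed.
paginated_df = df.orderBy("salary", ascending=True).offset(2).limit(3)

# With the sample data this leaves the rows with salaries 70000, 72000 and 80000
print("📊 Paginated Result (Skip 2, Take 3):")
paginated_df.show()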
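
To check which rows the call above should keep, the same skip-then-take logic can be mimicked in plain Python with sorting and slicing. This is only an analogy on a local list built from the Step 1 sample rows, not a description of how Spark executes the query, and the helper name skip_take is made up for illustration.

```python
# Plain-Python analogy for df.orderBy("salary").offset(2).limit(3).
# `skip_take` is a hypothetical helper, not part of PySpark.
rows = [
    ("Aamir Shahzad", "Engineering", 100000),
    ("Ali Raza", "HR", 70000),
    ("Bob", "Engineering", 80000),
    ("Lisa", "Marketing", 65000),
    ("Aamir Shahzad", "Engineering", 95000),
    ("Ali Raza", "HR", 72000),
    ("Bob", "Engineering", 85000),
    ("Lisa", "Marketing", 66000),
]


def skip_take(sorted_rows, offset, limit):
    """Drop the first `offset` rows, then keep at most `limit` rows."""
    return sorted_rows[offset:offset + limit]


by_salary = sorted(rows, key=lambda r: r[2])  # ascending by salary
page = skip_take(by_salary, offset=2, limit=3)
for row in page:
    print(row)
```

The two lowest salaries (65000 and 66000) are skipped, and the next three rows in salary order are kept.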

Use Cases of offset()

  • Paginating large DataFrames
  • Exploring partial results
  • Implementing front-end-style pagination in reports
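
For the pagination use case, a page number usually has to be translated into an offset first. The following is a minimal sketch of one way to do that, assuming 1-based page numbers and PySpark 3.4 or later; page_bounds and fetch_page are hypothetical helper names, not part of the PySpark API.

```python
from typing import TYPE_CHECKING, Sequence, Tuple

if TYPE_CHECKING:  # only needed for type hints; Spark is not imported at runtime
    from pyspark.sql import DataFrame


def page_bounds(page_number: int, page_size: int) -> Tuple[int, int]:
    """Translate a 1-based page number into an (offset, limit) pair."""
    if page_number < 1 or page_size < 1:
        raise ValueError("page_number and page_size must be >= 1")
    return (page_number - 1) * page_size, page_size


def fetch_page(df: "DataFrame", sort_cols: Sequence[str],
               page_number: int, page_size: int) -> "DataFrame":
    """Return one page of `df`, sorted ascending by `sort_cols`."""
    offset, limit = page_bounds(page_number, page_size)
    # Sort first so the same rows are skipped on every call.
    return df.orderBy(*sort_cols).offset(offset).limit(limit)
```

With the sample DataFrame, fetch_page(df, ["salary", "name"], page_number=2, page_size=3).show() would display the fourth through sixth rows in salary order. Passing a second sort column such as name is a design choice that breaks ties, so pages stay stable when several rows share the same salary.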
