PySpark limit() Function Explained with Examples | Step-by-Step Guide


The limit() function in PySpark returns a new DataFrame containing at most the specified number of rows. It is useful for previewing data or fetching a small subset for quick analysis, which helps data engineers working with large datasets.

Sample Data

data = [
    (1, "Alice", 5000),
    (2, "Bob", 6000),
    (3, "Charlie", 7000),
    (4, "David", 8000),
    (5, "Eve", 9000),
    (6, "Frank", 10000),
    (7, "Grace", 11000),
    (8, "Hannah", 12000),
    (9, "Ian", 13000),
    (10, "Jack", 14000)
]

Create a DataFrame

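from pyspark.sql import SparkSession

# Reuse the active session if one exists (e.g. in a notebook), otherwise create one
spark = SparkSession.builder.appName("LimitExample").getOrCreate()

df = spark.createDataFrame(data, ["id", "name", "salary"])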

Show the Full DataFrame

df.show()

Example 1: Get the First 5 Rows

df.limit(5).show()

Example 2: Get the First 3 Rows

df.limit(3).show()

Example 3: Store the Limited DataFrame

df_limited = df.limit(4)
df_limited.show()
