How to Use first(), head(), and tail() Functions in PySpark | Step-by-Step Guide

Author: Aamir Shahzad

Date: March 2025

Introduction

In PySpark, first(), head(), and tail() are DataFrame actions that return rows to the driver as Row objects. They are useful for inspecting data, debugging, and running quick checks.

Why Use These Functions?

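first() returns the first row of the DataFrame as a single Row object, or None if the DataFrame is empty.
head(n) returns the first n rows of the DataFrame as a list of Row objects. Called without an argument, head() returns a single Row, just like first().
tail(n) returns the last n rows of the DataFrame as a list of Row objects. It requires Spark 3.0 or later and moves the requested rows to the driver, so keep n small.

Note that Spark does not guarantee row order unless the DataFrame has been sorted explicitly, for example with orderBy().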

Step 1: Create SparkSession

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySparkFirstHeadTailFunctions") \
    .getOrCreate()

Step 2: Create a Sample DataFrame

data = [
    ("Aamir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
    ("Raza", "Marketing", 3500),
    ("Bob", "Sales", 4200),
    ("Lisa", "Engineering", 6000)
]

columns = ["Name", "Department", "Salary"]

df = spark.createDataFrame(data, schema=columns)

df.show()

Expected Output

+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|          Ali|      Sales|  4000|
|         Raza|  Marketing|  3500|
|          Bob|      Sales|  4200|
|         Lisa|Engineering|  6000|
+-------------+-----------+------+

Step 3: Using head() Function

# Get the first 3 rows using head()
head_rows = df.head(3)

# Print each row
for row in head_rows:
    print(row)

Expected Output

Row(Name='Aamir Shahzad', Department='Engineering', Salary=5000)
Row(Name='Ali', Department='Sales', Salary=4000)
Row(Name='Raza', Department='Marketing', Salary=3500)

Step 4: Using first() Function

# Get the first row
first_row = df.first()

# Print the first row
print(first_row)

Expected Output

Row(Name='Aamir Shahzad', Department='Engineering', Salary=5000)

Step 5: Using tail() Function

# Get the last 2 rows
tail_rows = df.tail(2)

# Print each row
for row in tail_rows:
    print(row)

Expected Output

Row(Name='Bob', Department='Sales', Salary=4200)
Row(Name='Lisa', Department='Engineering', Salary=6000)

Conclusion

PySpark provides several functions to access rows in a DataFrame. first() returns a single Row, while head(n) and tail(n) return lists of Row objects taken from the start and the end of the DataFrame. Knowing what each one returns makes data inspection and debugging quicker.
