How to Use first(), head(), and tail() Functions in PySpark
Author: Aamir Shahzad
Date: March 2025
Introduction
In PySpark, first(), head(), and tail() are DataFrame methods that return rows to the driver as Row objects. They are particularly useful for inspecting data, debugging, and running quick checks on a small number of rows.
Why Use These Functions?
first() returns the first row of the DataFrame.
head(n) returns the first n rows of the DataFrame as a list of Row objects.
tail(n) returns the last n rows of the DataFrame as a list of Row objects.
Step 1: Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySparkFirstHeadTailFunctions") \
    .getOrCreate()
Step 2: Create a Sample DataFrame
data = [
    ("Aamir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
    ("Raza", "Marketing", 3500),
    ("Bob", "Sales", 4200),
    ("Lisa", "Engineering", 6000)
]
columns = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, schema=columns)
df.show()
Expected Output
+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|          Ali|      Sales|  4000|
|         Raza|  Marketing|  3500|
|          Bob|      Sales|  4200|
|         Lisa|Engineering|  6000|
+-------------+-----------+------+
Step 3: Using head() Function
# Get the first 3 rows using head()
head_rows = df.head(3)
# Print each row
for row in head_rows:
    print(row)
Expected Output
Row(Name='Aamir Shahzad', Department='Engineering', Salary=5000)
Row(Name='Ali', Department='Sales', Salary=4000)
Row(Name='Raza', Department='Marketing', Salary=3500)
Step 4: Using first() Function
# Get the first row
first_row = df.first()
# Print the first row
print(first_row)
Expected Output
Row(Name='Aamir Shahzad', Department='Engineering', Salary=5000)
Step 5: Using tail() Function
# Get the last 2 rows
tail_rows = df.tail(2)
# Print each row
for row in tail_rows:
    print(row)
Expected Output
Row(Name='Bob', Department='Sales', Salary=4200)
Row(Name='Lisa', Department='Engineering', Salary=6000)
Conclusion
PySpark provides several functions to access rows in a DataFrame. first() returns a single Row, while head(n) and tail(n) return lists of Row objects taken from the start and end of the DataFrame. Understanding these differences makes data inspection and debugging more effective during data processing tasks.


