How to Use first(), head(), and tail() Functions in PySpark
Author: Aamir Shahzad
Date: March 2025
Introduction
In PySpark, first(), head(), and tail() are DataFrame actions that return rows from the start or end of a DataFrame to the driver as Row objects. They are particularly useful for inspecting data, debugging, and performing quick checks.
Why Use These Functions?
first() returns the first row of the DataFrame.
head(n) returns the first n rows of the DataFrame as a list of Row objects.
tail(n) returns the last n rows of the DataFrame as a list of Row objects.
Step 1: Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySparkFirstHeadTailFunctions") \
    .getOrCreate()
Step 2: Create a Sample DataFrame
data = [
    ("Aamir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
    ("Raza", "Marketing", 3500),
    ("Bob", "Sales", 4200),
    ("Lisa", "Engineering", 6000)
]
columns = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, schema=columns)
df.show()
Expected Output
+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|          Ali|      Sales|  4000|
|         Raza|  Marketing|  3500|
|          Bob|      Sales|  4200|
|         Lisa|Engineering|  6000|
+-------------+-----------+------+
Step 3: Using head() Function
# Get the first 3 rows using head()
head_rows = df.head(3)
# Print each row
for row in head_rows:
    print(row)
Expected Output
Row(Name='Aamir Shahzad', Department='Engineering', Salary=5000)
Row(Name='Ali', Department='Sales', Salary=4000)
Row(Name='Raza', Department='Marketing', Salary=3500)
Step 4: Using first() Function
# Get the first row
first_row = df.first()
# Print the first row
print(first_row)
Expected Output
Row(Name='Aamir Shahzad', Department='Engineering', Salary=5000)
Step 5: Using tail() Function
# Get the last 2 rows
tail_rows = df.tail(2)
# Print each row
for row in tail_rows:
    print(row)
Expected Output
Row(Name='Bob', Department='Sales', Salary=4200)
Row(Name='Lisa', Department='Engineering', Salary=6000)
Conclusion
PySpark provides several functions to access rows in a DataFrame. first() returns a single Row, while head(n) and tail(n) return lists of Row objects from the start and end of the DataFrame. Understanding these differences helps in retrieving the right rows during data inspection and debugging.