PySpark Date Truncation Tutorial: trunc(), date_trunc(), last_day() with Real Examples | PySpark Tutorial

104- PySpark Date Truncation Tutorial | trunc(), date_trunc(), last_day() with Real Examples

Introduction

In this tutorial, we'll look at three PySpark date functions: trunc(), date_trunc(), and last_day(). They snap dates and timestamps to period boundaries, which is useful when grouping records by year or month or when computing reporting periods. Each function is shown with a short example.

1. trunc() Function

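Definition: The trunc() function truncates a date to the start of a given unit: year, month, quarter, or week. It takes the column first and the unit second, and it returns a date. Units finer than a week, such as day or hour, are not supported by trunc(); date_trunc() handles those.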

Example: Truncate the date to the beginning of the year:


from pyspark.sql.functions import trunc
from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("DateTruncation").getOrCreate()

# Sample DataFrame
data = [("2022-07-14",), ("2021-11-09",)]
df = spark.createDataFrame(data, ["date"])

# Truncate the date to the start of the year
df.select(trunc("date", "YYYY").alias("year_start")).show(truncate=False)
        

Output:


+----------+
|year_start|
+----------+
|2022-01-01|
|2021-01-01|
+----------+
        

2. date_trunc() Function

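Definition: The date_trunc() function truncates a timestamp to the specified unit, such as "year", "month", "day", "hour", or "minute". Unlike trunc(), it takes the unit as the first argument and the column second, it supports units finer than a week, and it always returns a timestamp rather than a date.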

Example: Truncate the date to the start of the month:


from pyspark.sql.functions import date_trunc

# Truncate the date to the start of the month
df.select(date_trunc("month", "date").alias("month_start")).show(truncate=False)
        

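Output (date_trunc() returns a timestamp, so the time part is shown):


+-------------------+
|month_start        |
+-------------------+
|2022-07-01 00:00:00|
|2021-11-01 00:00:00|
+-------------------+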
        

3. last_day() Function

Definition: The last_day() function returns the last day of the month for the given date or timestamp.

Example: Get the last day of the month:


from pyspark.sql.functions import last_day

# Get the last day of the month
df.select(last_day("date").alias("last_day_of_month")).show(truncate=False)
        

Output:


+-----------------+
|last_day_of_month|
+-----------------+
|2022-07-31       |
|2021-11-30       |
+-----------------+
        
