Summarizing Data with Aggregate Functions in PySpark: sum(), sum_distinct(), bit_and() | PySpark Tutorial

PySpark Aggregation Functions: sum(), sum_distinct(), bit_and() Explained

This tutorial covers three PySpark aggregate functions, sum(), sum_distinct(), and bit_and(), each of which reduces a column to a single value. Every function is shown with an example run against the same small DataFrame.

📌 Sample DataFrame

+------+------+
| name | sales|
+------+------+
| Aamir|  500 |
| Sara |  300 |
| John |  300 |
| Lina |  200 |
| Aamir|  550 |
| Sara |  650 |
| John |  800 |
| Lina |  250 |
+------+------+
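
The snippets that follow assume a DataFrame named df already exists. Here is a minimal sketch of one way to build it from the table above. The names SAMPLE_ROWS and build_sample_df, and the application name, are illustrative assumptions rather than part of the original post. Calling the function requires PySpark and a Java runtime.

```python
# Rows copied from the sample table above: (name, sales)
SAMPLE_ROWS = [
    ("Aamir", 500), ("Sara", 300), ("John", 300), ("Lina", 200),
    ("Aamir", 550), ("Sara", 650), ("John", 800), ("Lina", 250),
]


def build_sample_df(app_name="AggregateFunctionsDemo"):
    """Create the sample DataFrame (hypothetical helper for this tutorial)."""
    # Imported inside the function so SAMPLE_ROWS stays usable without Spark installed
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName(app_name).getOrCreate()
    # Column names match the table header: name, sales
    return spark.createDataFrame(SAMPLE_ROWS, ["name", "sales"])
```

With PySpark available, df = build_sample_df() followed by df.show() should print the table shown above.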

➕ sum()

Description: Returns the sum of all values in a column. Null values are ignored.

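# Import the module under an alias so Python's built-in sum() is not shadowed
from pyspark.sql import functions as F

# Aggregate the whole DataFrame into a single row holding the column total
df_sum = df.agg(F.sum("sales").alias("total_sales"))
df_sum.show()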

Output:

+------------+
| total_sales|
+------------+
|       3550 |
+------------+

📊 sum_distinct()

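Description: Returns the sum of the distinct values in a column, so a value that appears several times is counted once. Available since Spark 3.2; older releases use sumDistinct().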

from pyspark.sql.functions import sum_distinct

df_sum_distinct = df.agg(sum_distinct("sales").alias("total_distinct_sales"))
df_sum_distinct.show()

Output:

+---------------------+
| total_distinct_sales|
+---------------------+
|                3250 |
+---------------------+
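
To see why the distinct total is 300 lower than the plain total, the same arithmetic can be reproduced in plain Python without Spark. This is only a sketch of the calculation, not how Spark computes it internally. The value 300 appears twice in the sample data, and a set keeps it once.

```python
# Sales values copied from the sample DataFrame
sales = [500, 300, 300, 200, 550, 650, 800, 250]

total = sum(sales)                # every row counted
distinct_total = sum(set(sales))  # duplicates collapsed before summing

print(total)           # 3550
print(distinct_total)  # 3250
```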

⚙️ bit_and()

Description: Computes the bitwise AND of all non-null input values in a column. A bit is set in the result only if it is set in every value. Requires Spark 3.5 or later.

from pyspark.sql.functions import bit_and

df_bit_and = df.agg(bit_and("sales").alias("bitwise_and_sales"))
df_bit_and.show()

Output:

+--------------------+
| bitwise_and_sales  |
+--------------------+
|                  0 |
+--------------------+
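
The result is 0 because no single bit position is set in all eight sales values. A plain-Python sketch of the same reduction (an illustration of the arithmetic, not of Spark's implementation) prints each value in binary and then folds them together with the & operator:

```python
from functools import reduce
from operator import and_

# Sales values copied from the sample DataFrame
sales = [500, 300, 300, 200, 550, 650, 800, 250]

# Show each value as a 10-bit binary string to compare bit positions
for value in sales:
    print(f"{value:>4} = {value:010b}")

# Fold the list with bitwise AND: ((500 & 300) & 300) & 200 ...
result = reduce(and_, sales)
print(result)  # 0
```

For example, 500 & 300 is 292, and 292 & 200 is already 0, so every later step stays at 0.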
