🔍 PySpark Aggregation Functions: sum(), sum_distinct(), bit_and()
In this tutorial, we will explore key PySpark aggregation functions that help summarize your data efficiently. We’ll go over sum(), sum_distinct(), and bit_and() with practical examples using a DataFrame.
📌 Sample DataFrame
+------+------+
| name | sales|
+------+------+
| Aamir| 500 |
| Sara | 300 |
| John | 300 |
| Lina | 200 |
| Aamir| 550 |
| Sara | 650 |
| John | 800 |
| Lina | 250 |
+------+------+
➕ sum()
Description: Calculates the total sum of all values in a column.
# Note: this import shadows Python's built-in sum() in the current scope
from pyspark.sql.functions import sum
df_sum = df.agg(sum("sales").alias("total_sales"))
df_sum.show()
Output:
+------------+
| total_sales|
+------------+
| 3550 |
+------------+
📊 sum_distinct()
Description: Calculates the sum of only the distinct (unique) values in a column, so duplicates are counted once. Available as sum_distinct() since PySpark 3.2 (older versions use sumDistinct()).
from pyspark.sql.functions import sum_distinct
df_sum_distinct = df.agg(sum_distinct("sales").alias("total_distinct_sales"))
df_sum_distinct.show()
Output:
+---------------------+
| total_distinct_sales|
+---------------------+
| 3250 |
+---------------------+
⚙️ bit_and()
Description: Computes the bitwise AND of all non-null input values in a column. Note: bit_and() requires PySpark 3.5 or later.
from pyspark.sql.functions import bit_and
df_bit_and = df.agg(bit_and("sales").alias("bitwise_and_sales"))
df_bit_and.show()
Output:
+--------------------+
| bitwise_and_sales |
+--------------------+
| 0 |
+--------------------+
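The zero result is easy to verify in plain Python: 500 & 300 = 292, and 292 & 200 = 0 because those two numbers share no set bits, and once the running AND reaches 0 it stays 0. A quick sanity check outside Spark:

```python
from functools import reduce
from operator import and_

# Same sales values as the sample DataFrame
sales = [500, 300, 300, 200, 550, 650, 800, 250]

# Fold bitwise AND across all values, as bit_and() does
result = reduce(and_, sales)
print(result)  # 0
```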