How to Use the count() Function in PySpark
The count() function in PySpark is an action that returns the number of rows in a DataFrame. In this tutorial, you'll learn how to use count(), distinct().count(), and groupBy().count(), with examples and expected outputs.
1. Import SparkSession and Create a Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySparkCountFunction").getOrCreate()
2. Create Sample Data
data = [
("Amir Shahzad", "Engineering", 5000),
("Ali", "Sales", 4000),
("Raza", "Marketing", 3500),
("Amir Shahzad", "Engineering", 5000),
("Ali", "Sales", 4000)
]
3. Define the Schema (Column Names)
columns = ["Name", "Department", "Salary"]
4. Create a DataFrame
df = spark.createDataFrame(data, schema=columns)
5. Show the DataFrame
df.show()
Expected Output
+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Ali|      Sales|  4000|
|        Raza|  Marketing|  3500|
|Amir Shahzad|Engineering|  5000|
|         Ali|      Sales|  4000|
+------------+-----------+------+
6. count() - Total Number of Rows (Including Duplicates)
total_rows = df.count()
print("Total number of rows:", total_rows)
Expected Output
Total number of rows: 5
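count() also composes with transformations, so you can count only the rows that match a condition. As a quick sketch beyond the original steps (the variable name here is illustrative):

# Count only rows where Salary is greater than 4000
high_salary_rows = df.filter(df.Salary > 4000).count()
print("Rows with Salary > 4000:", high_salary_rows)

Expected Output

Rows with Salary > 4000: 2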
7. distinct().count() - Counts Unique Rows
distinct_rows = df.distinct().count()
print("Number of distinct rows:", distinct_rows)
Expected Output
Number of distinct rows: 3
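distinct() compares entire rows. If you only need the number of unique values in a single column, a common pattern (shown here as a sketch, not part of the original steps) is to select that column first, or to use the countDistinct aggregate from pyspark.sql.functions:

from pyspark.sql.functions import countDistinct

# Count unique values in the Name column only
unique_names = df.select("Name").distinct().count()
print("Number of unique names:", unique_names)

# Equivalent aggregate form
df.select(countDistinct("Name").alias("unique_names")).show()

Expected Output

Number of unique names: 3
+------------+
|unique_names|
+------------+
|           3|
+------------+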
8. groupBy() + count() - Count Occurrences of Each Name
df.groupBy("Name").count().show()
Expected Output
+------------+-----+
|        Name|count|
+------------+-----+
|        Raza|    1|
|         Ali|    2|
|Amir Shahzad|    2|
+------------+-----+
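The result of groupBy().count() is itself a DataFrame, so you can keep chaining transformations. As a sketch beyond the original steps, here the counts are sorted from most to least frequent, with Name as a tiebreaker so the output order is deterministic:

from pyspark.sql.functions import desc

# Sort the grouped counts in descending order, breaking ties by Name
df.groupBy("Name").count().orderBy(desc("count"), "Name").show()

Expected Output

+------------+-----+
|        Name|count|
+------------+-----+
|         Ali|    2|
|Amir Shahzad|    2|
|        Raza|    1|
+------------+-----+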
Conclusion
In this tutorial, you learned how to use count() in PySpark to get the total number of rows, how to count unique rows with distinct().count(), and how to count occurrences per group with groupBy().count().
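As a final housekeeping step (not part of the original steps, but good practice in a standalone script), stop the Spark session once you are done:

spark.stop()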