How to Use the count() Function in PySpark
The count() function in PySpark is an action that returns the number of rows in a DataFrame. In this tutorial, you'll learn how to use count(), distinct().count(), and groupBy().count(), with examples and expected outputs.
1. Import SparkSession and Create a Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySparkCountFunction").getOrCreate()
2. Create Sample Data
data = [
("Amir Shahzad", "Engineering", 5000),
("Ali", "Sales", 4000),
("Raza", "Marketing", 3500),
("Amir Shahzad", "Engineering", 5000),
("Ali", "Sales", 4000)
]
3. Define the Schema (Column Names)
columns = ["Name", "Department", "Salary"]
4. Create a DataFrame
df = spark.createDataFrame(data, schema=columns)
5. Show the DataFrame
df.show()
Expected Output
+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Ali|      Sales|  4000|
|        Raza|  Marketing|  3500|
|Amir Shahzad|Engineering|  5000|
|         Ali|      Sales|  4000|
+------------+-----------+------+
6. count() - Total Number of Rows (Including Duplicates)
total_rows = df.count()
print("Total number of rows:", total_rows)
Expected Output
Total number of rows: 5
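count() also composes with transformations, so you can count only the rows that match a condition. As a quick sketch beyond the original steps (the variable name here is illustrative):

# Count only rows where Salary is greater than 4000
high_salary_rows = df.filter(df.Salary > 4000).count()
print("Rows with Salary > 4000:", high_salary_rows)

Expected Output

Rows with Salary > 4000: 2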
7. distinct().count() - Counts Unique Rows
distinct_rows = df.distinct().count()
print("Number of distinct rows:", distinct_rows)
Expected Output
Number of distinct rows: 3
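distinct() compares entire rows. If you only need the number of unique values in a single column, a common pattern (shown here as a sketch, not part of the original steps) is to select that column first, or to use the countDistinct aggregate from pyspark.sql.functions:

from pyspark.sql.functions import countDistinct

# Count unique values in the Name column only
unique_names = df.select("Name").distinct().count()
print("Number of unique names:", unique_names)

# Equivalent aggregate form
df.select(countDistinct("Name").alias("unique_names")).show()

Expected Output

Number of unique names: 3
+------------+
|unique_names|
+------------+
|           3|
+------------+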
8. groupBy() + count() - Count Occurrences of Each Name
df.groupBy("Name").count().show()
Expected Output
+------------+-----+
|        Name|count|
+------------+-----+
|        Raza|    1|
|         Ali|    2|
|Amir Shahzad|    2|
+------------+-----+
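The result of groupBy().count() is itself a DataFrame, so you can keep chaining transformations. As a sketch beyond the original steps, here the counts are sorted from most to least frequent, with Name as a tiebreaker so the output order is deterministic:

from pyspark.sql.functions import desc

# Sort the grouped counts in descending order, breaking ties by Name
df.groupBy("Name").count().orderBy(desc("count"), "Name").show()

Expected Output

+------------+-----+
|        Name|count|
+------------+-----+
|         Ali|    2|
|Amir Shahzad|    2|
|        Raza|    1|
+------------+-----+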
Conclusion
In this tutorial, you learned how to use count() in PySpark to get the total number of rows, how to count unique rows with distinct().count(), and how to count occurrences per group with groupBy().count().
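As a final housekeeping step (not part of the original steps, but good practice in a standalone script), stop the Spark session once you are done:

spark.stop()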