PySpark Array Functions | array(), array_contains(), sort_array(), array_size() Explained with Examples
Introduction to PySpark Array Functions
In this tutorial, we will explore several PySpark array functions that help with manipulating array columns. We will cover:
- array(): Creates an array from columns in a DataFrame.
- array_contains(): Checks if an array contains a specific element.
- sort_array(): Sorts an array in ascending or descending order.
- array_size(): Returns the size of the array.
Example 1: Creating Arrays with array()
Definition: The array() function creates an array column from one or more columns in a DataFrame.
from pyspark.sql import SparkSession
from pyspark.sql.functions import array
# Initialize Spark session
spark = SparkSession.builder.appName("PySpark Array Functions").getOrCreate()
# Sample data
data = [("Aamir", 25), ("Sara", 30), ("John", 22)]
df = spark.createDataFrame(data, ["name", "age"])
# Create an array from columns
df_with_array = df.select("name", array("name", "age").alias("name_age_array"))
df_with_array.show()
Output:
+-----+--------------+
| name|name_age_array|
+-----+--------------+
|Aamir|   [Aamir, 25]|
| Sara|    [Sara, 30]|
| John|    [John, 22]|
+-----+--------------+
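Note: because name is a string and age is a number, array() promotes both values to a common type, so name_age_array ends up as an array of strings. A quick check of the schema of the DataFrame created above makes this visible:
# Inspect the schema: the mixed string/number inputs are promoted to a
# common element type, so name_age_array is reported as an array of strings
df_with_array.printSchema()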
Example 2: Using array_contains() to Check for an Element
Definition: The array_contains() function checks whether an array contains a specific element and returns a Boolean result.
from pyspark.sql.functions import array_contains
# name_age_array is an array of strings (the mixed inputs were promoted to string),
# so we check for the string "30" rather than the integer 30
df_with_check = df_with_array.withColumn("contains_30", array_contains("name_age_array", "30"))
df_with_check.show()
Output:
+-----+--------------+-----------+
| name|name_age_array|contains_30|
+-----+--------------+-----------+
|Aamir|   [Aamir, 25]|      false|
| Sara|    [Sara, 30]|       true|
| John|    [John, 22]|      false|
+-----+--------------+-----------+
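If the array is built from numeric columns only, the elements keep a numeric type and you can search for a number directly. A minimal sketch with hypothetical numeric data, reusing the Spark session from Example 1:
from pyspark.sql.functions import array, array_contains
# Hypothetical numeric-only columns: the array keeps a numeric element type,
# so array_contains can be called with the integer 30 directly
nums_df = spark.createDataFrame([("Aamir", 25, 28), ("Sara", 30, 32)], ["name", "age", "next_age"])
nums_df.withColumn("has_30", array_contains(array("age", "next_age"), 30)).show()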
Example 3: Sorting an Array with sort_array()
Definition: The sort_array() function sorts the elements of an array in ascending (default) or descending order.
from pyspark.sql.functions import sort_array
# Wrap the age column in an array and sort it (withColumn already names the result)
df_sorted = df.withColumn("sorted_ages", sort_array(array("age")))
df_sorted.show()
Output:
+-----+---+-----------+
| name|age|sorted_ages|
+-----+---+-----------+
|Aamir| 25|       [25]|
| Sara| 30|       [30]|
| John| 22|       [22]|
+-----+---+-----------+
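Because each row above holds a single-element array, the sort has nothing visible to do. A short sketch with a hypothetical multi-element scores column (same Spark session) shows ascending and descending sorts more clearly; sort_array() takes an optional asc flag:
from pyspark.sql.functions import sort_array
# Hypothetical data with a multi-element array column
scores_df = spark.createDataFrame(
    [("Aamir", [3, 1, 2]), ("Sara", [9, 7, 8])],
    ["name", "scores"],
)
scores_df.select(
    "name",
    sort_array("scores").alias("scores_asc"),            # ascending (default)
    sort_array("scores", asc=False).alias("scores_desc")  # descending
).show()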
Example 4: Getting Array Size with array_size()
Definition: The array_size() function returns the number of elements in an array.
from pyspark.sql.functions import array_size
# Get the size of the array
df_with_size = df_with_array.withColumn("array_size", array_size("name_age_array"))
df_with_size.show()
Output:
+-----+--------------+----------+
| name|name_age_array|array_size|
+-----+--------------+----------+
|Aamir|   [Aamir, 25]|         2|
| Sara|    [Sara, 30]|         2|
| John|    [John, 22]|         2|
+-----+--------------+----------+
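array_size() is a relatively recent addition to pyspark.sql.functions; on older PySpark releases, the long-standing size() function gives the same element count. A minimal sketch, assuming the df_with_array DataFrame from Example 1:
from pyspark.sql.functions import size
# size() is the older equivalent and also returns the number of elements in the array
df_with_array.withColumn("array_size", size("name_age_array")).show()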