PySpark Array Functions: array(), array_contains(), sort_array(), array_size() Explained with Examples | PySpark Tutorial

Introduction to PySpark Array Functions

In this tutorial, we will explore various PySpark Array functions that help with manipulating arrays. We will cover:

  • array(): Creates an array column from one or more columns in a DataFrame.
  • array_contains(): Checks whether an array contains a specific value.
  • sort_array(): Sorts the elements of an array in ascending or descending order.
  • array_size(): Returns the number of elements in an array.

Example 1: Creating Arrays with array()

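Definition: The array() function combines one or more DataFrame columns into a single array column. All elements of an array share one data type, so columns of different types have to be brought to a common type, for instance by casting.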


from pyspark.sql import SparkSession
from pyspark.sql.functions import array

# Initialize Spark session
spark = SparkSession.builder.appName("PySpark Array Functions").getOrCreate()

# Sample data
data = [("Aamir", 25), ("Sara", 30), ("John", 22)]
df = spark.createDataFrame(data, ["name", "age"])

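# Create an array from columns; array elements share one type,
# so age is cast to string explicitly before it is combined with name
df_with_array = df.select("name", array(df["name"], df["age"].cast("string")).alias("name_age_array"))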
df_with_array.show()
        

Output:


+-----+--------------+
| name|name_age_array|
+-----+--------------+
|Aamir|   [Aamir, 25]|
| Sara|    [Sara, 30]|
| John|    [John, 22]|
+-----+--------------+
        

Example 2: Using array_contains() to Check for an Element

Definition: The array_contains() function returns true if an array contains the given value and false otherwise. It returns null when the array itself is null.


from pyspark.sql.functions import array_contains

# name_age_array holds strings, so compare against the string "30"
# (the value's type has to match the array's element type)
df_with_check = df_with_array.withColumn("contains_30", array_contains("name_age_array", "30"))
df_with_check.show()
        

Output:


+-----+--------------+-----------+
| name|name_age_array|contains_30|
+-----+--------------+-----------+
|Aamir|   [Aamir, 25]|      false|
| Sara|    [Sara, 30]|       true|
| John|    [John, 22]|      false|
+-----+--------------+-----------+
        

Example 3: Sorting an Array with sort_array()

Definition: The sort_array() function sorts the elements of an array. The order is ascending by default; pass asc=False for descending order.


from pyspark.sql.functions import sort_array

# Wrap age in a one-element array and sort it
# (a single-element array is unchanged by sorting)
df_sorted = df.withColumn("sorted_ages", sort_array(array("age")))
df_sorted.show()
        

Output:


+-----+---+-----------+
| name|age|sorted_ages|
+-----+---+-----------+
|Aamir| 25|       [25]|
| Sara| 30|       [30]|
| John| 22|       [22]|
+-----+---+-----------+
        

Example 4: Getting Array Size with array_size()

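Definition: The array_size() function returns the number of elements in an array. It is available in pyspark.sql.functions from PySpark 3.5 onward; on earlier versions, size() serves the same purpose.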


from pyspark.sql.functions import array_size

# Get the size of the array
df_with_size = df_with_array.withColumn("array_size", array_size("name_age_array"))
df_with_size.show()
        

Output:


+-----+--------------+----------+
| name|name_age_array|array_size|
+-----+--------------+----------+
|Aamir|   [Aamir, 25]|         2|
| Sara|    [Sara, 30]|         2|
| John|    [John, 22]|         2|
+-----+--------------+----------+
        
