PySpark Set-Like Array Functions: arrays_overlap(), array_union(), flatten(), array_distinct() Explained | PySpark Tutorial

106- PySpark Set-Like Array Functions | arrays_overlap(), array_union(), flatten(), array_distinct() Explained

In this tutorial, we will explore some useful PySpark array functions that deal with set-like operations. Functions like arrays_overlap(), array_union(), flatten(), and array_distinct() are essential for transforming and manipulating array data in a way that resembles set operations.

Introduction to Set-Like Array Functions

PySpark provides array functions that perform set-like operations on array columns: checking whether two arrays share any element, merging two arrays without duplicates, flattening nested arrays, and removing duplicates from an array. These operations resemble those of set theory, but they operate on arrays in PySpark DataFrames, so element order and null values still matter.

1. arrays_overlap()

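Definition: The arrays_overlap() function returns true if two arrays share at least one common non-null element, otherwise false. If the arrays share no element, are both non-empty, and either of them contains a null, the result is null rather than false.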


from pyspark.sql import SparkSession
from pyspark.sql.functions import arrays_overlap

# Sample data
data = [([1, 2, 3], [3, 4, 5]), ([6, 7, 8], [1, 2, 3])]

# Create Spark session
spark = SparkSession.builder.appName("SetLikeFunctions").getOrCreate()

# Create DataFrame
df = spark.createDataFrame(data, ["array1", "array2"])

# Apply arrays_overlap function
df.select(arrays_overlap("array1", "array2").alias("overlap")).show()
        

Output:


+-------+
|overlap|
+-------+
|   true|
|  false|
+-------+
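
arrays_overlap() follows SQL three-valued logic when nulls are involved. The following plain-Python sketch models that rule as the Spark SQL documentation describes it. It is only a mental model, not Spark's implementation: the helper name arrays_overlap_model is hypothetical, Python lists stand in for array columns, and None stands in for a SQL null.

```python
from typing import Optional, Sequence


def arrays_overlap_model(a: Sequence, b: Sequence) -> Optional[bool]:
    """Hypothetical helper that mimics arrays_overlap() on plain Python lists."""
    # Nulls never count as a match, so compare only the non-null elements.
    common = {x for x in a if x is not None} & {y for y in b if y is not None}
    if common:
        return True
    # No match, both arrays non-empty, and a null is present: the answer is unknown.
    if a and b and (None in a or None in b):
        return None
    return False


print(arrays_overlap_model([1, 2, 3], [3, 4, 5]))  # True: 3 is shared
print(arrays_overlap_model([6, 7, 8], [1, 2, 3]))  # False: nothing shared
print(arrays_overlap_model([1, None], [2, 3]))     # None: no match, but a null is present
```

When arrays_overlap() is used in a filter, rows that evaluate to null are dropped along with the false ones, so it can be worth checking array columns for nulls first.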
        

2. array_union()

Definition: The array_union() function returns a new array that contains the union of the two input arrays, removing any duplicates.


from pyspark.sql.functions import array_union

# Sample data
data = [([1, 2, 3], [3, 4, 5]), ([6, 7], [7, 8, 9])]

# Create DataFrame
df = spark.createDataFrame(data, ["array1", "array2"])

# Apply array_union function
df.select(array_union("array1", "array2").alias("union")).show()
        

Output:


+---------------+
|          union|
+---------------+
|[1, 2, 3, 4, 5]|
|   [6, 7, 8, 9]|
+---------------+
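
To make the deduplication concrete, here is a plain-Python sketch of what array_union() produces. It is a mental model under stated assumptions, not Spark's implementation: the helper name array_union_model is hypothetical, and the result order (first occurrence wins, first array before second) matches the sample output above but should be treated as observed behaviour rather than a documented guarantee.

```python
from typing import List, Sequence


def array_union_model(a: Sequence, b: Sequence) -> List:
    """Hypothetical helper that mimics array_union() on plain Python lists."""
    seen = set()
    result = []
    # Walk the first array, then the second, keeping each value the first time it appears.
    for value in list(a) + list(b):
        if value not in seen:
            seen.add(value)
            result.append(value)
    return result


print(array_union_model([1, 2, 3], [3, 4, 5]))  # [1, 2, 3, 4, 5]
print(array_union_model([6, 7], [7, 8, 9]))     # [6, 7, 8, 9]
print(array_union_model([1, 1, 2], [2, 3]))     # [1, 2, 3]
```

The last call shows that duplicates inside a single input array are dropped as well, not only values that appear in both arrays.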
        

3. flatten()

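Definition: The flatten() function takes an array of arrays and returns a single array with all the elements from the inner arrays. Only one level of nesting is removed, so the input column must itself be an array whose elements are arrays.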


from pyspark.sql.functions import array, flatten

# Sample data: two flat array columns per row
data = [([1, 2], [3, 4]), ([5, 6], [7, 8])]

# Create DataFrame
df = spark.createDataFrame(data, ["array1", "array2"])

# flatten() needs an array of arrays, so wrap both columns with array() first
df.select(flatten(array("array1", "array2")).alias("flattened_array")).show()
        

Output:


+---------------+
|flattened_array|
+---------------+
|   [1, 2, 3, 4]|
|   [5, 6, 7, 8]|
+---------------+
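
flatten() removes exactly one level of nesting, which is easiest to see on plain Python lists. The sketch below is a mental model, not Spark's implementation: the helper name flatten_model is hypothetical, nested lists stand in for an array-of-arrays column, and None stands in for a SQL null.

```python
from typing import List, Optional, Sequence


def flatten_model(nested: Sequence[Optional[Sequence]]) -> Optional[List]:
    """Hypothetical helper that mimics flatten() on plain Python lists."""
    # A null inner array makes the whole result null.
    if any(inner is None for inner in nested):
        return None
    # Concatenate the inner arrays; deeper levels are left as they are.
    return [item for inner in nested for item in inner]


print(flatten_model([[1, 2], [3, 4]]))     # [1, 2, 3, 4]
print(flatten_model([[[1], [2]], [[3]]]))  # [[1], [2], [3]]
print(flatten_model([[1, 2], None]))       # None
```

The second call starts with three levels of nesting and ends with two, so a deeply nested column needs one flatten() call per level to become fully flat.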
        

4. array_distinct()

Definition: The array_distinct() function returns an array with duplicate elements removed, keeping one copy of each distinct value.


from pyspark.sql.functions import array_distinct

# Sample data
data = [([1, 2, 2, 3], [3, 4, 4, 5]), ([6, 7, 7, 8], [7, 8, 9, 9])]

# Create DataFrame
df = spark.createDataFrame(data, ["array1", "array2"])

# Apply array_distinct function
df.select(array_distinct("array1").alias("distinct_array")).show()
        

Output:


+--------------+
|distinct_array|
+--------------+
|     [1, 2, 3]|
|     [6, 7, 8]|
+--------------+
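
As a rough plain-Python model, array_distinct() behaves like an order-preserving deduplication. The helper name array_distinct_model below is hypothetical, and the assumption that each value keeps the position of its first occurrence matches the sample output above rather than a documented guarantee.

```python
from typing import List, Sequence


def array_distinct_model(values: Sequence) -> List:
    """Hypothetical helper that mimics array_distinct() on plain Python lists."""
    # A dict preserves insertion order, so each value keeps the position
    # of its first occurrence and later repeats are ignored.
    return list(dict.fromkeys(values))


print(array_distinct_model([1, 2, 2, 3]))           # [1, 2, 3]
print(array_distinct_model([6, 7, 7, 8]))           # [6, 7, 8]
# Deduplicating a concatenation gives the same values as a union of the two arrays.
print(array_distinct_model([1, 2, 3] + [3, 4, 5]))  # [1, 2, 3, 4, 5]
```

The last call hints at how the functions relate: in this model, a union is a concatenation followed by a distinct.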
        

Summary

In this tutorial, we covered four built-in PySpark functions for set-like work on array columns: arrays_overlap() tests whether two arrays share an element, array_union() merges two arrays without duplicates, flatten() removes one level of nesting from an array of arrays, and array_distinct() drops duplicate elements from an array.
