106- PySpark Set-Like Array Functions | arrays_overlap(), array_union(), flatten(), array_distinct() Explained
In this tutorial, we will explore some useful PySpark array functions that perform set-like operations: arrays_overlap(), array_union(), flatten(), and array_distinct(). These functions are essential for transforming and combining array columns in ways that resemble set operations.
Introduction to Set-Like Array Functions
PySpark provides array functions that let us perform set-like operations: checking whether two arrays share any elements, merging two arrays without duplicates, flattening nested arrays, and removing duplicates from a single array. These operations are similar to those used in set theory, but they operate on array columns in PySpark DataFrames.
1. arrays_overlap()
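Definition: The arrays_overlap() function returns true if two arrays share at least one common non-null element, otherwise false. If no common element is found and either array contains a null, the result is null rather than false.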
from pyspark.sql import SparkSession
from pyspark.sql.functions import arrays_overlap
# Sample data
data = [([1, 2, 3], [3, 4, 5]), ([6, 7, 8], [1, 2, 3])]
# Create Spark session
spark = SparkSession.builder.appName("SetLikeFunctions").getOrCreate()
# Create DataFrame
df = spark.createDataFrame(data, ["array1", "array2"])
# Apply arrays_overlap function
df.select(arrays_overlap("array1", "array2").alias("overlap")).show()
Output:
+-------+
|overlap|
+-------+
|   true|
|  false|
+-------+
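To make the true / false / null cases concrete, here is a plain-Python model of what arrays_overlap() computes for a single row. It is only a sketch of the behaviour described in the Spark documentation, not Spark's implementation, and the helper name arrays_overlap_model is hypothetical. Python's None stands in for SQL null.

```python
from typing import Optional, Sequence


def arrays_overlap_model(a: Optional[Sequence], b: Optional[Sequence]) -> Optional[bool]:
    """Hypothetical per-row model of arrays_overlap(); None plays the role of null."""
    if a is None or b is None:
        return None  # a null array gives a null result
    # True as soon as one non-null element of a also appears in b
    if any(x is not None and x in b for x in a):
        return True
    has_null = any(x is None for x in a) or any(x is None for x in b)
    if a and b and has_null:
        return None  # no match found, but a null element might have matched
    return False


print(arrays_overlap_model([1, 2, 3], [3, 4, 5]))  # True
print(arrays_overlap_model([6, 7, 8], [1, 2, 3]))  # False
print(arrays_overlap_model([1, None], [2, 3]))     # None
```

The third call is the case to watch when filtering on this function: a null result is neither true nor false, so such rows are dropped by a filter on the overlap column.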
2. array_union()
Definition: The array_union() function returns a new array that contains the union of the two input arrays, removing any duplicates.
from pyspark.sql.functions import array_union
# Sample data
data = [([1, 2, 3], [3, 4, 5]), ([6, 7], [7, 8, 9])]
# Create DataFrame
df = spark.createDataFrame(data, ["array1", "array2"])
# Apply array_union function
df.select(array_union("array1", "array2").alias("union")).show()
Output:
+---------------+
|          union|
+---------------+
|[1, 2, 3, 4, 5]|
|   [6, 7, 8, 9]|
+---------------+
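The following plain-Python sketch models what array_union() does to one row: it walks the first array, then the second, and keeps each value the first time it appears. The helper name array_union_model is hypothetical, and the element order is an assumption that matches the output shown above; Spark's documentation only promises a union without duplicates.

```python
from typing import List, Optional, Sequence


def array_union_model(a: Optional[Sequence], b: Optional[Sequence]) -> Optional[List]:
    """Hypothetical per-row model of array_union(); None plays the role of null."""
    if a is None or b is None:
        return None  # a null input array gives a null result
    result: List = []
    for x in list(a) + list(b):
        if x not in result:  # keep only the first occurrence of each value
            result.append(x)
    return result


print(array_union_model([1, 2, 3], [3, 4, 5]))  # [1, 2, 3, 4, 5]
print(array_union_model([6, 7], [7, 8, 9]))     # [6, 7, 8, 9]
```

Note that duplicates inside a single input array are also removed, not just values shared between the two arrays.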
3. flatten()
Definition: The flatten() function takes an array of arrays and returns a single array with all the elements from the inner arrays. Only one level of nesting is removed, so the input column must itself be an array of arrays.
from pyspark.sql.functions import flatten
# Sample data: a single column whose values are arrays of arrays
data = [([[1, 2], [3, 4]],), ([[5, 6], [7, 8]],)]
# Create DataFrame
df = spark.createDataFrame(data, ["nested_array"])
# Apply flatten function
df.select(flatten("nested_array").alias("flattened_array")).show()
Output:
+---------------+
|flattened_array|
+---------------+
|   [1, 2, 3, 4]|
|   [5, 6, 7, 8]|
+---------------+
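As a rough illustration of the semantics, the plain-Python sketch below models flatten() for one row. It assumes the behaviour described in the Spark documentation: one level of nesting is removed, and a null inner array makes the whole result null. The helper name flatten_model is hypothetical.

```python
from typing import List, Optional, Sequence


def flatten_model(nested: Optional[Sequence[Optional[Sequence]]]) -> Optional[List]:
    """Hypothetical per-row model of flatten(); None plays the role of null."""
    if nested is None or any(inner is None for inner in nested):
        return None  # a null outer or inner array gives a null result
    # Concatenate the inner arrays; deeper levels are left untouched
    return [x for inner in nested for x in inner]


print(flatten_model([[1, 2], [3, 4]]))      # [1, 2, 3, 4]
print(flatten_model([[[1], [2]], [[3]]]))   # [[1], [2], [3]]
print(flatten_model([None, [4, 5]]))        # None
```

The second call shows why a three-level array needs flatten() applied twice to become fully flat.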
4. array_distinct()
Definition: The array_distinct() function returns an array with all duplicate elements removed.
from pyspark.sql.functions import array_distinct
# Sample data
data = [([1, 2, 2, 3], [3, 4, 4, 5]), ([6, 7, 7, 8], [7, 8, 9, 9])]
# Create DataFrame
df = spark.createDataFrame(data, ["array1", "array2"])
# Apply array_distinct function
df.select(array_distinct("array1").alias("distinct_array")).show()
Output:
+--------------+
|distinct_array|
+--------------+
|     [1, 2, 3]|
|     [6, 7, 8]|
+--------------+
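A plain-Python sketch of the same idea, for one row, is shown below. It keeps the first occurrence of each value and treats repeated nulls as duplicates of each other, which is how we understand Spark to behave. The helper name array_distinct_model is hypothetical. The last call illustrates how this relates to a union: de-duplicating the concatenation of two arrays gives the union of their elements.

```python
from typing import List, Optional, Sequence


def array_distinct_model(arr: Optional[Sequence]) -> Optional[List]:
    """Hypothetical per-row model of array_distinct(); None plays the role of null."""
    if arr is None:
        return None  # a null array stays null
    result: List = []
    for x in arr:
        if x not in result:  # keep only the first occurrence of each value
            result.append(x)
    return result


print(array_distinct_model([1, 2, 2, 3]))           # [1, 2, 3]
print(array_distinct_model([1, None, 1, None]))     # [1, None]
print(array_distinct_model([1, 2, 3] + [3, 4, 5]))  # [1, 2, 3, 4, 5]
```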