Using find_in_set() in PySpark
In this tutorial, you'll learn how to use the find_in_set() function in PySpark to search for a specific string inside a comma-separated list. The function returns the 1-based position of the string when it exactly matches one of the list's elements, or 0 if it is not found. It also returns 0 when the search string itself contains a comma.
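The matching rules can be modeled in a few lines of plain Python. This is only a sketch of the behavior as described in the Spark documentation, not Spark's implementation; the helper name find_in_set_model is hypothetical and is not part of PySpark.

```python
from typing import Optional


def find_in_set_model(needle: Optional[str], str_array: Optional[str]) -> Optional[int]:
    """Plain-Python model of find_in_set semantics (illustration only)."""
    # A null in either argument yields null.
    if needle is None or str_array is None:
        return None
    # A search string that itself contains a comma can never match.
    if "," in needle:
        return 0
    # Elements are compared exactly: no trimming and no substring matching.
    for position, element in enumerate(str_array.split(","), start=1):
        if element == needle:
            return position
    return 0


print(find_in_set_model("apple", "apple,banana,grape"))   # 1
print(find_in_set_model("grape", "apple,banana,grape"))   # 3
print(find_in_set_model("banana", "apple, banana"))       # 0, the element is " banana"
```

Note the last call: because elements are not trimmed, a list written with spaces after the commas will not match unless the search string carries the same spaces.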
📘 Sample Data
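from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each row pairs a word with the string to search; only the first row is a true comma-separated list
data = [
    ("apple", "apple,banana,grape"),
    ("car", "She went to the park. Then she read a book."),
    ("pen", "He is smart. Isn't he? Yes, he is."),
]
columns = ["word", "sentence"]
df = spark.createDataFrame(data, columns)
df.show()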
Output:
+-----+--------------------+
| word|            sentence|
+-----+--------------------+
|apple|  apple,banana,grape|
|  car|She went to the p...|
|  pen|He is smart. Isn'...|
+-----+--------------------+
🔍 Apply find_in_set() to Search Word Index
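# find_in_set is exported by pyspark.sql.functions in Spark 3.5 and later
from pyspark.sql.functions import find_in_set, col

# Position of "word" among the comma-separated elements of "sentence" (0 when absent)
df = df.withColumn("word_index", find_in_set(col("word"), col("sentence")))
df.show(truncate=False)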
Output:
+-----+-------------------------------------------+----------+
|word |sentence                                   |word_index|
+-----+-------------------------------------------+----------+
|apple|apple,banana,grape                         |1         |
|car  |She went to the park. Then she read a book.|0         |
|pen  |He is smart. Isn't he? Yes, he is.         |0         |
+-----+-------------------------------------------+----------+