PySpark Tutorial: Using find_in_set() in PySpark | Search String Position in a Delimited List

In this tutorial, you'll learn how to use the find_in_set() function in PySpark to search for a specific string inside a comma-separated list. The function returns the 1-based position of the first list item that exactly equals the string, or 0 if no item matches. It also returns 0 when the search string itself contains a comma.
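To make the matching rule concrete, here is a minimal plain-Python sketch of the same behavior. It is only a model for reasoning about results, not Spark's implementation: the helper name find_in_set_sketch is made up for this illustration, and null handling is left out.

```python
def find_in_set_sketch(needle, csv_list):
    """Illustrative model of find_in_set semantics (hypothetical helper)."""
    # A needle that contains a comma can never equal a single list item.
    if "," in needle:
        return 0
    # Items are compared exactly: no trimming, and case matters.
    for position, item in enumerate(csv_list.split(","), start=1):
        if item == needle:
            return position
    return 0


print(find_in_set_sketch("banana", "apple,banana,grape"))   # 2
print(find_in_set_sketch("grape", "apple, banana, grape"))  # 0, the item is " grape"
```

The second call shows a common surprise: a space after each comma becomes part of the item, so the word no longer matches.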

📘 Sample Data

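from pyspark.sql import SparkSession

# Reuse the active session (e.g. in a notebook) or start a new one
spark = SparkSession.builder.appName("find_in_set_demo").getOrCreate()

# Each row pairs a search word with the string to search in
data = [
    ("apple", "apple,banana,grape"),
    ("car", "She went to the park. Then she read a book."),
    ("pen", "He is smart. Isn't he? Yes, he is.")
]

columns = ["word", "sentence"]
df = spark.createDataFrame(data, columns)
df.show()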

Output:

+-----+--------------------+
| word|            sentence|
+-----+--------------------+
|apple|  apple,banana,grape|
|  car|She went to the p...|
|  pen|He is smart. Isn'...|
+-----+--------------------+

🔍 Apply find_in_set() to Search Word Index

from pyspark.sql.functions import find_in_set, col

# 1-based position of "word" within the comma-separated "sentence"; 0 if absent
df = df.withColumn("word_index", find_in_set(col("word"), col("sentence")))
df.show(truncate=False)

Output:

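+-----+-------------------------------------------+----------+
|word |sentence                                   |word_index|
+-----+-------------------------------------------+----------+
|apple|apple,banana,grape                         |1         |
|car  |She went to the park. Then she read a book.|0         |
|pen  |He is smart. Isn't he? Yes, he is.         |0         |
+-----+-------------------------------------------+----------+

Only the first row matches. The other two sentences are ordinary prose rather than comma-separated lists, so no item in them equals "car" or "pen" exactly and the result is 0.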
