I'm trying to remove only the words that are purely numeric from my words array, but the function I created is not working correctly. When I try to view the contents of the resulting DataFrame, I get the error shown below.
First, I tokenized my string column into words:
from pyspark.ml.feature import RegexTokenizer

regexTokenizer = RegexTokenizer(
    inputCol="description",
    outputCol="words_withnumber",
    pattern="\\W"
)
data = regexTokenizer.transform(data)
Then I created a function to remove only the numbers:
from pyspark.sql.functions import when, udf
from pyspark.sql.types import BooleanType

def is_digit(value):
    if value:
        return value.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())
Then I call the function:
data = data.withColumn(
    'words_withoutnumber',
    when(~is_digit_udf(data['words_withnumber']), data['words_withnumber'])
)
Error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 14, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
Sample DataFrame:
+-----------+-----------------------------------------------------------+
|categoryid |description                                                |
+-----------+-----------------------------------------------------------+
|      33004|["short","sarja", "40567","detalhe","couro"]               |
|      22033|["multipane","6768686868686867868888","220v","branco"]     |
+-----------+-----------------------------------------------------------+
Expected result:
+-----------+-----------------------------------------------------------+
|categoryid |description                                                |
+-----------+-----------------------------------------------------------+
|      33004|["short","sarja","detalhe","couro"]                        |
|      22033|["multipane","220v","branco"]                              |
+-----------+-----------------------------------------------------------+
Should description be words_withnumber? – Psidom
from pyspark.sql.types import ArrayType – pault
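A likely cause, assuming words_withnumber is the array-of-strings column that RegexTokenizer produces: the UDF is called once per row with the whole list of tokens, and a Python list has no isdigit method, so value.isdigit() fails inside the Python worker. The sketch below shows per-row logic that filters the list instead of testing it as a single value. The name remove_numeric_tokens is hypothetical, and the Spark wiring is left in comments because it assumes the data DataFrame from the question.

```python
def remove_numeric_tokens(words):
    """Return one row's tokens with the purely numeric ones dropped."""
    if words is None:
        return None
    # str.isdigit() is True only when every character is a digit,
    # so "40567" is dropped while "220v" is kept.
    return [w for w in words if not w.isdigit()]


# The two sample rows from the question, as plain Python lists.
rows = [
    ["short", "sarja", "40567", "detalhe", "couro"],
    ["multipane", "6768686868686867868888", "220v", "branco"],
]
cleaned = [remove_numeric_tokens(r) for r in rows]

# Hypothetical Spark wiring (assumes the `data` DataFrame from the question):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import ArrayType, StringType
# remove_numeric_udf = udf(remove_numeric_tokens, ArrayType(StringType()))
# data = data.withColumn(
#     "words_withoutnumber",
#     remove_numeric_udf(data["words_withnumber"]),
# )
```

Note that the return type has to be ArrayType(StringType()) rather than BooleanType(), since the UDF now returns the filtered list itself.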