
I'm trying to remove only the words that are numerical from my words array, but the function I created is not working correctly. When I try to view the information from my dataframe, the following error message appears.

First I converted my string into word tokens:

from pyspark.ml.feature import RegexTokenizer

# Split "description" on non-word characters into an array of tokens
regexTokenizer = RegexTokenizer(
    inputCol="description",
    outputCol="words_withnumber",
    pattern="\\W"
)

data = regexTokenizer.transform(data)

Then I created a function to detect the numeric tokens so they can be removed:

from pyspark.sql.functions import when, udf
from pyspark.sql.types import BooleanType

# Returns True when the token consists only of digits
def is_digit(value):
    if value:
        return value.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())

Call the function:

data = data.withColumn(
    'words_withoutnumber', 
    when(~is_digit_udf(data['words_withnumber']), data['words_withnumber'])
)

Error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 4 times, most recent failure: Lost task 0.3 in stage 5.0 (TID 14, 10.139.64.4, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):

Sample Dataframe

+-----------+-----------------------------------------------------------+
|categoryid |description                                                |
+-----------+-----------------------------------------------------------+
|      33004|["short","sarja", "40567","detalhe","couro"]               | 
|      22033|["multipane","6768686868686867868888","220v","branco"]     | 
+-----------+-----------------------------------------------------------+

Expected result

+-----------+-----------------------------------------------------------+
|categoryid |description                                                |
+-----------+-----------------------------------------------------------+
|      33004|["short","sarja","detalhe","couro"]                        | 
|      22033|["multipane","220v","branco"]                              |
+-----------+-----------------------------------------------------------+
Your udf expects a string, but you are passing an array to it. Also, in your sample dataframe, should description be words_withnumber? – Psidom
You need to iterate over the array and filter out the desired words. What version of spark are you using? – pault
@Psidom, I tried to loop through the array but I got the following error message: "name 'ArrayType' is not defined" – user3661384
@pault spark version 2.3.1. I saw the link and tried filter_udf = udf(lambda row: [x for x in row if is_digit(x)], ArrayType(StringType())), but I receive the error "name 'ArrayType' is not defined" – user3661384
from pyspark.sql.types import ArrayType – pault

2 Answers


With help from @pault, the solution was the following.

from pyspark.sql.functions import when,udf
from pyspark.sql.types import BooleanType

def is_digit(value):
    if value:
        return value.isdigit()
    else:
        return False

is_digit_udf = udf(is_digit, BooleanType())

Call the function:

from pyspark.sql.functions import col
from pyspark.sql.types import ArrayType, StringType

# Keep only the tokens that are not purely numeric
filter_length_udf = udf(lambda row: [x for x in row if not is_digit(x)], ArrayType(StringType()))

data = data.withColumn('words_clean', filter_length_udf(col('words_withnumber')))
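
A quick check of the result (a sketch, assuming the sample data above with the tokenized column named words_withnumber):

data.select('words_withnumber', 'words_clean').show(truncate=False)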

If you want to avoid udf() for performance reasons, and if a comma won't appear in your "description" column, then the Scala solution below will work. The df.withColumn() call should be similar in pyspark.

Note: I also added a third record to show that the solution works when the numbers appear at the start/end of the array. Try it out.

scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]

scala> df.show(false)
+----------+-------------------------------------------------+
|categoryid|description                                      |
+----------+-------------------------------------------------+
|33004     |[short, sarja, 40567, detalhe, couro]            |
|22033     |[multipane, 6768686868686867868888, 220v, branco]|
|33033     |[0123, x220, 220v, 889]                          |
+----------+-------------------------------------------------+


scala> df.withColumn("newc",split(regexp_replace(regexp_replace(regexp_replace(concat_ws(",",'description),"""\b\d+\b""",""),"""^,|,$""",""),",,",","),",")).show(false)
+----------+-------------------------------------------------+------------------------------+
|categoryid|description                                      |newc                          |
+----------+-------------------------------------------------+------------------------------+
|33004     |[short, sarja, 40567, detalhe, couro]            |[short, sarja, detalhe, couro]|
|22033     |[multipane, 6768686868686867868888, 220v, branco]|[multipane, 220v, branco]     |
|33033     |[0123, x220, 220v, 889]                          |[x220, 220v]                  |
+----------+-------------------------------------------------+------------------------------+


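For reference, a pyspark sketch of the same chain (assuming the same df and column names as the Scala example): join the array into a comma-separated string, blank out the all-digit tokens, clean up the stray commas, and split back into an array.

from pyspark.sql.functions import concat_ws, regexp_replace, split, col

df.withColumn(
    "newc",
    split(
        regexp_replace(
            regexp_replace(
                regexp_replace(
                    concat_ws(",", col("description")),  # join array into "a,b,c"
                    r"\b\d+\b", ""),                     # blank out all-digit tokens
                "^,|,$", ""),                            # trim leading/trailing comma
            ",,", ","),                                  # collapse doubled commas
        ",")                                             # split back into an array
).show(truncate=False)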

Spark 2.4 answer

Using spark-sql in version 2.4 onwards, you can use the filter() higher-order function to get the result. The predicate lower(x) != upper(x) keeps only tokens that contain at least one letter, since case conversion leaves digits unchanged.

scala> val df = Seq((33004,Array("short","sarja", "40567","detalhe","couro")), (22033,Array("multipane","6768686868686867868888","220v","branco")), (33033,Array("0123","x220","220v","889"))).toDF("categoryid","description")
df: org.apache.spark.sql.DataFrame = [categoryid: int, description: array<string>]

scala> df.createOrReplaceTempView("tab")

scala> spark.sql(""" select categoryid, filter(description, x -> lower(x)!=upper(x)) fw from tab """).show(false)
+----------+------------------------------+
|categoryid|fw                            |
+----------+------------------------------+
|33004     |[short, sarja, detalhe, couro]|
|22033     |[multipane, 220v, branco]     |
|33033     |[x220, 220v]                  |
+----------+------------------------------+


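In pyspark the same higher-order function can be reached through expr() (a sketch, assuming Spark 2.4+ and the same df as above):

from pyspark.sql.functions import expr

# lower(x) != upper(x) is false for purely numeric tokens, so they are dropped
df.withColumn("fw", expr("filter(description, x -> lower(x) != upper(x))")).show(truncate=False)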