When running the following piece of PySpark code:
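from itertools import chain
from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import ArrayType, StringType

nlp = NLPFunctions()  # custom class, expensive to instantiate

def parse_ingredients(ingredient_lines):
    parsed_ingredients = nlp.getingredients_bulk(ingredient_lines)[0]
    return list(chain.from_iterable(parsed_ingredients))
udf_parse_ingredients = UserDefinedFunction(parse_ingredients, ArrayType(StringType()))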
I get the following error:
_pickle.PicklingError: Could not serialize object: TypeError: can't pickle _thread.lock objects
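The same kind of failure can be reproduced without Spark. The sketch below assumes that NLPFunctions holds a thread lock somewhere in its state (which the error message suggests); `FakeNLP` and `try_pickle` are hypothetical names used only for illustration, and the standard `pickle` module stands in for the serializer Spark uses.

```python
import pickle
import threading


class FakeNLP:
    """Hypothetical stand-in for NLPFunctions: it owns a thread lock."""

    def __init__(self):
        self._lock = threading.Lock()


def try_pickle(obj):
    """Return None if obj pickles cleanly, otherwise the error message."""
    try:
        pickle.dumps(obj)
        return None
    except TypeError as exc:
        return str(exc)


# The lock attribute makes the whole instance unpicklable.
error = try_pickle(FakeNLP())
print(error)
```

The exact wording of the message differs between Python versions, but in each case it is a TypeError that names the lock.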
I imagine this is because PySpark cannot serialize this custom class: the nlp instance is captured by the function's closure and has to be pickled along with it. But how can I avoid the overhead of instantiating this expensive object on every call of the parse_ingredients function?
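One pattern that might fit here is lazy initialization: keep a module-level placeholder and create the object on first use, so that nothing unpicklable is captured when the function is serialized. The sketch below is plain Python under stated assumptions: `ExpensiveNLP` is a hypothetical stand-in for NLPFunctions, its `getingredients_bulk` return shape is guessed from the `[0]` indexing above, and `get_nlp` is a helper name introduced for illustration.

```python
from itertools import chain

_nlp = None  # placeholder; filled in on first use within each process


class ExpensiveNLP:
    """Hypothetical stand-in for NLPFunctions; counts how often it is built."""

    instances = 0

    def __init__(self):
        ExpensiveNLP.instances += 1

    def getingredients_bulk(self, lines):
        # Assumed shape: a tuple/list whose first element is a list of token lists.
        return [[line.split() for line in lines]]


def get_nlp():
    """Create the expensive object once per process and reuse it afterwards."""
    global _nlp
    if _nlp is None:
        _nlp = ExpensiveNLP()
    return _nlp


def parse_ingredients(ingredient_lines):
    parsed_ingredients = get_nlp().getingredients_bulk(ingredient_lines)[0]
    return list(chain.from_iterable(parsed_ingredients))


first = parse_ingredients(["1 cup flour", "2 eggs"])
second = parse_ingredients(["salt"])
print(first, second, ExpensiveNLP.instances)
```

Wrapped in a UDF, this would presumably build the object once per Python worker process rather than once per row, provided the workers are reused and the helper lives in a module that is importable on the executors; I have not verified that behaviour on a cluster.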