I have a dataframe which consists of two columns, one an Int and the other a String:
+-------------+---------------------+
|      user_id|                token|
+-------------+---------------------+
|          419|                 Cake|
|          419|            Chocolate|
|          419|               Cheese|
|          419|                Cream|
|          419|                Bread|
|          419|                Sugar|
|          419|               Butter|
|          419|              Chicken|
|          419|               Baking|
|          419|             Grilling|
+-------------+---------------------+
I need to find the 250 closest tokens in the Word2Vec vocabulary for each token in the "token" column. I attempted to use the findSynonymsArray method in a udf:
def getSyn(w2v: Word2VecModel) = udf { (token: String) => w2v.findSynonymsArray(token, 10) }
However, this udf throws a NullPointerException when used with withColumn. The exception occurs even if the token is hard-coded, and regardless of whether the code is run locally or in cluster mode. I wrapped the body of the udf in a try-catch to catch the NullPointerException, and it is raised on every row.
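For reference, this is roughly how the udf is applied. It is a minimal sketch: the object name SynonymRepro, the helper addSynonyms, the output column name "synonyms", and the parameters df and model are placeholders for illustration, where df stands for the dataframe shown above and model for an already fitted Word2VecModel.

```scala
import org.apache.spark.ml.feature.Word2VecModel
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

object SynonymRepro {
  // Same udf as above: the fitted model is captured in the closure.
  def getSyn(w2v: Word2VecModel) =
    udf { (token: String) => w2v.findSynonymsArray(token, 10) }

  // Hypothetical helper: `df` is the (user_id, token) dataframe,
  // `model` is an already fitted Word2VecModel.
  def addSynonyms(df: DataFrame, model: Word2VecModel): DataFrame =
    df.withColumn("synonyms", getSyn(model)(col("token")))
}
```

Because the transformation is lazy, the exception only surfaces once an action such as show() or count() is called on the result.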
I have queried the dataframe for null values; there are none in either column.
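The null check was along these lines. It is a sketch rather than the exact query: NullCheck and countNullRows are illustrative names, and df again stands for the dataframe shown above.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object NullCheck {
  // Counts rows in which either column is null.
  // `df` is assumed to have the columns user_id and token.
  def countNullRows(df: DataFrame): Long =
    df.filter(col("user_id").isNull || col("token").isNull).count()
}
```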
I also tried extracting the words and vectors from the Word2VecModel with getVectors, running my udf on the words in that dataframe, and doing an inner join with my dataframe. The same exception is raised.
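The getVectors attempt looked approximately like the sketch below. The names VocabularyJoin, synonymsViaVocabulary, and the "synonyms" column are placeholders; the "word" column is the one returned by getVectors.

```scala
import org.apache.spark.ml.feature.Word2VecModel
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

object VocabularyJoin {
  // Same udf as above.
  def getSyn(w2v: Word2VecModel) =
    udf { (token: String) => w2v.findSynonymsArray(token, 10) }

  // Hypothetical helper: `df` is the (user_id, token) dataframe,
  // `model` is an already fitted Word2VecModel.
  def synonymsViaVocabulary(df: DataFrame, model: Word2VecModel): DataFrame = {
    // getVectors returns a dataframe with the columns "word" and "vector".
    val vocabWithSyn = model.getVectors
      .select(col("word"))
      .withColumn("synonyms", getSyn(model)(col("word")))

    // Keep only the tokens that appear in the Word2Vec vocabulary.
    df.join(vocabWithSyn, df("token") === vocabWithSyn("word"), "inner")
  }
}
```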
I would greatly appreciate any help.