Perform NGram on Spark DataFrame

Question

I'm using Spark 2.3.1, and I have a Spark DataFrame like this:

+----------+
|    values|
+----------+
|embodiment|
|   present|
| invention|
|   include|
|   pairing|
|       two|
|  wireless|
|    device|
|   placing|
|     least|
|       one|
|       two|
+----------+

I want to apply the Spark ML NGram feature transformer like this:

bigram = NGram(n=2, inputCol="values", outputCol="bigrams")

bigramDataFrame = bigram.transform(tokenized_df)

The following error occurred on the line bigramDataFrame = bigram.transform(tokenized_df):

pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Input type must be ArrayType(StringType) but got StringType.'

So I changed my code to wrap each value in an array:

df_new = tokenized_df.withColumn("testing", array(tokenized_df["values"]))

bigram = NGram(n=2, inputCol="testing", outputCol="bigrams")

bigramDataFrame = bigram.transform(df_new)

bigramDataFrame.show()

This is the final DataFrame I got:

+----------+------------+-------+
|    values|     testing|bigrams|
+----------+------------+-------+
|embodiment|[embodiment]|     []|
|   present|   [present]|     []|
| invention| [invention]|     []|
|   include|   [include]|     []|
|   pairing|   [pairing]|     []|
|       two|       [two]|     []|
|  wireless|  [wireless]|     []|
|    device|    [device]|     []|
|   placing|   [placing]|     []|
|     least|     [least]|     []|
|       one|       [one]|     []|
|       two|       [two]|     []|
+----------+------------+-------+

Why is my bigrams column empty?

I want the bigrams column to look like this:

+--------------------+
|bigrams             |
+--------------------+
|embodiment present  |
|present invention   |
|invention include   |
|include pairing     |
|pairing two         |
|two wireless        |
|wireless device     |
|device placing      |
|placing least       |
|least one           |
|one two             |
+--------------------+

do you want something like: df.select(F.concat_ws(" ",F.col("values"),F.lead("values").over(Window.orderBy(F.lit(None))))).show() ? — anky
@anky Your suggestion is right. Could you please post it as an answer with an explanation, and also suggest how to concatenate three or four rows? I tried it myself, but it didn't work. Do you have any idea why the PySpark ML n-gram feature is not working? (When I ran the same code in the HDP Sandbox, with the same Spark version, it worked as expected.) By the way, I'm running Spark locally using the spark-submit command. — Achyut Vyas

SD3 · Accepted Answer · 2020-08-08T14:39:15

Your bigrams column is empty because no single row of your 'values' column contains a bi-gram.

If the values in your input DataFrame look like this:

+--------------------------------------------+
|values                                      |
+--------------------------------------------+
|embodiment present invention include pairing|
|two wireless device placing                 |
|least one two                               |
+--------------------------------------------+

Then you can get the output in bi-grams as below:

+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
|values                                      |testing                                           |ngrams                                                                     |
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+
|embodiment present invention include pairing|[embodiment, present, invention, include, pairing]|[embodiment present, present invention, invention include, include pairing]|
|two wireless device placing                 |[two, wireless, device, placing]                  |[two wireless, wireless device, device placing]                            |
|least one two                               |[least, one, two]                                 |[least one, one two]                                                       |
+--------------------------------------------+--------------------------------------------------+---------------------------------------------------------------------------+

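The Scala Spark code to do this is:

import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.split
// Split each sentence into an array of tokens, then build bi-grams per row
val df_new = df.withColumn("testing", split(df("values"), " "))
val ngram = new NGram().setN(2).setInputCol("testing").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(df_new)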

A bi-gram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words.

Your input DataFrame, however, has only one token in each row, so no bi-grams can be formed from any row.
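To make the point concrete, here is a minimal plain-Python sketch of the sliding window that an n-gram transformer applies to each row's token array. It is an illustration of the idea only, not Spark's implementation, and the helper name `ngrams` is made up for this example.

```python
def ngrams(tokens, n=2):
    """Join every run of n adjacent tokens with a space (hypothetical helper)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# One token per row, as in the question: there is no adjacent pair to join.
single = ngrams(["embodiment"])

# Several tokens in one row, as in the three-row example above.
multi = ngrams("least one two".split(" "))

print(single)  # []
print(multi)   # ['least one', 'one two']
```

A row with fewer than n tokens always yields an empty list, which matches the empty bigrams column in the question.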

So, for your question, you can do something like this:

Input: df1
+----------+
|values    |
+----------+
|embodiment|
|present   |
|invention |
|include   |
|pairing   |
|two       |
|wireless  |
|device    |
|placing   |
|least     |
|one       |
|two       |
+----------+

Output: ngramDataFrameInRows
+------------------+
|ngrams            |
+------------------+
|embodiment present|
|present invention |
|invention include |
|include pairing   |
|pairing two       |
|two wireless      |
|wireless device   |
|device placing    |
|placing least     |
|least one         |
|one two           |
+------------------+

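Spark Scala code:

import org.apache.spark.ml.feature.NGram
import org.apache.spark.sql.functions.{col, collect_list, explode}
// Collect every row into a single array so adjacent rows become adjacent tokens
val df_new = df1.agg(collect_list("values").alias("testing"))
val ngram = new NGram().setN(2).setInputCol("testing").setOutputCol("ngrams")
val ngramDataFrame = ngram.transform(df_new)
// Turn the array of bi-grams back into one row per bi-gram
val ngramDataFrameInRows = ngramDataFrame.select(explode(col("ngrams")).alias("ngrams"))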
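
The collect, transform, explode flow can also be sketched in plain Python to show what each step contributes. This is only a model of the Scala code above, not PySpark code: the names `rows`, `collected`, `grams`, and `exploded` are invented for the illustration. A Python list keeps its order, whereas `collect_list` on a DataFrame spread over several partitions does not promise one, so the real job may need an explicit ordering column.

```python
# Rows of the question's single-column input, in order (stand-in for df1).
rows = [{"values": w} for w in
        "embodiment present invention include pairing two "
        "wireless device placing least one two".split()]

# Step 1: gather all rows into one token list (the role of collect_list).
collected = [r["values"] for r in rows]

# Step 2: slide a window of size n over the list (the role of NGram).
n = 2
grams = [" ".join(collected[i:i + n]) for i in range(len(collected) - n + 1)]

# Step 3: emit one output row per n-gram (the role of explode).
exploded = [{"ngrams": g} for g in grams]

print(len(rows), len(exploded))  # 12 11
print(exploded[0], exploded[-1])
```

Setting `n = 3` in the sketch joins three consecutive rows at a time, which corresponds to calling setN(3) on the NGram transformer.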
