I'm using Spark 2.3.1 and have a Spark DataFrame like this:
+----------+
| values|
+----------+
|embodiment|
| present|
| invention|
| include|
| pairing|
| two|
| wireless|
| device|
| placing|
| least|
| one|
| two|
+----------+
I want to apply the Spark ML NGram feature transformer to it, like this:
from pyspark.ml.feature import NGram

bigram = NGram(n=2, inputCol="values", outputCol="bigrams")
bigramDataFrame = bigram.transform(tokenized_df)
The following error occurred on the line bigramDataFrame = bigram.transform(tokenized_df):
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Input type must be ArrayType(StringType) but got StringType.'
So I changed my code to wrap each value in an array first:
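from pyspark.sql.functions import array

# Wrap each string in a one-element array so the column is ArrayType(StringType)
df_new = tokenized_df.withColumn("testing", array(tokenized_df["values"]))
# NGram reads the array column "testing" (the string column "values" would raise the same error)
bigram = NGram(n=2, inputCol="testing", outputCol="bigrams")
bigramDataFrame = bigram.transform(df_new)
bigramDataFrame.show()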
This is the final DataFrame I got:
+----------+------------+-------+
| values| testing|bigrams|
+----------+------------+-------+
|embodiment|[embodiment]| []|
| present| [present]| []|
| invention| [invention]| []|
| include| [include]| []|
| pairing| [pairing]| []|
| two| [two]| []|
| wireless| [wireless]| []|
| device| [device]| []|
| placing| [placing]| []|
| least| [least]| []|
| one| [one]| []|
| two| [two]| []|
+----------+------------+-------+
Why is my bigrams column empty?
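To make the symptom concrete, here is a minimal plain-Python sketch of a per-row sliding window. It is not Spark's implementation; the helper name ngrams and the shortened word list are made up for illustration. It assumes only the documented NGram behavior that an input array with fewer than n tokens yields no n-grams, which is the situation when every row holds a one-element array.

```python
def ngrams(tokens, n=2):
    """Join every run of n consecutive tokens with a space (hypothetical helper)."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# One word per row, as in the "testing" column: no row has 2 tokens to pair up.
one_word_rows = [["embodiment"], ["present"], ["invention"]]
per_row = [ngrams(row) for row in one_word_rows]

# All words in a single list: adjacent words can be paired.
all_words = ["embodiment", "present", "invention", "include"]
whole_list = ngrams(all_words)

print(per_row)
print(whole_list)
```

So the window never crosses row boundaries; pairing neighbouring words needs them in the same array, or a function that looks at the next row.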
I want the output for the bigrams column to look like this:
+--------------------+
| bigrams|
+--------------------+
|embodiment present |
|present invention |
|invention include |
|include pairing |
|pairing two |
|two wireless |
|wireless device |
|device placing |
|placing least |
|least one |
|one two |
+--------------------+
df.select(F.concat_ws(" ",F.col("values"),F.lead("values").over(Window.orderBy(F.lit(None))))).show()
? – anky