1 vote

I have spark data frame like following:

 +----------+-------------------------------------------------+  
 |col1      |words                                            |  
 +----------+-------------------------------------------------+  
 |An        |[An, attractive, ,, thin, low, profile]          |  
 |attractive|[An, attractive, ,, thin, low, profile]          |  
 |,         |[An, attractive, ,, thin, low, profile]          |  
 |thin      |[An, attractive, ,, thin, low, profile]          |    
 |rail      |[An, attractive, ,, thin, low, profile]          |  
 |profile   |[An, attractive, ,, thin, low, profile]          |  
 |Lighter   |[Lighter, than, metal, ,, Level, ,, and, tes]    |  
 |than      |[Lighter, than, metal, ,, Level, ,, and, tww]    |  
 |steel     |[Lighter, than, metal, ,, Level, ,, and, test]   |  
 |,         |[Lighter, than, metal, ,, Level, ,, and, Test]   |  
 |Level     |[Lighter, than, metal, ,, Level, ,, and, test]   |  
 |,         |[Lighter, than, metal, ,, Level, ,, and, ste]    |  
 |and       |[Lighter, than, metal, ,, Level, ,, and, ste]    |  
 |Test      |[Lighter, than, metal, ,, Level, ,, and, Ste]    |  
 |Renewable |[Renewable, resource]                            |  
 |Resource  |[Renewable, resource]                            |  
 |No        |[No1, Bal, testme, saves, time, and, money]      |  
 +----------+-------------------------------------------------+  

I want to filter the data in the above column case-insensitively. Currently I am doing it like this:

df.filter(array('words, "level")).show(false)

but it is not showing any data. Please help me resolve the issue.

How do you want the output to look? - koiralo
The dataframe from above, keeping the rows that contain "level". - Narendra Mohan Prasad
Care to elaborate on "I want to filter the data from the above column as case insensitive. "? Do you want to search words in words column (that seems to be of array type)? Why not to use col1 instead since it's already available? - Jacek Laskowski
Sure @JacekLaskowski, I will try that. I want to select all the rows whose array-type column contains the given word. - Narendra Mohan Prasad

2 Answers

2
votes

For this you can create a simple UDF that converts both sides to lower case and filters.

Here is a simple example:

scala> import spark.implicits._
import spark.implicits._

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val df = Seq(("An", List("An", "attractive"," ","", "thin", "low", "profile")), ("Lighter", List("Lighter", "than", "metal"," " ,"", "Level"," " ,"", "and", "tes"))).toDF("col1", "words")
df: org.apache.spark.sql.DataFrame = [col1: string, words: array<string>]

scala> val filterUdf = udf((arr: Seq[String]) => arr.map(_.toLowerCase).contains("level".toLowerCase))
filterUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,BooleanType,Some(List(ArrayType(StringType,true))))

scala> df.filter(filterUdf($"words")).show(false)

+-------+-------------------------------------------------+
|col1   |words                                            |
+-------+-------------------------------------------------+
|Lighter|[Lighter, than, metal,  , , Level,  , , and, tes]|
+-------+-------------------------------------------------+
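On Spark 2.4 or later you can avoid the UDF entirely and use the SQL higher-order function exists, which short-circuits and stays inside the optimizer. A minimal self-contained sketch (sample rows are illustrative, trimmed to two rows from the question):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").appName("hof-filter").getOrCreate()
import spark.implicits._

val df = Seq(
  ("An", Seq("An", "attractive", "thin", "low", "profile")),
  ("Lighter", Seq("Lighter", "than", "metal", "Level", "and", "tes"))
).toDF("col1", "words")

// exists(array, lambda) is a built-in higher-order function (Spark 2.4+),
// so no UDF registration or serialization is needed
val result = df.filter(expr("exists(words, w -> lower(w) = 'level')"))
result.show(false)
```

The lambda runs per array element, so the comparison is case-insensitive without materializing a lowercased copy of the whole array.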

Hope this helps!

1 vote

Datasets are often easier to work with than DataFrames, so I suggest you convert your dataframe to a dataset, or just create the Dataset directly from your source data.

Supposing you have a dataset with a case class as

case class data(col1: String, words: Array[String])

For illustration purposes I am creating a temporary dataset as

import sqlContext.implicits._
val ds = Seq(
  data("profile", Array("An", "attractive", "", "", "thin", "low", "profile")),
  data("Lighter", Array("Lighter", "than", "metal", "", "", "Level", "", "", "and", "tes"))
).toDS

which is similar to the dataframe you have:

+-------+-----------------------------------------------+
|col1   |words                                          |
+-------+-----------------------------------------------+
|profile|[An, attractive, , , thin, low, profile]       |
|Lighter|[Lighter, than, metal, , , Level, , , and, tes]|
+-------+-----------------------------------------------+

You can perform RDD-like operations on the dataset and filter in the rows that contain Level as

ds.filter(row => row.words.map(element => element.toLowerCase).contains("level")).show(false)

the result is

+-------+-----------------------------------------------+
|col1   |words                                          |
+-------+-----------------------------------------------+
|Lighter|[Lighter, than, metal, , , Level, , , and, tes]|
+-------+-----------------------------------------------+

Updated

As you are struggling to convert a dataframe to a dataset, here's one way to do it.

Suppose you have a dataframe (df) as

+---+-------------+--------+---+
|age|maritalStatus|name    |sex|
+---+-------------+--------+---+
|35 |M            |Joanna  |F  |
|25 |S            |Isabelle|F  |
|19 |S            |Andy    |M  |
|70 |M            |Robert  |M  |
+---+-------------+--------+---+

Then a case class should be created to match the schema of df as

case class dataset(age: Int, maritalStatus: String, name: String, sex: String)

Then converting the dataframe with as[dataset] should do the trick:

val dataSet : Dataset[dataset] = df.as[dataset]

Then you can proceed as explained in the first part of this answer.
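Putting the two halves together, here is a minimal self-contained sketch that converts an untyped dataframe like the one in the question to a typed Dataset and filters it case-insensitively (the sample rows are illustrative, and the data case class is the same one defined earlier in this answer):

```scala
import org.apache.spark.sql.SparkSession

// same case class as in the first part of the answer
case class data(col1: String, words: Array[String])

val spark = SparkSession.builder().master("local[*]").appName("ds-filter").getOrCreate()
import spark.implicits._

// an untyped DataFrame like the one in the question
val df = Seq(
  ("thin", Seq("An", "attractive", "thin", "low", "profile")),
  ("Level", Seq("Lighter", "than", "metal", "Level", "and", "tes"))
).toDF("col1", "words")

// df.as[data] gives a typed Dataset, so the filter lambda
// works on plain Scala fields instead of Row accessors
val ds = df.as[data]
val filtered = ds.filter(row => row.words.exists(_.equalsIgnoreCase("level")))
filtered.show(false)
```

equalsIgnoreCase avoids building a lowercased copy of each element, and exists stops at the first match.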

I hope the answer is helpful