4
votes

I am new to Spark and Scala coming from R background.After a few transformations of RDD, I get a RDD of type

Description: RDD[(String, Int)]

Now I want to apply a Regular expression on the String RDD and extract substrings from the String and add just substring in a new coloumn.

Input Data :

BMW 1er Model,278
MINI Cooper Model,248

Output I am looking for :

   Input                  |  Brand   | Series      
BMW 1er Model,278,          BMW ,        1er        
MINI Cooper Model ,248      MINI ,      Cooper

where Brand and Series are newly calculated substrings from String RDD

What I have done so far.

I could achieve this for a String using regular expression, but I cani apply fro all lines.

 val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r //to look for BMW or MINI

Then I can use

brandRegEx.findFirstIn("hello this mini is bmW testing")

But how can I use it for all the lines of RDD and to apply different regular expression to achieve the output as above.

I read about this code snippet, but not sure how to put it altogether.

val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r

def getBrand(Col4: String) : String = Col4 match {
    case brandRegEx(str)  =>  
    case _ => ""
    return 'substring
}

Any help would be appreciated !

Thanks

2
Could elaborate what is exactly the question here? Regular expressions, pattern matching or RDD transformations?zero323
@zero323 : Sorry if the question is not readable. Basically I want to do Pattern matching in a string using regular expressions. This is in Spark-scala RDD.user3560220

2 Answers

6
votes

To apply your regex to each item in the RDD, you should use the RDD map function, which transforms each row in the RDD using some function (in this case, a Partial Function in order to extract to two parts of the tuple which makes up each row):

import org.apache.spark.{SparkContext, SparkConf}

object Example extends App {

  val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("Example"))

  val data = Seq(
    ("BMW 1er Model",278),
    ("MINI Cooper Model",248))

  val dataRDD = sc.parallelize(data)

  val processedRDD = dataRDD.map{
    case (inString, inInt) =>
      val brandRegEx = """^.*[Bb][Mm][Ww]+|.[Mm][Ii][Nn][Ii]+.*$""".r
      val brand = brandRegEx.findFirstIn(inString)
      //val seriesRegEx = ...
      //val series = seriesRegEx.findFirstIn(inString)
      val series = "foo"
      (inString, inInt, brand, series)
  }

  processedRDD.collect().foreach(println)
  sc.stop()
}

Note that I think you have some problems in your regular expression, and you also need a regular expression for finding the series. This code outputs:

(BMW 1er Model,278,BMW,foo)
(MINI Cooper Model,248,NOT FOUND,foo)

But if you correct your regexes for your needs, this is how you can apply them to each row.

1
votes

hi I was just looking for aother question and got this question. The above problem can be done using normal transformations.

val a=sc.parallelize(collection)
a.map{case (x,y)=>(x.split (" ")(0)+" "+x.split(" ")(1))}.collect