1
votes

I'm new to Scala and looking for help parsing a string into a case class:

case class CategaryIds(id1: Long, id2: Long, id3: Long, secIds: Set[Long])

The data looks like this, represented as a Spark RDD:

600045,8114,31679,"{1:2:3:4}"
600034,8114,34526,
600056,8114,31679,"{1:2:3:4}"

I tried the code below, but it throws an ArrayIndexOutOfBoundsException and a NumberFormatException:

val fields = line.split(",").map(_.trim);
CategaryIds(fields(0).toLong,fields(1).toLong,fields(2).toLong,fields(3).replace("{","").replace("}", "").split(":").map(_.toLong).toSet)}

If there is a better way to achieve this, please share it.

2
You get the ArrayIndexOutOfBoundsException when you access fields(3) and there is no element at that index, as in the second element of your RDD. - Ton Torres
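To make the two failures concrete, here is a small sketch that reproduces them on plain strings, without Spark. The object name `SplitDemo` and its value names are made up for illustration. It relies on the fact that `String.split` drops trailing empty strings unless a negative limit is passed, and that removing only the braces leaves the double quotes in place.

```scala
object SplitDemo {
  // Row without secondary ids: split drops the trailing empty string,
  // so only three fields come back and fields(3) is out of bounds.
  val shortRow: Array[String] = "600034,8114,34526,".split(",")

  // A negative limit keeps the trailing empty field.
  val paddedRow: Array[String] = "600034,8114,34526,".split(",", -1)

  // Row with secondary ids: stripping only the braces leaves the quotes,
  // so the first piece still starts with a quote and toLong rejects it.
  val pieces: Array[String] =
    "\"{1:2:3:4}\"".replace("{", "").replace("}", "").split(":")

  def main(args: Array[String]): Unit = {
    println(shortRow.length)   // number of fields without a limit
    println(paddedRow.length)  // number of fields with limit -1
    println(pieces.mkString(" | "))
  }
}
```

So the second row explains the out-of-bounds access, and the quoted rows explain the NumberFormatException.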

2 Answers

0
votes

Something like this works.

val fields = line.split(",").map(_.trim).toSeq

val seq = if (fields.size > 3) fields(3).split("\"{:}".toCharArray).filter(_ != "").map(_.toLong).toSet else Set[Long]()
CategaryIds(fields(0).toLong, fields(1).toLong,fields(2).toLong, seq)

First check that the fourth field exists so you don't get an ArrayIndexOutOfBoundsException, then split it on the separator characters and convert the pieces to Long.
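The same idea can be wrapped in a function so it is easy to try on the sample rows. This is only a sketch: `FieldParser` and `parseLine` are hypothetical names, and `lift` is used here in place of the explicit size check.

```scala
case class CategaryIds(id1: Long, id2: Long, id3: Long, secIds: Set[Long])

object FieldParser {
  // parseLine is an illustrative helper, not part of the original post.
  def parseLine(line: String): CategaryIds = {
    val fields = line.split(",").map(_.trim).toSeq
    // lift(3) yields None instead of throwing when the fourth field is absent
    val secIds: Set[Long] = fields.lift(3) match {
      case Some(raw) =>
        // split on quote, braces and colon, then drop the empty pieces
        raw.split(Array('"', '{', ':', '}')).filter(_.nonEmpty).map(_.toLong).toSet
      case None => Set.empty[Long]
    }
    CategaryIds(fields(0).toLong, fields(1).toLong, fields(2).toLong, secIds)
  }
}
```

Note that this still throws if one of the first three fields is missing or not numeric, so it assumes the rows are otherwise well formed.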

0
votes

Maybe a regex is better suited:

val r = """(\d*),(\d*),(\d*),(?:"\{(.*)\}")?""".r

"""600045,8114,31679,"{1:2:3:4}"""" match {
  case r(a,b,c,d) => println(s"a:$a, b:$b, c:$c, d:$d")
  case _ => println("no match")
}

"""600034,8114,34526,""" match {
  case r(a,b,c,d) => println(s"a:$a, b:$b, c:$c, d:$d")
  case _ => println("no match")
}

Output in the REPL:

r: scala.util.matching.Regex = (\d*),(\d*),(\d*),(?:"\{(.*)\}")?
a:600045, b:8114, c:31679, d:1:2:3:4
a:600034, b:8114, c:34526, d:null

You can then use it like this:

val r = """(\d*),(\d*),(\d*),(?:"\{(.*)\}")?""".r

somelines.map {
  case r(a, b, c, null) =>
    // the optional group did not match, so there are no secondary ids
    CategaryIds(a.toLong, b.toLong, c.toLong, Set.empty[Long])
  case r(a, b, c, d) =>
    CategaryIds(a.toLong, b.toLong, c.toLong, d.split(":").map(_.toLong).toSet)
}
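One caveat with the pattern match above is that a line the regex does not match raises a MatchError. A possible variation, sketched below under my own assumptions, returns an Option so bad rows can be skipped. `RegexParser`, `Row` and `parse` are made-up names, and the pattern uses `\d+` instead of `\d*` so that an empty id field is rejected by the regex rather than failing later in `toLong`.

```scala
import scala.util.Try

case class CategaryIds(id1: Long, id2: Long, id3: Long, secIds: Set[Long])

object RegexParser {
  // Three mandatory numeric ids, then an optional quoted {..} group.
  private val Row = """(\d+),(\d+),(\d+),(?:"\{(.*)\}")?""".r

  // Returns None for any line that does not fit the expected shape.
  def parse(line: String): Option[CategaryIds] = line.trim match {
    case Row(a, b, c, d) =>
      Try {
        // d is null when the optional group did not match
        val secIds: Set[Long] = Option(d) match {
          case Some(s) if s.nonEmpty => s.split(":").map(_.toLong).toSet
          case _                     => Set.empty[Long]
        }
        CategaryIds(a.toLong, b.toLong, c.toLong, secIds)
      }.toOption // non-numeric or overflowing values become None
    case _ => None
  }
}
```

With a collection of lines, something like `lines.flatMap(line => RegexParser.parse(line))` would keep only the rows that parsed. I have not tried this on an actual RDD.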