2
votes

First off, I am completely new to Scala and Spark, although I am a bit familiar with PySpark. I am working with an external JSON file that is pretty huge, and I am not allowed to convert it into a Dataset or DataFrame. I have to perform the operations on a pure RDD.

So I wanted to know how I can get the value of a specific key. I read my JSON file with sc.textFile("information.json"). Normally in Python I would do something like:

import json

x = sc.textFile("information.json").map(lambda x: json.loads(x)) \
        .map(lambda x: (x['name'], x['roll_no'])).collect()

Is there any equivalent of the above code in Scala (extracting the values of specific keys) on an RDD, without converting to a DataFrame or Dataset?

Essentially the same question as Equivalent pyspark's json.loads function for spark-shell, but I'm hoping for a more concrete and noob-friendly answer. Thank you.

JSON data: {"name":"ABC", "roll_no":"12", "Major":"CS"}

Can you give an example of your JSON, please? - SimbaPK
Updated with JSON data - Max
My answer on how to parse JSON with Scala should help you - SimbaPK
Is there any specific reason for not using spark.read.json? Then you don't need to do any custom parsing - abiratsis

2 Answers

2
votes

Option 1: RDD API + json4s lib

One way is to use the json4s library, which Spark already uses internally.

import org.json4s._
import org.json4s.jackson.JsonMethods._

// {"name":"ABC1", "roll_no":"12", "Major":"CS1"}
// {"name":"ABC2", "roll_no":"13", "Major":"CS2"}
// {"name":"ABC3", "roll_no":"14", "Major":"CS3"}
val file_location = "information.json"

val rdd = sc.textFile(file_location)

rdd.map { row =>
  // parse each line into a json4s AST
  val json_row = parse(row)

  // pick out fields with the \ operator; compact() renders them back to JSON,
  // so the extracted strings keep their quotes
  (compact(json_row \ "name"), compact(json_row \ "roll_no"))
}.collect().foreach(println)

// Output
// ("ABC1","12")
// ("ABC2","13")
// ("ABC3","14")

First we parse the row data into json_row, then we access the properties of the row with the \ operator, e.g. json_row \ "name". The final result is a sequence of (name, roll_no) tuples. Note that compact() renders the values back to JSON, which is why they keep their quotes in the output.
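For what it's worth, if you want the field values as plain strings without the surrounding JSON quotes, json4s can also extract them directly. A minimal sketch, assuming the same rdd as above (DefaultFormats is required by extract):

import org.json4s._
import org.json4s.jackson.JsonMethods._

rdd.map { row =>
  // DefaultFormats is needed by extract[...]
  implicit val formats: Formats = DefaultFormats
  val json_row = parse(row)

  // extract[String] returns the raw value, without JSON quotes
  ((json_row \ "name").extract[String], (json_row \ "roll_no").extract[String])
}.collect().foreach(println)

// Output
// (ABC1,12)
// (ABC2,13)
// (ABC3,14)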

Option 2: DataFrame API + get_json_object()

A more straightforward approach is via the DataFrame API combined with the get_json_object() function.

import org.apache.spark.sql.functions.get_json_object

val df = spark.read.text(file_location)

df.select(
  get_json_object($"value","$.name").as("name"),
  get_json_object($"value","$.roll_no").as("roll_no"))
.collect()
.foreach(println)

// [ABC1,12]
// [ABC2,13]
// [ABC3,14]
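As the comment on the question suggests, if a DataFrame is acceptable at all (Option 2 already goes through one), spark.read.json parses the file directly with no custom parsing. A minimal sketch, assuming the same file_location:

val df2 = spark.read.json(file_location)

df2.select("name", "roll_no")
  .collect()
  .foreach(println)

// [ABC1,12]
// [ABC2,13]
// [ABC3,14]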
0
votes

I used to parse JSON in Scala with this kind of method:

/** ---------------------------------------
  * Example of a method to parse simple JSON:
  * {
  *   "fields": [
  *     {
  *       "field1": "value",
  *       "field2": "value",
  *       "field3": "value"
  *     }
  *   ]
  * }
  */

import scala.io.Source
import scala.util.parsing.json._  // note: deprecated in newer Scala versions

case class outputData(field1: String, field2: String, field3: String)

def singleMapJsonParser(JsonDataFile: String): List[outputData] = {

  // read the whole file into a single string
  val JsonData: String = Source.fromFile(JsonDataFile).getLines.mkString

  // parseFull returns Option[Any]; the pattern match casts it to the expected
  // structure (unchecked at runtime due to type erasure), and .get will throw
  // if the file does not parse
  val jsonFormatData = JSON.parseFull(JsonData).map {
    case json: Map[String, List[Map[String, String]]] =>
      json("fields").map(v => outputData(v("field1"), v("field2"), v("field3")))
  }.get

  jsonFormatData
}

Then you just have to call your SparkContext to transform the List[outputData] output into an RDD, as shown below.
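A minimal sketch of that last step, assuming a live SparkContext named sc and a hypothetical file fields.json that matches the structure documented above:

// "fields.json" is a hypothetical path; any file with the documented structure works
val parsedRdd = sc.parallelize(singleMapJsonParser("fields.json"))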