I am new to Spark SQL and am trying to parse and display the data in a JSON file. I do not understand line 2 of my code below. Why are the builder methods different from the ones in the documentation I quote at the end (.appName instead of .setAppName, etc.)? And what does the added portion of line 2, .config("spark.some.config.option", "some-value").getOrCreate(), mean? I would be very grateful if someone could help me understand this.

employee.json

{"name":"John", "age":28}
{"name":"Andrew", "age":36}
{"name":"Clarke", "age":22}
{"name":"Kevin",  "age":42}
{"name":"Richard","age":51}

Code:

 1. import org.apache.spark.sql.SparkSession
 2. val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
 3. import spark.implicits._
 4. val df = spark.read.json("examples/src/main/resources/employee.json")
 5. df.show()

Output:

+---+-------+
|age|   name|
+---+-------+
| 28|   John|
| 36| Andrew|
| 22| Clarke|
| 42|  Kevin|
| 51|Richard|
+---+-------+

Please note:

According to the documentation, Spark properties can be set directly on a SparkConf passed to your SparkContext. SparkConf allows you to configure some of the common properties (e.g. master URL and application name), as well as arbitrary key-value pairs through the set() method, as follows:

val conf = new SparkConf().setMaster("local[2]").setAppName("CountingSheep")

val sc = new SparkContext(conf)

3
Some config is just some configuration option... Don't read it literally. - OneCricketeer

3 Answers

0
votes

Since Spark 2.0.0, SparkSession is the de facto standard entry point if you are using the SQL part of Spark.

In the second line, i.e.
 2. val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()

You can write the same thing using a SparkConf, in the older style that you pasted in your note:

val conf = new SparkConf().setMaster("local[2]").setAppName("CountingSheep")

val sc = new SparkContext(conf)

With SparkSession, you can pass that same conf to the builder instead:

val conf = new SparkConf().setMaster("local[2]").setAppName("CountingSheep")

val spark = SparkSession.builder().config(conf).getOrCreate()

And if you specifically need sc, you can get it from the SparkSession variable itself.

val sc = spark.sparkContext
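
To see the hand-off from SparkConf to SparkSession end to end, here is a minimal sketch. It assumes Spark 2.x or later on the classpath and a fresh JVM with no session already running (in spark-shell, getOrCreate() would hand back the shell's existing session, so the reported app name would differ). The value names are only illustrative.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Old-style configuration object, as in the question's note.
val conf = new SparkConf().setMaster("local[2]").setAppName("CountingSheep")

// Hand the whole SparkConf to the builder instead of creating a SparkContext directly.
val spark = SparkSession.builder().config(conf).getOrCreate()

// The underlying SparkContext is still available when the RDD API is needed.
val sc = spark.sparkContext
val appName = sc.appName
val master = sc.master
val total = sc.parallelize(1 to 5).reduce(_ + _)

println(s"$appName on $master, total = $total")

spark.stop()
```

The settings made with setMaster and setAppName are carried over into the session, so nothing from the older style is lost.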

Now, coming to getOrCreate():

Only one SparkContext can be active per application (JVM), and a SparkSession wraps that context. So getOrCreate() first checks whether a SparkSession (and its SparkContext) already exists. If one does, it returns a reference to it and assigns that to your variable; if not, it creates a new one and returns that instead.
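
The "get" half of getOrCreate() can be checked with a small sketch like the one below. It assumes Spark 2.x or later on the classpath and local mode; the names first and second are just for illustration.

```scala
import org.apache.spark.sql.SparkSession

// The first call creates a session (unless one is already running in this JVM).
val first = SparkSession.builder()
  .master("local[2]")
  .appName("getOrCreate demo")
  .getOrCreate()

// A second call finds the existing session and returns it instead of building a new one.
val second = SparkSession.builder().getOrCreate()

// Reference equality: both variables should point at the same objects.
val sameSession = first eq second
val sameContext = first.sparkContext eq second.sparkContext
println(s"same session: $sameSession, same context: $sameContext")

first.stop()
```

Both checks should come out true, which is why it is safe to call getOrCreate() from several places in the same application.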

P.S : I hope this explanation helps you. :)

0
votes

Your code is for Spark 2.0 or newer, but the documentation you are referring to describes the style used in older versions of Spark.

Version 2.0 introduced SparkSession, which combines SparkContext (for computing RDDs) and SQLContext (for running SQL queries on DataFrames), so the API changed.

In older versions of Spark, SparkConf is used to set the configuration, SparkContext is the entry point for RDDs, and SQLContext is the entry point for DataFrames and Datasets.

val conf = new SparkConf().setMaster("local[2]").setAppName("CountingSheep").set("spark.some.config.option", "some-value")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

In version 2.0 and later, it is done as:

val spark = SparkSession.builder().appName("Spark SQL basic example").config("spark.some.config.option", "some-value").getOrCreate()
val sc = spark.sparkContext
val sqlContext = spark.sqlContext
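
As a rough illustration of SparkSession covering both roles, the sketch below builds an RDD through spark.sparkContext and then queries it with SQL through the same session. It assumes Spark 2.x or later in local mode; the view name employee and the two sample rows are made up for the example (the rows reuse names from the question's data).

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("one entry point")
  .getOrCreate()
import spark.implicits._

// RDD side: reached through the session's SparkContext.
val rdd = spark.sparkContext.parallelize(Seq(("John", 28), ("Andrew", 36)))

// SQL side: the same session turns the RDD into a DataFrame and runs a query on it.
val df = rdd.toDF("name", "age")
df.createOrReplaceTempView("employee")
val olderThan30 = spark.sql("SELECT name FROM employee WHERE age > 30").as[String].collect()

println(olderThan30.mkString(", "))

spark.stop()
```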

Here spark.some.config.option and some-value are placeholders for the key and value of a configuration property, for example Spark configuration properties, YARN properties, and so on. For example:

spark.some.config.option = textinputformat.record.delimiter 
some-value = ~%$
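
To confirm that .config(key, value) really is just a key-value setting, you can read it back from the session at runtime. This is a minimal sketch assuming Spark 2.x or later in local mode; spark.some.other.option is a made-up key that is deliberately never set.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[2]")
  .appName("config demo")
  .config("spark.some.config.option", "some-value")
  .getOrCreate()

// Read the placeholder option back from the session's runtime configuration.
val configured = spark.conf.get("spark.some.config.option")

// A key that was never set (hypothetical name) comes back as None.
val missing = spark.conf.getOption("spark.some.other.option")

println(s"configured = $configured, missing = $missing")

spark.stop()
```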

I hope the answer is helpful

0
votes

I tried to convert a JSON file to a conf file structure using the io.circe library.

My implementation is as follows:

Step: 1

The .sbt file looks like:

organization := "com.sample.json.convert"

name := "convert-json-conf"

version := "1.0.0"

scalaVersion := "2.11.8"

scalacOptions ++= Seq("-feature")
javacOptions ++= Seq("-source", "1.8", "-target", "1.8", "-Xlint")

libraryDependencies ++= Seq(
 "org.apache.spark" %% "spark-core" % "2.3.0" % "provided",
 "io.circe" %% "circe-core" % "0.10.0-M2",
 "io.circe" %% "circe-parser" % "0.10.0-M2",
 "io.circe" %% "circe-generic" % "0.10.0-M2",
 "io.circe" %% "circe-config" % "0.4.1"
)

Step: 2

Sample input JSON file:

{
    "id": "0001",
    "address" : {
        "country": {
            "state":{
                "city":{
                    "locality": "some_place"
                }
            }
        }
    },
    "type": "donut",
    "name": "Cake",
    "ppu": 0.55,
    "batters":
        {
            "batter":
                [
                    { "id": "1001", "type": "Regular" },
                    { "id": "1002", "type": "Chocolate" },
                    { "id": "1003", "type": "Blueberry" },
                    { "id": "1004", "type": "Devil's Food" }
                ]
        },
    "topping":
        [
            { "id": "5001", "type": "None" },
            { "id": "5002", "type": "Glazed" },
            { "id": "5005", "type": "Sugar" },
            { "id": "5007", "type": "Powdered Sugar" },
            { "id": "5006", "type": "Chocolate with Sprinkles" },
            { "id": "5003", "type": "Chocolate" },
            { "id": "5004", "type": "Maple" }
        ]
}

Step: 3

JSON File Reading code:

    sealed trait FileRead [T, R] extends (T => R) with Serializable

    object FileRead {

      implicit object FileReadImpl extends FileRead[String, String] {
        override def apply(inputSource: String): String = {
          val source = scala.io.Source.fromFile(inputSource)
          try source.mkString finally source.close()
        }
      }
    }

Step: 4

JSON to Config file conversion code:

import io.circe.Json
import io.circe.config.{parser, printer}

sealed trait JsonToConfig[T, R] extends (T => R) with Serializable

object JsonToConfig {

  implicit object JsonToConfigImpl extends JsonToConfig[String, String] {
    override def apply(input: String): String = {

      val options = printer.DefaultOptions.setFormatted(false)
      val jsonFormatInput = parser.parse(input).right.get
      val inputJson = Json.fromJsonObject(jsonFormatInput.asObject.get)

      printer.print(inputJson, options)
    }
  }
}
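
Note that .right.get and .asObject.get throw if the input cannot be parsed or is not an object. A possible variation, sketched below, keeps the failure as a value instead. It uses the same circe-config parser and printer calls as the code above; jsonToConfigSafe is a hypothetical helper name, not part of the library, and the two sample inputs are made up.

```scala
import io.circe.config.{parser, printer}

// Hypothetical helper (not part of circe-config): Left carries an error message,
// Right carries the rendered config text.
def jsonToConfigSafe(input: String): Either[String, String] =
  parser.parse(input) match {
    case Right(json) if json.isObject =>
      Right(printer.print(json, printer.DefaultOptions.setFormatted(false)))
    case Right(_) =>
      Left("top-level JSON value is not an object")
    case Left(failure) =>
      Left(s"could not parse input: ${failure.getMessage}")
  }

// Well-formed input is converted; truncated input is reported instead of throwing.
val ok = jsonToConfigSafe("""{"name": "Cake", "ppu": 0.55}""")
val bad = jsonToConfigSafe("""{"name": """)

println(ok)
println(bad)
```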

Step: 5

Execute the program:

object ConversionRun {
  def main(args: Array[String]): Unit = {

    val file = "sample.json"

    import FileRead._
    import JsonToConfig._
    val fileRead = implicitly[FileRead[String, String]]
    val convert = implicitly[JsonToConfig[String, String]]

    val out = convert(fileRead(file))

    println(out)
  }
}

Output:

address{country{state{city{locality="some_place"}}}},batters{batter=[{id="1001",type=Regular},{id="1002",type=Chocolate},{id="1003",type=Blueberry},{id="1004",type="Devil's Food"}]},id="0001",name=Cake,ppu=0.55,topping=[{id="5001",type=None},{id="5002",type=Glazed},{id="5005",type=Sugar},{id="5007",type="Powdered Sugar"},{id="5006",type="Chocolate with Sprinkles"},{id="5003",type=Chocolate},{id="5004",type=Maple}],type=donut
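
One way to sanity-check the result is to load it back with Typesafe Config, the library that circe-config is built on. The sketch below assumes com.typesafe.config is on the classpath (it comes in with circe-config) and uses a shortened copy of the output above rather than the full string.

```scala
import com.typesafe.config.ConfigFactory

// Shortened copy of the converted output shown above.
val rendered =
  """address{country{state{city{locality="some_place"}}}},id="0001",name=Cake,ppu=0.55"""

val config = ConfigFactory.parseString(rendered)

// Nested objects are addressed with dotted paths.
val locality = config.getString("address.country.state.city.locality")
val id = config.getString("id")
val ppu = config.getDouble("ppu")

println(s"locality = $locality, id = $id, ppu = $ppu")
```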