2 votes

I am trying to understand the exact difference between creating a schema implicitly (by reflection) and programmatically, and which method should be used in which scenario.

The information on the Databricks site is not very elaborate or explanatory.

As far as I can see, in the reflection way (implicit RDD to DataFrame conversion) we create a case class and, using a map over the text file, pick the specific columns we need.

In the programmatic style we also load the dataset from a text file (similar to the reflection way).

Then we create a schema string (a plain String): knowing the file, we can specify the columns we need (similar to the case class in the reflection way).

Next we import the Row API, which again maps the specific columns and data types given in the schema string (similar to the case class).

Then we create the DataFrame, and after this everything is the same. So what is the exact difference between these two approaches?
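Roughly, this is how I understand the two paths (a simplified sketch; the file name, column names, and sqlContext are just placeholders, and the usual imports for Row, the types, and sqlContext.implicits._ are assumed):

// Reflection way: a case class picks and types the columns.
case class Record(col1: String, col2: String)
val df1 = sc.textFile("data.txt").map(_.split(",")).map(r => Record(r(0), r(1))).toDF()

// Programmatic way: a schema string plus Row objects.
val schemaString = "col1 col2"
val schema = StructType(schemaString.split(" ").map(f => StructField(f, StringType, true)))
val rowRDD = sc.textFile("data.txt").map(_.split(",")).map(r => Row(r(0), r(1)))
val df2 = sqlContext.createDataFrame(rowRDD, schema)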

http://spark.apache.org/docs/1.5.2/sql-programming-guide.html#inferring-the-schema-using-reflection

http://spark.apache.org/docs/1.5.2/sql-programming-guide.html#programmatically-specifying-the-schema

Please explain.


2 Answers

1 vote

The produced schemas are the same, so from that point of view there is no difference. In both cases you are supplying a schema for your data, but in one case you do it with a case class, while in the other you can use plain collections, since a schema is built as a StructType(Array[StructField]). So it is basically a choice between case classes (or tuples) and collections.

The way I see it, the biggest difference is that case classes have to exist in the code, while programmatically specifying the schema can be done at runtime, so you could, for instance, build a schema based on another DataFrame that you are reading at runtime. As an example, I wrote a generic tool to "nest" data, reading from CSV and transforming a set of prefixed fields into an array of structs. Since the tool is generic and the schema is known only at runtime, I used the programmatic approach.

On the other hand, it is generally easier to code it with reflection, since you don't have to deal with all the StructField objects, whose data types (which come from the Hive metastore) have to be mapped to your Scala types.
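To make the runtime point concrete, here is a rough sketch of building a schema at runtime from another DataFrame (the file names and the extra column are made up for illustration, and it assumes the rows actually line up with the resulting schema):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// The schema comes from data read at runtime, not from a case class in the code.
val template = sqlContext.read.json("examples/templates/record.json")

// Reuse its StructType, here extended with one extra column.
val runtimeSchema = StructType(template.schema.fields :+ StructField("load_date", StringType, true))

// Build Rows that match that schema and create the DataFrame.
val rowRDD = sc.textFile("examples/new_records.txt").map(_.split(",")).map(a => Row.fromSeq(a.toSeq))
val df = sqlContext.createDataFrame(rowRDD, runtimeSchema)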

0 votes

Programmatically Specifying the Schema

When case classes cannot be defined ahead of time (for example, the structure of records is encoded in a string, or a text dataset will be parsed and fields will be projected differently for different users), a DataFrame can be created programmatically with three steps:

1. Create an RDD of Rows from the original RDD;
2. Create the schema represented by a StructType matching the structure of Rows in the RDD created in step 1;
3. Apply the schema to the RDD of Rows via the createDataFrame method provided by SQLContext.

For example:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create an RDD
val people = sc.textFile("examples/src/main/resources/people.txt")

// The schema is encoded in a string
val schemaString = "name age"

// Import Row.
import org.apache.spark.sql.Row

// Import Spark SQL data types
import org.apache.spark.sql.types.{StructType, StructField, StringType}

// Generate the schema based on the string of schema
val schema =
  StructType(
    schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, true)))

// Convert records of the RDD (people) to Rows.
val rowRDD = people.map(_.split(",")).map(p => Row(p(0), p(1).trim))

// Apply the schema to the RDD.
val peopleDataFrame = sqlContext.createDataFrame(rowRDD, schema)

// Register the DataFrame as a table.
peopleDataFrame.registerTempTable("people")
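After registering, the table can be queried with SQL, as in the linked guide (a minimal continuation of the snippet above):

// SQL statements can be run over the registered temporary table.
val results = sqlContext.sql("SELECT name FROM people")

// Columns of a result Row are accessed here by index.
results.map(t => "Name: " + t(0)).collect().foreach(println)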

Inferring the Schema Using Reflection

The Scala interface for Spark SQL supports automatically converting an RDD containing case classes to a DataFrame. The case class defines the schema of the table. The names of the arguments to the case class are read using reflection and become the names of the columns. Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then registered as a table. Tables can be used in subsequent SQL statements.

For example:

// sc is an existing SparkContext.
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame.
import sqlContext.implicits._

// Define the schema using a case class.
// Note: Case classes in Scala 2.10 can support only up to 22 fields. To work around this limit,
// you can use custom classes that implement the Product interface.
case class Person(name: String, age: Int)

// Create an RDD of Person objects and register it as a table.
val people = sc.textFile("examples/src/main/resources/people.txt")
  .map(_.split(","))
  .map(p => Person(p(0), p(1).trim.toInt))
  .toDF()
people.registerTempTable("people")
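Because the case class typed age as an Int, the inferred schema can be used directly in numeric predicates, and result columns can be read by name (again roughly following the linked guide):

// The inferred schema makes age an integer column.
val teenagers = sqlContext.sql("SELECT name, age FROM people WHERE age >= 13 AND age <= 19")

// Columns of a result Row can be accessed by ordinal or by field name.
teenagers.map(t => "Name: " + t.getAs[String]("name")).collect().foreach(println)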