
When integrating Spark and MongoDB, it is possible to provide a sample schema in the form of an object, as described here: https://docs.mongodb.com/spark-connector/master/scala/datasets-and-sql/#sql-declare-schema

As a shortcut, here is sample code showing how one can provide the MongoDB Spark connector with a sample schema:

    import com.mongodb.spark.MongoSpark

    // The case class acts as the explicit schema for the loaded Dataset
    case class Character(name: String, age: Int)
    val explicitDF = MongoSpark.load[Character](sparkSession)
    explicitDF.printSchema()

I have a collection with a constant document structure. I can provide a sample JSON document, but creating a sample object manually would be impossible (30k properties per document, 1.5 MB average size). Is there a way for Spark to infer the schema from that JSON alone and circumvent the MongoDB connector's initial sampling, which is quite expensive?


1 Answer


Spark is able to infer the schema, especially from sources that expose one. For an RDBMS, for instance, it executes a simple query that returns no rows, only the table columns with their types (SELECT * FROM $table WHERE 1=0).
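As an illustration of that metadata-only probing, here is a minimal sketch (the connection details and table name are hypothetical) showing that building a JDBC DataFrame and printing its schema does not scan the table's rows:

    // Hypothetical JDBC source; Spark derives the schema from a metadata
    // query (effectively "WHERE 1=0"), not from reading the rows.
    val jdbcDF = sparkSession.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/mydb")
      .option("dbtable", "my_table")
      .option("user", "user")
      .option("password", "password")
      .load()

    jdbcDF.printSchema()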

For the sampling, it will read all documents unless you specify the configuration option samplingRatio, like this:

    sparkSession.read.option("samplingRatio", 0.1)

With the above, Spark will only read 10% of the data when inferring the schema. You can of course set any value you want, but be careful: if your documents have inconsistent schemas (e.g. 50% have a field called "A" and the others do not), the schema deduced by Spark may be incomplete and you can end up missing some data.
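To come back to your exact case: if you already have one representative document as JSON, a possible approach is to infer the schema once from that file and pass it explicitly to the MongoDB read, so the connector's own sampling is skipped. This is only a sketch; the file path and URI are placeholders, and it assumes the connector honours a schema supplied through DataFrameReader.schema:

    import org.apache.spark.sql.types.StructType

    // Infer the schema once from a single representative JSON document
    // (hypothetical path), instead of letting the connector sample the collection.
    val sampleSchema: StructType = sparkSession.read
      .json("/path/to/sample-document.json")
      .schema

    // Pass the schema explicitly so no schema inference is triggered on MongoDB.
    val df = sparkSession.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .option("uri", "mongodb://localhost/db.collection")
      .schema(sampleSchema)
      .load()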

Some time ago I wrote a post about schema projection if you're interested: http://www.waitingforcode.com/apache-spark-sql/schema-projection/read