
I am using the MongoDB Spark connector (mongo-spark-connector_2.10) to read Mongo documents. My question is about schema inference.

I see that Mongo Spark uses MongoSinglePartitioner to infer the schema, so sampling a big collection (a few million documents) for schema inference is very slow. The default sample size is 1000. Is there a reason why Mongo Spark uses a single partitioner for schema inference instead of multiple partitions? I want to read all fields from a collection, so I am sampling a large number of records to infer the schema. Right now, schema inference over 1 million records takes 20 minutes.

Is there any way I can specify a different partitioner for schema inference to speed it up? Or are there any other approaches to infer the schema from Mongo for big collections?
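
For reference, here is a minimal sketch of how the sample size can be raised through ReadConfig (the database and collection names below are placeholders, and the connection URI is assumed to already be set on the SparkSession):

import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

// "sampleSize" controls how many documents are sampled for schema inference
val readConfig = ReadConfig(
  Map("database" -> "myDb", "collection" -> "myCollection", "sampleSize" -> "100000"),
  Some(ReadConfig(sparkSession)))

// Inference still runs through a single partition; only the sample size changes
val sampledDF = MongoSpark.load(sparkSession, readConfig)
sampledDF.printSchema()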


1 Answer


Are there any other approaches to infer the schema from Mongo for big collections?

Generally, if you have a large collection to load and already know the schema, you should define it explicitly.

You could use a simple case class to define the schema, avoiding the extra sampling queries and speeding up the loading process. For example:

import com.mongodb.spark.MongoSpark

// `type` is a reserved word in Scala, so it needs backticks
case class Creature(name: String, strength: Int, `type`: String)

// Deriving the schema from the case class skips the sampling step
val explicitDF = MongoSpark.load[Creature](sparkSession)
explicitDF.printSchema()
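
If you prefer not to use a case class, a rough alternative sketch is to pass an explicit StructType to the DataFrameReader, which also avoids the sampling-based inference (field names here just mirror the case class above, and the connection URI is assumed to be configured on the SparkSession):

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Adjust the fields to match your documents
val creatureSchema = StructType(Seq(
  StructField("name", StringType),
  StructField("strength", IntegerType),
  StructField("type", StringType)))

// Supplying a schema means the connector does not need to sample the collection
val schemaDF = sparkSession.read
  .format("com.mongodb.spark.sql.DefaultSource")
  .schema(creatureSchema)
  .load()
schemaDF.printSchema()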