I am trying to process a month's worth of website traffic, which is stored in an S3 bucket as JSON (one JSON object per line, i.e. one per traffic hit). The amount of data is big enough that I can't ask Spark to infer the schema (OOM errors). If I specify the schema, it loads fine. The issue is that the fields contained in each JSON object differ, so even if I build a schema from one day's worth of traffic, the monthly schema will be different (more fields), and my Spark job fails. A rough sketch of what I'm doing is below.
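For reference, this is roughly my current approach; the bucket path, field names, and Spark session setup are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("traffic").getOrCreate()

# Schema built from one day's worth of traffic (hypothetical fields)
schema = StructType([
    StructField("timestamp", StringType(), True),
    StructField("url", StringType(), True),
    StructField("user_agent", StringType(), True),
    StructField("bytes", LongType(), True),
])

# Loads fine for the day the schema was built from, but the full month
# contains objects with additional fields and the job fails.
df = spark.read.schema(schema).json("s3://my-bucket/traffic/2017-06-*/")
```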
So I'm curious to understand how others deal with this issue. I could, for example, use a traditional RDD map-reduce job to extract the fields I'm interested in and then load everything into a DataFrame (sketched below). But this is slow and seems a bit self-defeating.
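This is roughly the RDD-based extraction I have in mind; field names and the path are again placeholders:

```python
import json

def extract(line):
    # Parse each raw JSON line and keep only the fields of interest (hypothetical names)
    record = json.loads(line)
    return (record.get("timestamp"), record.get("url"), record.get("bytes"))

rdd = spark.sparkContext.textFile("s3://my-bucket/traffic/2017-06-*/")
rows = rdd.map(extract)

# Build a DataFrame from the extracted tuples
df = spark.createDataFrame(rows, ["timestamp", "url", "bytes"])
```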
I've found a similar question here, but it didn't have any info relevant to my case.
Thanks.