We are investigating whether Hive will allow us to run SQL-like queries over Mongo-style dynamic-schema data as a precursor to our map-reduce jobs.
The data comes in the form of several TiB of BSON files; each file contains JSON "samples". An example sample looks like this:
{
    "_id" : "SomeGUID",
    "SomeScanner" :
    {
        "B64LR" : 22,
        "Version" : 192565886128245
    },
    "Parser" :
    {
        "Size" : 73728,
        "Headers" :
        [
            {
                "VAddr" : 4096,
                "VSize" : 7924,
                . . . etc. . . .
Because the schema is dynamic, only a few of the fields are guaranteed to exist.
We would like to be able to run a query against the input set with a predicate something like
SomeScanner.Parser.Headers.VSize > 9000
Having looked into Hive's table mapping, I'm not sure whether this is doable: how would one map a column that may or may not be there, not to mention that there are about 2k-3k queryable values in a typical sample?
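For what it's worth, the closest I've gotten on paper is to treat each sample as a single raw JSON string and pull fields out with Hive's built-in get_json_object, which returns NULL when a path doesn't exist. This is only a sketch: the table name, the location, and the assumption that each sample can be dumped as one JSON document per line are mine, and I haven't tried it against the actual BSON input.

-- Assumption: samples have been converted to one JSON document per line.
CREATE EXTERNAL TABLE samples_raw (json_doc STRING)
LOCATION '/data/samples';

-- get_json_object returns NULL for a missing path, so samples without the
-- field simply fail the predicate instead of breaking the query.
-- (Only the first Headers element is checked here, for brevity.)
SELECT get_json_object(json_doc, '$._id') AS sample_id
FROM   samples_raw
WHERE  CAST(get_json_object(json_doc, '$.Parser.Headers[0].VSize') AS BIGINT) > 9000;

The obvious downside is that every query re-parses the JSON strings, which seems painful at several TiB.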
Hence, my questions to the Experts:
- Can Hive build a dynamic schema from the data it encounters?
- How can one go about building a Hive table with ~3k columns?
- Is there a better way? (one idea I've been toying with is sketched below)
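On that last question, the idea would be to avoid ~3k named columns entirely: flatten each sample into key/value pairs in an upstream job and keep them in a single MAP column. Again, just a sketch with made-up table and column names, and it assumes the flattening step (not shown) is feasible.

CREATE TABLE samples_kv (
    sample_id STRING,
    attrs     MAP<STRING, STRING>
)
STORED AS ORC;

-- A missing key yields NULL, so the comparison is simply false for samples
-- that don't carry the field.
SELECT sample_id
FROM   samples_kv
WHERE  CAST(attrs['Parser.Headers.0.VSize'] AS BIGINT) > 9000;

I have no feel for how this performs versus thousands of sparse columns, which is partly why I'm asking.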
Appreciated, as always.