
We are investigating whether Hive will allow us to run some SQL-like queries on mongo style dynamic schema as a precursor to our map-reduce jobs.

The data comes in the form of several TiB of BSON files; each of the files contains JSON "samples". An example sample is given as such:

{
    "_id" : "SomeGUID",
    "SomeScanner" :
    {
        "B64LR" : 22,
        "Version" : 192565886128245,
        "Parser" :
        {
            "Size" : 73728,
            "Headers" :
            [
                {
                    "VAddr" : 4096,
                    "VSize" : 7924,
. . . etc. . . .

As a dynamic schema, only a few of the fields are guaranteed to exist.

We would like to be able to run a query against an input set that may be something like

SomeScanner.Parser.Headers.VSize > 9000 

Having looked up the table-mapping documentation, I'm not sure whether this is do-able with Hive . . . how would one map a column that may or may not be there . . . not to mention that there are about 2k-3k query-able values in a typical sample.

Hence, my questions to the Experts:

  • Can Hive build a dynamic schema from the data it encounters?
  • How can one go about building a Hive table with ~3k columns?
  • Is there a better way?

Appreciated, as always.

1 Answer
OK--with much ado, I can now answer my own questions.

  • Can Hive build a dynamic schema from the data it encounters? A: No. However, an excellent tool for this exists; see below.

  • How can one go about building a Hive table with ~3K columns? A: With that same tool; see below.

  • Is there a better way? A: Not that I found; but, with some help, it isn't too difficult.

First, a shout out to Michael Peterson at http://thornydev.blogspot.com/2013/07/querying-json-records-via-hive.html, whose blog post served as the toe-hold to figure all of this out.

Definitely check it out if you're starting out w/Hive.

Now, Hive cannot natively import a JSON document and deduce a schema from it . . . however, Michael Peterson has developed a tool that does: https://github.com/midpeter444/hive-json-schema

Some caveats with it:

  • Empty arrays and structs are not handled, so remove (or populate) them before generating the schema. Otherwise, entries like { "something" : {} } or { "somethingelse": [] } will throw errors.
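Since the schema tool works from a sample document, one option is to pre-clean the sample before feeding it in. The helper below is my own sketch (not part of hive-json-schema) that recursively drops empty objects and arrays:

```python
import json

def strip_empty(node):
    """Recursively drop empty dicts/lists so the schema generator
    does not choke on {} or []."""
    if isinstance(node, dict):
        cleaned = {k: strip_empty(v) for k, v in node.items()}
        return {k: v for k, v in cleaned.items() if v not in ({}, [])}
    if isinstance(node, list):
        cleaned = [strip_empty(v) for v in node]
        return [v for v in cleaned if v not in ({}, [])]
    return node

sample = {"something": {}, "somethingelse": [], "keep": {"Size": 73728}}
print(json.dumps(strip_empty(sample)))
```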

  • If any field has the name "function", it will have to be renamed prior to executing the CREATE TABLE statement. E.g., the following would throw an error: { "someThing": { "thisIsOK": "true", "function": "thatThrowsAnError" } }

Presumably, this is because "function" is a Hive keyword.
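The same pre-cleaning pass can rename colliding keys. A minimal sketch (my own code; the trailing-underscore rename is just my convention, not something the tool does):

```python
import json

def rename_reserved(node, reserved=("function",), suffix="_"):
    """Recursively rename keys that collide with Hive keywords
    by appending a suffix."""
    if isinstance(node, dict):
        return {(k + suffix if k in reserved else k): rename_reserved(v, reserved, suffix)
                for k, v in node.items()}
    if isinstance(node, list):
        return [rename_reserved(v, reserved, suffix) for v in node]
    return node

doc = {"someThing": {"thisIsOK": "true", "function": "thatThrowsAnError"}}
print(json.dumps(rename_reserved(doc)))
```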

  • And, with dynamic schemas in general, I have not found a way to handle a nested field name with a leading underscore, even if the schema is valid: { "somethings": { "_someVal": "123", "otherVal": "456" } } will fail.

  • For the common MongoDB "ID" field, this is map-able with the following addition: with serdeproperties("mapping.id" = "_id"), which appears to be similar to a macro-substitution.

Serialization/De-Serialization for JSON can be achieved with https://github.com/rcongiu/Hive-JSON-Serde by adding the following: ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'

N.B., the JsonSerDe JAR must have been added to .hiverc or "add jar"'d into Hive to be used.
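For example, from the Hive CLI (the jar path and filename here are illustrative -- substitute whatever version you actually built or downloaded):

```sql
-- Make the SerDe visible to this session; a line like this can
-- also live in .hiverc so it runs automatically on startup.
ADD JAR /path/to/json-serde-jar-with-dependencies.jar;
```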
Thus, the schema:

CREATE TABLE samplesJSON 
    ( id string,
       . . . rest of huge schema . . . . )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH serdeproperties("mapping.id" = "_id");
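For a sample like the one in the question, the generated column definitions end up as nested struct/array types along these lines (a hand-written sketch of the kind of DDL hive-json-schema emits, using the field names from the example; the actual types depend on your data):

```sql
CREATE TABLE samplesJSON (
    id string,
    somescanner struct<
        b64lr: int,
        version: bigint,
        parser: struct<
            size: int,
            headers: array<struct<vaddr: int, vsize: int>>
        >
    >
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH serdeproperties("mapping.id" = "_id");
```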

The JSON data can be loaded into the table with a command along the lines of:

LOAD DATA LOCAL INPATH '/tmp/samples.json' OVERWRITE INTO TABLE samplesJSON;

Finally, queries are actually intuitive and straightforward. Using the example from the original question:

hive> select id, somescanner.parser.headers.vaddr from samplesjson;
OK
id    vaddr
119   [4096,53248,57344]
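Note that somescanner.parser.headers is an array of structs, so headers.vaddr comes back as an array (as in the output above). To apply the predicate from the original question (VSize > 9000) per header, the array has to be exploded first; an untested sketch:

```sql
-- LATERAL VIEW explode() turns each element of the headers array
-- into its own row, so the WHERE clause can filter individual headers.
SELECT id, hdr.vaddr, hdr.vsize
FROM samplesjson
LATERAL VIEW explode(somescanner.parser.headers) h AS hdr
WHERE hdr.vsize > 9000;
```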