I have read a lot about data formats for Hadoop, and my current understanding is that, depending on the distribution you use, the most advanced formats are ORC (well supported on Hortonworks) or Parquet (well supported on Cloudera).
Most examples and tutorials for beginners use simple CSV data, with one entry per row. Often they import the CSV into an SQL-like structure (Hive) without saving it as ORC.
JSON also seems to be supported by Hadoop, but it is not as well integrated. Also, according to an overview article, JSON is a bad format because it cannot be split into chunks by lines. JSON Lines does not seem to be supported natively.
My data is movie metadata that looks like this:
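To make sure I understand the splitting argument, here is a minimal Python sketch of how I read it. The sample records and the helper name `parse_chunk` are made up for illustration, and this says nothing about how Hadoop itself computes input splits. It only shows that every line of a JSON Lines file parses on its own, while a fragment of a single JSON array does not.

```python
import json

# Hypothetical sample in JSON Lines form: one complete object per line.
jsonl_text = (
    '{"title": "Movie 1", "rating": 4.3}\n'
    '{"title": "Movie 2", "budget": 10000000}\n'
)

def parse_chunk(chunk):
    """Parse a chunk that starts and ends on a line boundary."""
    return [json.loads(line) for line in chunk.splitlines() if line.strip()]

# Cut the file at a newline: each piece can be parsed independently,
# which is what would let separate workers handle separate chunks.
lines = jsonl_text.splitlines(keepends=True)
chunk_a, chunk_b = lines[0], lines[1]
records = parse_chunk(chunk_a) + parse_chunk(chunk_b)

# The same records stored as one JSON array: a fragment of it is not valid JSON.
array_text = json.dumps(records)
half = array_text[: len(array_text) // 2]
try:
    json.loads(half)
    half_is_valid = True
except json.JSONDecodeError:
    half_is_valid = False

print(records)
print("half of the array parses:", half_is_valid)
```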
{"title": "Movie 1", "rating": 4.3, "tags": ["Romance", "Music"], "actors": ["Leonardo di Caprio"], "source": "example.com"}
{"title": "Movie 2", "cinema_viewers": 10000000, "budget": 10000000, "categories": ["Action"], "role_importance": {"Adam Sandler": 2}, "source": "example.net"}
How should I import my data if it has a JSON Lines structure? Does this depend heavily on the query engine I want to use? So far I have only learned about Hive and Pig. It seems both can be used with or without an HCatalog schema. But I have only used them on simple columnar data without lists (which in SQL would require separate tables linked by foreign keys).
It would also be possible to split the data into multiple different files before importing - emulating the foreign key relationship like in SQL. Or do we always keep tightly coupled data in one single file if possible?
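To make the splitting option concrete, this is roughly what I have in mind, as a minimal Python sketch. The `movie_id` surrogate key, the function names and the choice of columns are my own assumptions, not anything Hive or Pig prescribes. Each record becomes one row in a movies table, and each list entry becomes a row in a child table that points back to it.

```python
import csv
import io
import json

# Two sample records in JSON Lines form (shortened from my data).
sample_lines = [
    '{"title": "Movie 1", "rating": 4.3, "tags": ["Romance", "Music"], "source": "example.com"}',
    '{"title": "Movie 2", "budget": 10000000, "source": "example.net"}',
]

def normalize(lines):
    """Split nested records into a parent table and a child table."""
    movies, tags = [], []
    for movie_id, line in enumerate(lines, start=1):  # movie_id: assumed surrogate key
        rec = json.loads(line)
        movies.append({
            "movie_id": movie_id,
            "title": rec.get("title"),
            "rating": rec.get("rating"),  # None when the field is missing
            "source": rec.get("source"),
        })
        for tag in rec.get("tags", []):
            tags.append({"movie_id": movie_id, "tag": tag})
    return movies, tags

def to_csv(rows, fieldnames):
    """Render rows as CSV text; missing (None) values become empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

movies, tags = normalize(sample_lines)
print(to_csv(movies, ["movie_id", "title", "rating", "source"]))
print(to_csv(tags, ["movie_id", "tag"]))
```

The two CSV outputs could then be imported as separate tables and joined on `movie_id`, at the cost of losing the one-record-per-line locality of the original file.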
My conceptual problem seems to be that I do not understand the whole chain of transformations: the format in which I should store the data in files, which can then be imported using a tabular abstraction, saved as another file (ORC), and then queried with languages from a different domain (Hive's SQL-like language, or Pig), which might in turn get translated to MapReduce or some other intermediate layer (Spark).
Disclaimer: I use Hadoop as the name for the whole data mining environment, including all querying APIs like Hive and Pig, not only for the distributed file system.