I'm using an AWS cluster to process many JSON files with nested data into flattened Hive tables.
Some of the data sets are very big, so I have to split them into chunks and create a separate small Hive table for each chunk, which means these smaller tables end up with different columns.
For example, table1 has columns A1, ..., A1000, and table2 has columns A1, A3, A100, ..., A1000, and A1001 (so some columns are common and some are unique to each table).
Is there a way to append, union, or join those small tables (there could be hundreds of them, with ~1e7 rows each) into one big Hive table that has all the columns from the small tables, with NULLs in the cells for columns that a given small table did not originally have? Something like Total_table with columns A1...A1001?
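To be concrete, for just two hypothetical small tables t1 (columns A1, A2) and t2 (columns A1, A3), the manual version would be something like the sketch below, padding the missing columns with NULL in each SELECT; this is exactly what I want to avoid writing by hand for hundreds of tables with thousands of columns:

from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# Hand-written UNION ALL over two hypothetical tables, filling the
# columns each table lacks with typed NULLs so the schemas line up.
total = spark.sql("""
    SELECT A1, A2, CAST(NULL AS STRING) AS A3 FROM t1
    UNION ALL
    SELECT A1, CAST(NULL AS STRING) AS A2, A3 FROM t2
""")
total.write.mode("overwrite").saveAsTable("Total_table")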
UPD: the printSchema() command would show something like this:
root
|-- aaa: struct (nullable = true)
| |-- ids: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- bb1: string (nullable = true)
| | | |-- bb2: string (nullable = true)
| | | |-- bb3: string (nullable = true)
When I flatten this to get a regular Hive table, the number of elements in the ids array (for example) can differ depending on which JSON files happened to end up in each chunk, so the output columns will differ too: e.g. aaa_ids_bb1_0 (for the 0th element of the ids array), ..., aaa_ids_bb1_999 (for the 999th element, if there happen to be 1000 elements). So the tables for each chunk can be different.
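For context, my flattening step is roughly equivalent to the simplified sketch below (not my actual code; flatten_ids and max_len are just stand-ins, with max_len being the longest ids array seen in the chunk):

from pyspark.sql import functions as F

def flatten_ids(df, max_len):
    # Turn aaa.ids (an array of structs) into positional columns
    # aaa_ids_bb1_0 ... aaa_ids_bb3_<max_len-1>.
    cols = []
    for i in range(max_len):
        for field in ["bb1", "bb2", "bb3"]:
            cols.append(
                F.col("aaa.ids").getItem(i).getField(field)
                 .alias("aaa_ids_{}_{}".format(field, i))
            )
    return df.select(cols)

# max_len depends on the chunk's data, so each chunk's flattened table
# can have a different set of columns.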
In Python's pandas module, if I concatenate three dataframes with columns A,B,C (1st), A,C,D (2nd), and D,E,F (3rd), the resulting dataframe has columns A,B,C,D,E,F. I need to do the same thing, but in Hive SQL, if that's possible of course.
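In other words, the equivalent of:

import pandas as pd

df1 = pd.DataFrame({"A": [1], "B": [2], "C": [3]})
df2 = pd.DataFrame({"A": [4], "C": [5], "D": [6]})
df3 = pd.DataFrame({"D": [7], "E": [8], "F": [9]})

# concat aligns on column names and fills missing cells with NaN,
# so the result has columns A, B, C, D, E, F.
result = pd.concat([df1, df2, df3], ignore_index=True, sort=False)
print(result.columns.tolist())   # ['A', 'B', 'C', 'D', 'E', 'F']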