0
votes

I'm using AWS cluster to process many JSON files with nested data into Hive flattened tables.

Some data sets are very big, so I need to split it into chunks and make separate small Hive table for each chunk, which leads this smaller tables to have different columns. For example, table1 has columns A1....A1000, table2 has columns A1,A3,A100,...A1000, and A1001 (so some columns are common, some are unique for each table).

Is there a way to append, union or join those small tables (could be hundreds of tables, with ~1e7 rows each one) into one big Hive table, which has all the columns from small tables (and NULLs in missing cells for each small table that didn't have that specific column originally)? Like Total_table with columns A1...A1001?

UPD: printSchema() command would show something like this:

root
 |-- aaa: struct (nullable = true)
 |    |-- ids: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- bb1: string (nullable = true)
 |    |    |    |-- bb2: string (nullable = true)
 |    |    |    |-- bb3: string (nullable = true)

When I flatten this to get a regular Hive table, depending on the set of json files that happened to be in each chunk, the number of elements in ids array (for example) can be different, and so output columns will be different too: like aaa_ids_bbb1_0 (for 0st element of the ids array),...,aaa_ids_bbb1_999 (for 999th element, if there happen to be 1000 elements). So tables in each chunk can be different.

In python pandas module, if I concatenate 3 dataframes with columns A,B,C (1st), A,C,D(2nd), and D,E,F(3rd), the resulting dataframe will have columns A,B,C,D,E,F. I need to do the same thing but in Hive SQL, if possible of course.

1
Please provide some short example of JSON and explain why did you create tables with different structureleftjoin
I can't provide json file example, those files are big, but I updated my question with explanation how chunks can have different structures.lugger1

1 Answers

0
votes

1) If your tables are JSONSerDe based then you can re-create tables with the same structure, absent elements will be null.

Simple UNION ALL:

select col1,  from table 
UNION ALL 
    select * from table 2 
    ...

will work fine in this case. Much better put all files in the same table location, it will work fine without UNION.

2) If not possible to apply first (preferable) option, then you need to list all columns in each table, providing nulls for absent columns:

select A1, ... A1000, null A1001, null A1002  from table 
UNION ALL 
select A1, ... A1000, A1001, null A1002 from table 2 
UNION ALL 
...