We are planning to use the dynamic partitioning feature in Hive for one of our projects. I understand that this parameter needs to be set for it to work:
hive.exec.dynamic.partition.mode=nonstrict
In our cluster this is set to strict. We are working on having this changed, but in the meantime we were planning to use this work-around:
- Create a fixed column that will always have the same hard-coded value and use this as the first static column for partitioning
- Use the columns for dynamic partitioning after this static column
This removes the need to change the above parameter. In strict mode Hive just needs at least one static partition column and is happy to partition dynamically on the other columns.
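For concreteness, here is a minimal sketch of the work-around. The table and column names (target_table, source_table, static_col, dynamic_col, a, b, c) are placeholders made up for illustration, not our real schema:

```sql
-- Hypothetical target table: the first partition column is the fixed one,
-- the second is the column we actually want to partition on.
CREATE TABLE target_table (a STRING, b STRING, c STRING)
PARTITIONED BY (static_col STRING, dynamic_col STRING);

-- Dynamic partitioning itself must be enabled (it is by default in recent Hive versions);
-- hive.exec.dynamic.partition.mode is left at strict.
SET hive.exec.dynamic.partition=true;

-- static_col gets a hard-coded value in the PARTITION clause and is not selected;
-- dynamic_col is filled from the last column of the SELECT.
INSERT INTO TABLE target_table
PARTITION (static_col = 'staticValue', dynamic_col)
SELECT a, b, c, dynamic_col
FROM source_table;
```

Note that the static partition column has to come before the dynamic one in the PARTITION clause, which is why the fixed column is declared first.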
I noticed that, as expected, Hive creates an HDFS folder for the static partition and then creates the folders for the dynamic partitions under it. Something like this:
/baseDir/staticColumn=staticValue/dynamicColumn=dynamicValue1
/baseDir/staticColumn=staticValue/dynamicColumn=dynamicValue2
So the solution pushes the actual data one level down in HDFS, which does not seem to be an issue or concern.
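As a rough way to check the resulting layout and how queries see it, something like the following could be used. This assumes a hypothetical table target_table partitioned by (static_col, dynamic_col); all names and values are placeholders:

```sql
-- Lists one entry per partition, including the extra static level.
SHOW PARTITIONS target_table;

-- Queries can filter on the dynamic column alone;
-- the static column only adds a constant level to the path.
SELECT a, b, c
FROM target_table
WHERE dynamic_col = 'dynamicValue1';
```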
My question is: is there any downside to this solution, from a performance or reliability point of view?
set hive.exec.dynamic.partition.mode=nonstrict; insert into table X partition (PTKEY) select A, B, C, PTKEY from Z; (unless your admin explicitly defined the param as "final" in the config file, but I can't see why he/she would do that) -- cf. cwiki.apache.org/confluence/display/Hive/Tutorial - Samson Scharfrichter