0 votes

I am using PutHive3Streaming to load Avro data from NiFi into Hive. As a test, I am sending 10 MB of JSON data to NiFi, converting it to Avro (which reduces the size to 118 KB) and using PutHive3Streaming to write to a managed Hive table. However, I see that the data is not compressed in Hive.

hdfs dfs -du -h -s /user/hive/warehouse/my_table*
32.1 M  /user/hive/warehouse/my_table  (<-- replication factor 3)
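
One way to check which compression the ORC writer actually applied is the ORC file dump, which prints a Compression: line in its output (the delta file path below is just illustrative):

hive --orcfiledump /user/hive/warehouse/my_table/delta_0000001_0000001/bucket_00000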

At the table level, I have:

STORED AS ORC
  TBLPROPERTIES (
    'orc.compress'='ZLIB',
    'orc.compression.strategy'='SPEED',
    'orc.create.index'='true',
    'orc.encoding.strategy'='SPEED',
    'transactional'='true');
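
For context, the full table definition is shaped roughly like this (column and partition names are placeholders):

CREATE TABLE my_table (
  id STRING,
  payload STRING
)
PARTITIONED BY (load_date STRING)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'orc.compression.strategy'='SPEED',
  'orc.create.index'='true',
  'orc.encoding.strategy'='SPEED',
  'transactional'='true');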

and I have also enabled:

hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
hive.optimize.sort.dynamic.partition=true
avro.output.codec=zlib
hive.exec.compress.intermediate=true
hive.exec.compress.output=true
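
For reference, the same properties can also be set per session (e.g. in Beeline):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.optimize.sort.dynamic.partition=true;
SET hive.exec.compress.intermediate=true;
SET hive.exec.compress.output=true;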

Despite all this, compression does not appear to be applied in Hive. Any pointers on how to enable it?

Are those properties set in the hive-site.xml on the NiFi node(s) and are you pointing to it in the Hive Configuration Resources property? If those are server-side properties, then perhaps there are client-side versions? - mattyb
At the client side, i.e. the NiFi PutHive3Streaming processor, I have Hive Configuration Resources set to a conf folder containing a hive-site.xml file with the same properties I mentioned above. But I am curious whether the exact same version is being used. I will check this and update tomorrow. - irrelevantUser
I can confirm that I am using the same version (same properties) of hive-site.xml on both NiFi and Hive. - irrelevantUser
Hi, I came across a JIRA link which I believe is the cause of the issue. I observe that the compaction job runs as the 'nifi' user (doAs). While the compaction itself succeeds, the delta files are not deleted by the compactor cleaner, which I think explains the behaviour I described in the original question. - irrelevantUser

2 Answers

1 vote

Hive does not compress data that is inserted through the Streaming Data Ingest API.
The data is compressed when compaction runs.
See https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest+V2#StreamingDataIngestV2-APIUsage

If you don't want to wait, use ALTER TABLE your_table PARTITION (key=value) COMPACT 'MAJOR'.
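
You can watch the compaction (and the cleaner) afterwards with:

SHOW COMPACTIONS;

It lists each compaction request together with its current state.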

0 votes

Yes, @K.M is correct insofar as compaction needs to be used.

a) Hive compaction needs to be used to manage the size of the data; only after compaction is the data compressed with the configured ORC codec. Below are the default thresholds that trigger auto-compaction.

hive.compactor.delta.num.threshold=10
hive.compactor.delta.pct.threshold=0.1
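
Note that auto-compaction also requires the compactor to be enabled on the Metastore side; a sketch of the hive-site.xml entries assumed here (values illustrative):

hive.compactor.initiator.on=true
hive.compactor.worker.threads=1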

b) Even though auto-compaction is on by default, one of the challenges I had was that the delta files written by NiFi were not accessible (deletable) by the compaction cleaner after the compaction itself. I fixed this by making the hive user the owner of the table directory and by giving the hive user access rights to the delta files, in line with the Kerberos setup.
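
Something along these lines, with illustrative group name and path (the ACL route is just one way to grant access):

hdfs dfs -chown -R hive:hadoop /user/hive/warehouse/my_table
hdfs dfs -setfacl -R -m default:user:hive:rwx /user/hive/warehouse/my_table

The default ACL ensures that delta directories created later also carry the hive user's permissions.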

c) Another challenge I continue to face is triggering auto-compaction jobs. In my case, as delta files keep streaming into Hive for a given table/partition, the very first major compaction job completes successfully, deletes the deltas and creates a base file. But after that point, auto-compaction jobs are no longer triggered, and Hive accumulates a huge number of delta files, which have to be cleaned up manually (not desirable).
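
As a stop-gap, a major compaction can be forced periodically from a scheduler, e.g. via beeline (the connection string, table and partition below are placeholders):

beeline -u "jdbc:hive2://hiveserver:10000/default" -n hive \
  -e "ALTER TABLE my_table PARTITION (load_date='2019-01-01') COMPACT 'MAJOR';"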