0 votes

I am using PutHive3Streaming to load Avro data from NiFi into Hive. As a test, I am sending 10 MB of JSON data to NiFi, converting it to Avro (which reduces the size to 118 KB) and using PutHive3Streaming to write to a managed Hive table. However, I see that the data is not compressed in Hive.

hdfs dfs -du -h -s /user/hive/warehouse/my_table*
32.1 M  /user/hive/warehouse/my_table  (<-- replication factor 3)
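
One way to check which compression the ORC writer actually applied is the ORC file dump, which prints a Compression: line in its output (the delta file path below is just illustrative):

hive --orcfiledump /user/hive/warehouse/my_table/delta_0000001_0000001/bucket_00000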

At the table level, I have:

STORED AS ORC
  TBLPROPERTIES (
    'orc.compress'='ZLIB',
    'orc.compression.strategy'='SPEED',
    'orc.create.index'='true',
    'orc.encoding.strategy'='SPEED',
    'transactional'='true');
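
For context, the full table definition is shaped roughly like this (column and partition names are placeholders):

CREATE TABLE my_table (
  id STRING,
  payload STRING
)
PARTITIONED BY (load_date STRING)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'orc.compression.strategy'='SPEED',
  'orc.create.index'='true',
  'orc.encoding.strategy'='SPEED',
  'transactional'='true');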

and I have also enabled:

hive.exec.dynamic.partition=true
hive.exec.dynamic.partition.mode=nonstrict
hive.optimize.sort.dynamic.partition=true
avro.output.codec=zlib
hive.exec.compress.intermediate=true
hive.exec.compress.output=true
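
For reference, the same properties can also be set per session (e.g. in Beeline):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.optimize.sort.dynamic.partition=true;
SET hive.exec.compress.intermediate=true;
SET hive.exec.compress.output=true;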

Despite all this, compression does not appear to be applied in Hive. Any pointers on how to enable it?

Are those properties set in the hive-site.xml on the NiFi node(s) and are you pointing to it in the Hive Configuration Resources property? If those are server-side properties, then perhaps there are client-side versions? - mattyb
At the client side, i.e. the NiFi PutHive3Streaming processor, I have Hive Configuration Resources set to a conf folder containing a hive-site.xml file with the same properties I mentioned above. But I am curious whether the exact same version is being used. I will check this and update tomorrow. - irrelevantUser
I can confirm that I am using the same version (same properties) of hive-site.xml on both NiFi and Hive. - irrelevantUser
Hi, I came across a JIRA link which I believe is the cause of the issue. I observe that the compaction job runs as the 'nifi' user (doAs). While the compaction itself succeeds, the delta files are not deleted by the compactor cleaner, which I think explains the behaviour I described in the original question. - irrelevantUser

2 Answers

1 vote

Hive does not compress data that is inserted through the Streaming Data Ingest API.
The data is compressed when compaction runs.
See https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest+V2#StreamingDataIngestV2-APIUsage

If you don't want to wait, use ALTER TABLE your_table PARTITION (key=value) COMPACT 'MAJOR'.
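
You can watch the compaction (and the cleaner) afterwards with:

SHOW COMPACTIONS;

It lists each compaction request together with its current state.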

0 votes

Yes, @K.M is correct insofar as compaction needs to be used.

a) Hive compaction needs to be used to manage the size of the data; only after compaction is the data compressed with the configured ORC codec. Below are the default thresholds that trigger auto-compaction.

hive.compactor.delta.num.threshold=10
hive.compactor.delta.pct.threshold=0.1
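
Note that auto-compaction also requires the compactor to be enabled on the Metastore side; a sketch of the hive-site.xml entries assumed here (values illustrative):

hive.compactor.initiator.on=true
hive.compactor.worker.threads=1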

b) Even though auto-compaction is on by default, one of the challenges I had was that the delta files written by NiFi were not accessible (deletable) by the compaction cleaner after the compaction itself. I fixed this by making the hive user the owner of the table directory and by giving the hive user access rights to the delta files, in line with the Kerberos setup.
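
Something along these lines, with illustrative group name and path (the ACL route is just one way to grant access):

hdfs dfs -chown -R hive:hadoop /user/hive/warehouse/my_table
hdfs dfs -setfacl -R -m default:user:hive:rwx /user/hive/warehouse/my_table

The default ACL ensures that delta directories created later also carry the hive user's permissions.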

c) Another challenge I continue to face is triggering auto-compaction jobs. In my case, as delta files keep streaming into Hive for a given table/partition, the very first major compaction job completes successfully, deletes the deltas and creates a base file. But after that point, auto-compaction jobs are no longer triggered, and Hive accumulates a huge number of delta files, which have to be cleaned up manually (not desirable).
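
As a stop-gap, a major compaction can be forced periodically from a scheduler, e.g. via beeline (the connection string, table and partition below are placeholders):

beeline -u "jdbc:hive2://hiveserver:10000/default" -n hive \
  -e "ALTER TABLE my_table PARTITION (load_date='2019-01-01') COMPACT 'MAJOR';"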