Hive to Vertica data export with Unix named pipe

Question

Can someone please help me with how to do a large, fast export from Hive to Vertica without any Hadoop connector?

Currently I am exporting the data via a Unix named pipe, but the performance is not that good.

With about 5 parallel threads loading the data into Vertica, it takes approximately 230 minutes for a 1.6 billion record set.

Can someone please help me improve this performance, and tell me whether this export can be optimised?

Thanks abhi

It's unclear what you're asking, maybe provide some example code or further details. — EternalHour
Hey, actually we are planning to migrate big tables (around 3 billion records) from Hive to Vertica with the help of a Unix named pipe. First we run select col1, col2 from the Hive table and redirect the output into a pipe created with mkfifo. After this we open the Vertica connection and start the COPY, roughly like cat <pipe> | COPY into Vertica. We run this with 5 parallel threads on 5 nodes in Vertica. — abhishek rastogi
@abhishekrastogi is this a one time thing or will happen frequently? — Kermit

Guillaume Guillaume · Accepted Answer · 2014-11-20T08:05:37

We are doing this, not using a named pipe (mkfifo) but a standard anonymous shell pipe:

hive -e "select whatever FROM wherever" | \
dd bs=1M | \
/opt/vertica/bin/vsql -U $V_USERNAME -w $V_PASSWORD -h $HOST $DB -c \
"COPY schema.table FROM LOCAL STDIN DELIMITER E'\t' NULL 'NULL' DIRECT"

This works perfectly fine for us. Note the 'dd' between hive and vsql; it is mandatory for this to work properly. It is hard to give you good numbers for this, because our Hive select statement is not trivial and I do not know where the time is spent (Hive processing or data loading).
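
As a rough baseline, the figures from the question (1.6 billion rows, about 230 minutes, 5 threads; all three numbers are the asker's and are approximate) can be turned into a rows-per-second rate. This is only back-of-the-envelope integer arithmetic, and it assumes the five threads share the work evenly:

```shell
# Figures taken from the question; treat them as approximate.
rows=1600000000
minutes=230
threads=5

# Integer division, so the results are rounded down.
rows_per_sec=$(( rows / (minutes * 60) ))
rows_per_sec_per_thread=$(( rows_per_sec / threads ))

echo "total: ${rows_per_sec} rows/s, per thread: ${rows_per_sec_per_thread} rows/s"
# prints: total: 115942 rows/s, per thread: 23188 rows/s
```

Timing the hive query on its own (for example with its output sent to /dev/null) and comparing it against this rate would show whether the query or the load is the slower side.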

But to be honest, using a named pipe as you do, or an unnamed pipe as we do, is a good way to do it, and there is not much you can optimise at the system level. There are a few things to take into consideration, though:

The time to compute your Hive query.
Where you run your query. If you run it from a third-party machine, for instance, data needs to flow from Hive to that machine and then on to Vertica. Running the command on the Hive server or on a Vertica node might speed things up by skipping an unnecessary hop.
The COPY statement: do you use DIRECT?
And of course the usual: projections (multiple projections slow the load down), Vertica resources, and so on.
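
For the parallel named-pipe variant described in the question, the plumbing could look roughly like the sketch below. It is not our production script. The functions produce_slice and load_slice are hypothetical stand-ins so that the sketch runs without Hive or Vertica installed, and how you split the Hive query into slices (by partition, by key range, ...) is an assumption left to you. In a real run, produce_slice would be the hive -e call for one slice and load_slice would be the vsql COPY ... FROM LOCAL STDIN call shown above.

```shell
# Sketch: one named pipe per stream, a writer and a reader on each,
# all running in the background, then a wait on every job.
STREAMS=5
workdir=$(mktemp -d)

# Hypothetical stand-in for: hive -e "select ... <filter for slice $1>"
produce_slice() {
    for i in 1 2 3; do
        printf '%s\t%s\n' "$1" "row$i"
    done
}

# Hypothetical stand-in for: vsql ... -c "COPY ... FROM LOCAL STDIN ..."
# It only counts the rows it receives on stdin.
load_slice() {
    wc -l | tr -d ' ' > "$workdir/loaded.$1"
}

pids=""
n=1
while [ "$n" -le "$STREAMS" ]; do
    fifo="$workdir/slice.$n.fifo"
    mkfifo "$fifo"
    # The writer blocks on opening the pipe until the reader opens it too.
    produce_slice "$n" > "$fifo" &
    pids="$pids $!"
    load_slice "$n" < "$fifo" &
    pids="$pids $!"
    n=$(( n + 1 ))
done

# Wait for each job individually so that a failed stream is noticed.
failed=0
for pid in $pids; do
    wait "$pid" || failed=1
done

# Sum the per-stream row counts written by the stand-in loader.
total=0
n=1
while [ "$n" -le "$STREAMS" ]; do
    count=$(cat "$workdir/loaded.$n")
    total=$(( total + count ))
    n=$(( n + 1 ))
done

rm -rf "$workdir"
echo "loaded $total rows over $STREAMS streams (failed=$failed)"
```

Waiting on each PID separately matters here: a plain wait with no arguments would hide a stream whose hive or vsql process exited with an error.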
