
I want to create a table with static data, such as country codes and names, in HDFS. I will use a CSV file to load the data into the system. It doesn't matter if I drop the table and the data, because this is information you can easily find on the Internet.

Are there any performance considerations around external versus internal tables for this type of data? Should I stick with external tables, as everyone in that post suggests?
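For concreteness, here is a minimal sketch of the two options I am weighing (the table names, columns, and HDFS path are made up):

```sql
-- External: Hive tracks only metadata; DROP TABLE leaves the CSV files in HDFS.
CREATE EXTERNAL TABLE country_codes_ext (
  code STRING,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/static/country_codes';

-- Internal (managed): Hive owns the storage; DROP TABLE deletes the data too.
CREATE TABLE country_codes_int (
  code STRING,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
```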

Performance-wise, there should not be a difference. In the post you linked, people have pointed out the main difference: when you drop an external table, the data is not removed, while dropping an internal table removes it. – Stephen ODonnell

1 Answer


As Stephen ODonnell pointed out in the comments, internal/external is really more about the location of the data and what manages it.

I would say there are other important performance factors to consider, such as the table format and whether compression is used.

(The following is from an HDP perspective; for Cloudera the general concept is the same, but the specifics would probably differ.)

For example, you could define the table in ORC format, which offers many optimizations, such as predicate pushdown, which allows rows to be filtered out at the storage layer before they even reach the SQL processing layer. The ORC documentation has more details on that.
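To make that concrete, here is a minimal sketch of an ORC-backed table and a query that can benefit from predicate pushdown (table and column names are hypothetical):

```sql
-- ORC stores min/max statistics per stripe, so Hive can skip whole stripes
-- that cannot possibly match the predicate below.
CREATE TABLE country_codes_orc (
  code STRING,
  name STRING
)
STORED AS ORC;

-- With predicate pushdown, this filter is evaluated against ORC's stored
-- statistics before the rows ever reach the SQL processing layer.
SELECT name FROM country_codes_orc WHERE code = 'FR';
```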

Another consideration is whether to specify compression, such as Snappy, a compression codec that balances speed and compression ratio (see the ORC documentation mentioned above for more info).
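As a sketch, Snappy can be enabled on an ORC table through a table property (in Hive, "orc.compress" defaults to ZLIB; SNAPPY and NONE are the other common values):

```sql
-- Hypothetical example: the ORC table from above with Snappy compression.
CREATE TABLE country_codes_orc_snappy (
  code STRING,
  name STRING
)
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY");
```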

Generally speaking, I treat the HDFS data as a source and sqoop it into a managed (internal) Hive table with ORC format and Snappy compression enabled. I find that this provides good performance, with the added benefit that any ETL can be done on the data without regard for the original source data in HDFS, since it was copied into Hive during the sqoop.
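Sqoop itself imports from relational databases, so for a CSV that already sits in HDFS (as in your question) the equivalent copy step can be sketched purely in HiveQL, assuming the external staging table from your question:

```sql
-- Hypothetical copy step: materialize a managed ORC + Snappy copy that Hive
-- fully owns; the original CSV under the external table is left untouched.
CREATE TABLE country_codes_managed
STORED AS ORC
TBLPROPERTIES ("orc.compress" = "SNAPPY")
AS SELECT code, name FROM country_codes_ext;
```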

This does, of course, require extra space, which may be a consideration depending on your environment or specific use case.