Folks,
What are the recommended file formats for the different phases of Hadoop processing?

Processing: I have been using the text format with a JSON SerDe in Hive for processing. Is this a good format for the staging table where I perform the ETL (transformation) work, or is there a better format I should be using? I know Parquet, ORC, and Avro are specialized formats, but do they fit well for ETL (transformation) operations? Also, if I use a compression codec such as Snappy or Zlib, would that be a recommended approach? (I don't want to lose performance because of the extra CPU spent on compression; correct me if compression actually improves performance.)
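
For reference, a simplified sketch of the kind of text/JSON staging table I mean (table name, column names, and location are illustrative):

```sql
-- Staging table read as plain text through the JSON SerDe
-- (hive-hcatalog-core); names and path are made up for illustration.
CREATE EXTERNAL TABLE staging_events (
  event_id   STRING,
  event_time STRING,
  payload    STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE
LOCATION '/data/staging/events';
```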

Reporting: depending on my query needs:
Aggregation: columnar storage seems like the logical choice. Is Parquet with Snappy compression a good fit (my Hadoop distribution is Cloudera)? See the sketch after this list.
Complete row fetch: if my query pattern needs all columns of a row, is columnar storage still a wise choice, or should I choose the Avro file format instead?
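
To make the aggregation case concrete, a minimal sketch of the kind of Parquet reporting table I have in mind (names are illustrative; the compression setting is the part I am unsure about):

```sql
-- Hypothetical reporting table stored as Parquet; depending on the Hive
-- version, Snappy can be requested via this session setting or via
-- TBLPROPERTIES ('parquet.compression'='SNAPPY').
SET parquet.compression=SNAPPY;

CREATE TABLE sales_agg (
  sale_date STRING,
  region    STRING,
  revenue   DOUBLE
)
STORED AS PARQUET;
```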

Archive: for archiving data I plan to use Avro, since it handles schema evolution and compresses well.
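
A minimal sketch of how I would expect the archive write to look (assuming Hive 0.14+ for STORED AS AVRO; the codec property names may vary by version):

```sql
-- Hypothetical archive table written as compressed Avro;
-- staging_events and the codec settings are assumptions for illustration.
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

CREATE TABLE events_archive
  STORED AS AVRO
AS
SELECT * FROM staging_events;
```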

1 Answer

Choosing the file format depends on the use case. Since you are processing data in Hive, here are my recommendations.

Processing: use ORC for processing, since you are doing aggregations and other column-level operations. It can improve performance many-fold.
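
A minimal sketch of what that could look like (assumed table and column names, Hive DDL only; Zlib is the ORC default codec, spelled out here just to make the choice explicit):

```sql
-- Hypothetical ETL target table stored as ORC with the default Zlib codec;
-- names are illustrative.
CREATE TABLE events_etl (
  event_id   STRING,
  event_time TIMESTAMP,
  region     STRING,
  amount     DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB');
```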

Compression: applied judiciously on a case-by-case basis, it improves performance by reducing expensive I/O time.
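
For example, two knobs worth checking (exact behaviour depends on your Hive version, so treat this as a sketch):

```sql
-- Compress intermediate data between job stages to cut shuffle I/O.
SET hive.exec.compress.intermediate=true;

-- Pick Snappy instead of Zlib for an ORC table when write speed and CPU
-- matter more than on-disk size (table name is illustrative).
CREATE TABLE events_etl_snappy (
  event_id STRING,
  amount   DOUBLE
)
STORED AS ORC
TBLPROPERTIES ('orc.compress'='SNAPPY');
```

Snappy in particular is designed to be cheap on CPU, so the extra CPU spent compressing is usually outweighed by the I/O it saves; it is still worth benchmarking on your own data.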

If the use case is row-based operations, then Avro is recommended.
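
A minimal sketch, assuming Hive 0.14+ where STORED AS AVRO is available (names are illustrative):

```sql
-- Hypothetical row-oriented table stored as Avro; full-row fetches read
-- every column anyway, so a columnar layout gives little benefit here.
CREATE TABLE customer_profile (
  customer_id STRING,
  name        STRING,
  address     STRING,
  last_login  STRING
)
STORED AS AVRO;
```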

Hope this helps you decide.