Folks,
What file formats are recommended for the different phases of a Hadoop pipeline?
Processing: I have been using plain text with a JSON SerDe in Hive. Is that a good format for the staging tables where I run my ETL (transformation) work, or is there a better one I should be using? I know Parquet, ORC, and Avro are specialized formats, but do they fit transformation-heavy workloads well? Also, would a compression codec such as Snappy or Zlib be a recommended approach? I don't want to lose performance to the extra CPU spent on compression, but correct me if compression actually comes out ahead because of the reduced I/O.
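For context, here is roughly what my staging table looks like today, plus the ORC alternative I am weighing. Table and column names are made up for illustration, and the JsonSerDe class shown is the one bundled with Hive's hcatalog jar (distributions differ):

-- Current staging approach: plain text files read through a JSON SerDe
CREATE TABLE staging_events (
  event_id   STRING,
  event_time STRING,
  payload    STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

-- Alternative I am considering: rewrite into ORC with a codec chosen
-- per table ('ZLIB' trades CPU for smaller files, 'SNAPPY' the reverse)
CREATE TABLE staging_events_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress'='ZLIB')
AS SELECT * FROM staging_events;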
Reporting: this depends on my query patterns, which fall into two cases.
Aggregation:
Columnar storage seems like the logical choice here. Is Parquet with Snappy compression a good fit (my Hadoop distribution is Cloudera)?
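Something like this is what I have in mind for the aggregation case (table and column names are hypothetical; parquet.compression is the Hive-side setting, and I understand Impala has its own COMPRESSION_CODEC option):

-- Snappy codec for Parquet files written by Hive
SET parquet.compression=SNAPPY;

-- Hypothetical reporting table
CREATE TABLE sales_parquet (
  sale_date STRING,
  region    STRING,
  amount    DOUBLE
)
STORED AS PARQUET;

-- An aggregation like this should only scan the region and amount
-- columns rather than whole rows
SELECT region, SUM(amount) AS total
FROM sales_parquet
GROUP BY region;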
Complete row fetch:
If my query pattern needs every column of a row, is columnar storage still a wise choice, or should I use the Avro file format instead?
Archive: For archiving I plan to use Avro, since it handles schema evolution and compresses well.
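A sketch of the archive table I am picturing (table name and schema path are placeholders): the schema lives in an external .avsc file that avro.schema.url points to, so evolving it means publishing a new version with defaulted fields, and old files stay readable.

-- Compress Avro output written by Hive (deflate and snappy are supported)
SET hive.exec.compress.output=true;
SET avro.output.codec=snappy;

-- Column list is derived from the .avsc the table points to; adding a
-- field with a default there evolves the schema without rewriting
-- existing files
CREATE TABLE archive_events
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/archive_events.avsc');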