I have a DataFrame record that is updated every time a process runs, which means I get a DataFrame of one row and four columns each time the process completes. I then insert it into a Hive table using the DataFrame writer in Parquet format. Because only one record is written at a time, I'm ending up with a large number of small files in the table folder in HDFS.
Could you please let me know how to reduce the number of files and write the data into the same Parquet file when writing to the Hive table?
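For reference, here is a minimal sketch of the write I do on each run (the column names and values are just placeholders, not my real schema):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.enableHiveSupport().getOrCreate()

    # One row, four columns produced per process run (example values only)
    record_df = spark.createDataFrame(
        [("run_123", "market_a", 42.0, "2024-01-01")],
        ["run_id", "market", "value", "run_date"],
    )

    # Each append creates its own small part file under the table folder
    record_df.write.mode("append").format("parquet").saveAsTable("employe_db.market_table")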
hdfs location: user_id/employe_db/market_table/
From (what I have now, many small part files):
part-04498-f33fc4b5-47d9-4d14-b37e-8f670cb2c53c-c000.snappy.parquet
part-04497-f33fc4b5-47d9-4d14-b37e-8f670cb2c53c-c000.snappy.parquet
part-04496-f33fc4b5-47d9-4d14-b37e-8f670cb2c53c-c000.snappy.parquet
part-04450-f33fc4b5-47d9-4d14-b37e-8f670cb2c53c-c000.snappy.parquet
part-04449-f33fc4b5-47d9-4d14-b37e-8f670cb2c53c-c000.snappy.parquet
To (what I'd like, a small fixed number of files, for example):
part-03049-f33fc4b5-47d9-4d14-b37e-8f670cb2c53c-c000.snappy.parquet
How can I reduce the number of Parquet files to a small, fixed number and load/write the new data into the existing files, instead of creating a brand-new part file (such as part-04499-f33fc4b5-47d9-4d14-b37e-8f670cb2c53c-c000.snappy.parquet) on every run?
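One idea I've been considering (but have not verified) is to periodically compact the table by reading it back, coalescing to a small fixed number of partitions, and writing the result to a staging table that would then replace the original; the staging table name below is made up:

    # Hypothetical compaction pass, run occasionally rather than on every record
    compacted = spark.read.table("employe_db.market_table").coalesce(1)

    # Write to a separate staging table (can't overwrite the table being read),
    # then swap it in place of market_table afterwards
    compacted.write.mode("overwrite").format("parquet").saveAsTable(
        "employe_db.market_table_compacted"
    )

Is this the right direction, or is there a way to make the regular write itself go into existing files?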