How can I write the data in a DataFrame into a single .parquet file (both data and metadata in one file) in HDFS?
df.show() --> 2 rows
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+
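For reference, a minimal sketch that builds an equivalent DataFrame with the two rows shown above (the SparkSession setup, the variable name spark, and the explicit schema are assumptions for reproduction only, assuming PySpark 2.x):
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, LongType, StringType, StructField, StructType
# Hypothetical local session, only needed to reproduce the example
spark = SparkSession.builder.appName("single-parquet-file-question").getOrCreate()
schema = StructType([
    StructField("name", StringType()),
    StructField("favorite_color", StringType()),
    StructField("favorite_numbers", ArrayType(LongType())),
])
# Same two rows as in df.show() above
df = spark.createDataFrame(
    [("Alyssa", None, [3, 9, 15, 20]), ("Ben", "red", [])],
    schema,
)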
df.rdd.getNumPartitions() shows that the DataFrame has 1 partition:
>>> df.rdd.getNumPartitions()
1
df.write.save("/user/hduser/data_check/test.parquet", format="parquet")
If I use the above command to create a Parquet file in HDFS, it creates the directory "test.parquet" in HDFS, and inside that directory multiple files are saved: the .parquet part file plus the metadata and _SUCCESS files.
Found 4 items
-rw-r--r--   3 bimodjoul biusers          0 2017-03-15 06:47 /user/hduser/data_check/test.parquet/_SUCCESS
-rw-r--r--   3 bimodjoul biusers        494 2017-03-15 06:47 /user/hduser/data_check/test.parquet/_common_metadata
-rw-r--r--   3 bimodjoul biusers        862 2017-03-15 06:47 /user/hduser/data_check/test.parquet/_metadata
-rw-r--r--   3 bimodjoul biusers        885 2017-03-15 06:47 /user/hduser/data_check/test.parquet/part-r-00000-f83a2ffd-38bb-4c76-9f4c-357e43d9708b.gz.parquet
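Even coalescing to a single partition before the write, as in the sketch below (the output path here is purely illustrative), only controls the number of part-* files; the result is still a directory containing one part file plus the _SUCCESS marker and, depending on the Spark version, the metadata summary files:
# Coalesce to one partition so only one part-* file is produced;
# Spark still writes a directory, not a single standalone .parquet file.
df.coalesce(1).write.mode("overwrite").parquet("/user/hduser/data_check/test_single.parquet")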
How can I write the data in the DataFrame into a single .parquet file (both data and metadata in one file) in HDFS, rather than a folder containing multiple files?
Help would be much appreciated.