0
votes

I have a dataset in Spark partitioned by a column ip. I want to split this dataset into two and write it to HDFS so that, if there are 100 partitions in total (ip=1 to ip=100), each HDFS directory ends up containing 50 of them.

Input:

mydata/
mydata/ip=1
mydata/ip=2
mydata/ip=3
mydata/ip=4
.
.
mydata/ip=100

Result:

mydata1/
mydata1/ip=1
mydata1/ip=3
.
.
mydata1/ip=50


mydata2/
mydata2/ip=51
mydata2/ip=4
mydata2/ip=100

Also, while writing out, how can I ensure that each of the directories mydata1 and mydata2 gets an equal share of the data by size? That is, both directories should contain, say, 25 GB each; there shouldn't be a case where mydata1 contains 1 GB and mydata2 contains 49 GB.

Thanks


1 Answer

0
votes

Yes, you can use bucketing. Read more about bucketing here: https://dwgeek.com/spark-sql-bucketing-on-dataframe-examples.html/
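A minimal sketch of what a bucketed write could look like, assuming an existing SparkSession; the input path, app name, and table name below are placeholders, not taken from the original post:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("bucketing-sketch").getOrCreate()

// Read the ip-partitioned input; the ip column is recovered from the directory names.
val df = spark.read.parquet("hdfs:///data/mydata")

// bucketBy hashes the ip column into 2 buckets, so rows are spread by hash
// of ip rather than by an ip range; bucketed writes in Spark go through saveAsTable.
df.write
  .bucketBy(2, "ip")
  .sortBy("ip")
  .format("parquet")
  .saveAsTable("mydata_bucketed")

Note that hash bucketing distributes by ip value, not by bytes, so if a few ip values hold most of the data the two buckets can still come out unequal in size.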