I have created a hive table with gender as bucket column.
create table userinfoBucketed(userid INT,age INT,gender STRING,occupation STRING,zipcode STRING) CLUSTERED BY(gender) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
loading following data from text file into table (user id | age | gender | occupation | zip code) :
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703
I have set the hive.enforce.bucketing property to true; set hive.enforce.bucketing=true;
1, when inserted data into table using load command, buckets are not created. all the data stored in one bucket
load data local inpath '/home/mainnode/u.user' into table userinfobucketed;
Question1, Why the data is not split into 2 buckets?
2, when inserted data into table from other table, data stored in 2 buckets. here is the command I have executed:
insert into table userinfobucketed select * from userinfo where gender='M';
Now bucket1(000000_0) has below data: 1|24|M|technician|85711 4|24|M|technician|43537 6|42|M|executive|98101 7|57|M|administrator|91344
bucket2(000001_0) has below data: 3|23|M|writer|32067
Question2, I do not understand why data got stored into 2 buckets even though all same records has the same gender.
Then I again inserted data into the table using the below command. insert into table userinfobucketed select * from userinfo where gender='F';
Now 2 more extra buckets (000000_0_copy_1,000001_0_copy_1) are created and data stored into those instead of inserting data into existing buckets. Now that makes total buckets to 4 even though create table is configured into 2 buckets.
Question3 ; Why the extra buckets got created into of copying into existing buckets
please clarify
Thanks Sean