2
votes

As part of my requirement, I have to create a new Hive table and insert into it programmatically. To do that, I have the following DDL to create a Hive table:

CREATE EXTERNAL TABLE IF NOT EXISTS countData (
    tableName String,
    ssn String,
    hiveCount String,
    sapCount String,
    countDifference String,
    percentDifference String,
    sap_UpdTms String,
    hive_UpdTms String)
COMMENT 'This table contains record count of corresponding tables of all the source systems present on Hive & SAP'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '';

To insert data into a partition of a Hive table I can handle using an insert query from the program. Before creating the table, in the above DDL, I haven't added the "PARTITIONED BY" column as I am not totally clear with the rules of partitioning a Hive table. Couple of rules I know are

  1. While inserting the data from a query, partition column should be the last one.
  2. PARTITIONED BY column shouldn't be an existing column in the table.

Could anyone let me know if there are any other rules for partitioning a Hive table ? Also in my case, we run the program twice a day to insert data into the table and every time it runs, there could be 8k to 10k records. I am thinking of adding a PARTITIONED BY column for current date (just "mm/dd/yyyy") and inserting it from the code. Is there a better way to implement the partition idea for my requirement, if adding a date (String format) is not recommended ?

1

1 Answers

2
votes

What you mentioned is fine, but I would recommend yyyyMMdd format because it sorts better and is more standardized than seeing 03/05 and not knowing which is the day, and what is the month.

If you want to run it twice a day, and you care about the time the job runs, then do PARTITIONED BY (dt STRING, hour STRING)

Also, don't use STORED AS TEXT. Use Parquet or ORC instead.