
I have a CSV file with some data in an S3 bucket in my cloud.

I would like to export that data into a DynamoDB table with columns "key" and "value".

Here's the current Hive script I wrote:

-- external table over the CSV files in S3
CREATE EXTERNAL TABLE FromCSV(key string, value string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://mybucket/output/';

-- external table backed by the DynamoDB table "InvertedIndex"
CREATE EXTERNAL TABLE hiveTransfer(col1 string, col2 string)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ("dynamodb.table.name" = "InvertedIndex",
"dynamodb.column.mapping" = "col1:key,col2:value");

-- copy everything from the CSV table into DynamoDB
INSERT OVERWRITE TABLE hiveTransfer SELECT * FROM FromCSV;

Now, the script basically works, though I would like to make some modifications to it as follows:

1) The script works only if the table "InvertedIndex" already exists in DynamoDB. I would like the script to create the table by itself and then put the data in as it already does.

2) In the CSV the key is always a string, but there are two kinds of values, string or integer. I would like the script to distinguish between the two and load them into two different tables.

Any help with these two modifications will be appreciated. Thank you.


1 Answer


Hi, this can be accomplished, but it is not a trivial case.

1) Creating the DynamoDB table can't be done from Hive itself, because DynamoDB tables are managed by the Amazon cloud. One idea is to write a Hive UDF that creates the DynamoDB table and to call it inside a dummy query before running the insert. For example:

    SELECT CREATE_DYNO_TABLE() FROM dummy;

where the dummy table has only one record.
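A minimal sketch of what such a UDF could look like, assuming the AWS SDK for Java is on the classpath; the class name CreateDynoTable, the hash key attribute "key", and the throughput numbers are illustrative choices, not anything fixed by Hive or DynamoDB:

    import org.apache.hadoop.hive.ql.exec.UDF;
    import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClient;
    import com.amazonaws.services.dynamodbv2.model.AttributeDefinition;
    import com.amazonaws.services.dynamodbv2.model.CreateTableRequest;
    import com.amazonaws.services.dynamodbv2.model.KeySchemaElement;
    import com.amazonaws.services.dynamodbv2.model.KeyType;
    import com.amazonaws.services.dynamodbv2.model.ProvisionedThroughput;
    import com.amazonaws.services.dynamodbv2.model.ScalarAttributeType;

    // Hypothetical UDF that creates a DynamoDB table with a string hash key named "key".
    public class CreateDynoTable extends UDF {
        public String evaluate(String tableName) {
            // Credentials come from the default provider chain (e.g. the EMR instance role).
            AmazonDynamoDBClient client = new AmazonDynamoDBClient();
            CreateTableRequest request = new CreateTableRequest()
                    .withTableName(tableName)
                    .withKeySchema(new KeySchemaElement("key", KeyType.HASH))
                    .withAttributeDefinitions(new AttributeDefinition("key", ScalarAttributeType.S))
                    .withProvisionedThroughput(new ProvisionedThroughput(5L, 5L)); // placeholder capacity
            client.createTable(request);
            // createTable returns while the table is still CREATING; in practice,
            // poll describeTable until the table is ACTIVE before running the insert.
            return tableName;
        }
    }

You would then package the class as a JAR, register it, and call it before the insert, roughly like this (the JAR path is a placeholder):

    ADD JAR s3://mybucket/jars/create-dyno-table.jar;
    CREATE TEMPORARY FUNCTION CREATE_DYNO_TABLE AS 'CreateDynoTable';
    SELECT CREATE_DYNO_TABLE('InvertedIndex') FROM dummy;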

2) You can split the loading into two queries: in one query, use the RLIKE operator with an anchored regular expression such as '^[0-9]+$' to detect numeric values (RLIKE matches anywhere in the string, so the anchors matter), and in the other just the negation of that. A sketch of the split follows.
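Here is that split, assuming two DynamoDB-backed tables declared like hiveTransfer (the names hiveTransferInt and hiveTransferStr, and the DynamoDB tables behind them, are hypothetical):

    -- rows whose value is all digits go to the "integer" table
    INSERT OVERWRITE TABLE hiveTransferInt
    SELECT key, value FROM FromCSV
    WHERE value RLIKE '^[0-9]+$';

    -- everything else goes to the "string" table
    INSERT OVERWRITE TABLE hiveTransferStr
    SELECT key, value FROM FromCSV
    WHERE NOT (value RLIKE '^[0-9]+$');

If you want DynamoDB to store the numeric values as numbers rather than strings, declare col2 as bigint in the integer table's definition and use CAST(value AS BIGINT) in the SELECT.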

HTH, Dino