3
votes

I have S3 bucket that contains csv files (see 'Data sample').

The files were created as a result of an Athena query.

I want to treat the 'time' column as decimal(17,7) and the 'size' column as bigint.

I define a table (see 'Table definition') and using the query (see 'Query' below) I get the 'Error' below.

My questions are:
Is there a way to control the Athena query output format and get a csv without the quotes around the fields? (1545980216.9581780,1349 vs "1545980216.9581780","1349")

Assuming the answer to the previous question is negative. How can I query the data in the table '`del_me_later_4' and treat the fields as decimal and bigint.

Data sample

"time","size"
"1545751457.8957720","432"
"1545751458.0753407","583"
"1545751458.1407920","583"
"1545751458.1683733","560"

Table definition

CREATE EXTERNAL TABLE `del_me_later_4`(
  `time` string, 
  `size` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://del-me-later/'
TBLPROPERTIES (
  'skip.header.line.count'='1'  
)

Query

SELECT cast(time AS decimal(17,7)) as time,
       cast(size AS bigint) as size  
FROM "del_me_later_4"

Error

INVALID_CAST_ARGUMENT: Cannot cast VARCHAR '"1545751457.8957720"' to DECIMAL(17, 7)
1

1 Answers

1
votes

I found a solution. I use org.apache.hadoop.hive.serde2.OpenCSVSerde.

When you use Athena with OpenCSVSerde, the SerDe converts all column types to STRING. Next, the parser in Athena parses the values from STRING into actual types based on what it finds. For example, it parses the values into BOOLEAN, BIGINT, INT, and DOUBLE data types when it can discern them. If the values are in TIMESTAMP in the UNIX format, Athena parses them as TIMESTAMP. If the values are in TIMESTAMP in Hive format, Athena parses them as INT. DATE type values are also parsed as INT.

See https://docs.aws.amazon.com/athena/latest/ug/csv.html

''' CREATE EXTERNAL TABLE `{}`(
  `time` string, 
  `size` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ","
)
LOCATION
  's3://{}/{}'
TBLPROPERTIES (
  'skip.header.line.count'='1'  
)