I have pipe-delimited, un-partitioned data as text files on Amazon S3, which I unloaded from Redshift to free up space on our Redshift nodes. I have created a table in Athena to access this data, but I would like to optimize it for performance. I am trying to convert the data to Parquet format using AWS EMR and store the results as Parquet files in an S3 bucket. I tried to follow the instructions on the AWS website, but found them a little confusing.
2 Answers
Now you can simplify the process by using Redshift UNLOAD to export data directly in Parquet format; you will not need EMR or Athena at all:
UNLOAD ('select-statement')
TO 's3://object-path/name-prefix'
FORMAT PARQUET
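For example, a filled-in version might look like the following (the table name, S3 path, and IAM role ARN are placeholders; UNLOAD also needs authorization, typically via IAM_ROLE):
UNLOAD ('SELECT * FROM my_schema.my_table')
TO 's3://mybucket/parquet/my_table_'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftUnloadRole'
FORMAT AS PARQUET;
UNLOAD also accepts a PARTITION BY clause if you want the Parquet output partitioned for Athena.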
Documentation can be found at UNLOAD - Amazon Redshift
Without specific details about the data you are trying to convert or the errors you are running into, it is difficult to zero in on exactly what would help you most, but here are some tips from my experience:
1. Figure out Raw Data Appearance
Most likely, the data you will be converting is stored in Amazon S3 buckets, typically with each distinct table in its own folder and the contents of each folder split across multiple compressed text files. Understanding this format is critical to writing the correct Hive commands in the next step.
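If you are unsure what the unloaded files actually look like, a quick sanity check is to list the folder and stream the first few rows of one file to your terminal. The file name below is only an illustration; use whatever names aws s3 ls returns:
aws s3 ls s3://yourdata/raw/table/
aws s3 cp s3://yourdata/raw/table/0000_part_00.gz - | gunzip -c | head -n 5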
2. Write Parquet Conversion DDL Script
The key element in the Parquet conversion process is the Hive DDL (Data Definition Language) script, which describes the schema of the table (including data types) and points to where the input data is stored and where the output data should end up. Typically, a "ddl" file that describes the structure of the raw data is provided alongside the data, or the schema is documented somewhere else. As an example, here is a Parquet conversion DDL script for an S3 bucket "folder" with a specific schema:
ADD JAR /usr/lib/hive-hcatalog/share/hcatalog/hive-hcatalog-core-1.0.0-amzn-5.jar;
CREATE EXTERNAL TABLE offer (
offer_sid VARCHAR(36),
created TIMESTAMP,
offer_id VARCHAR(100),
offer_position VARCHAR(100),
page_position VARCHAR(100),
vertical VARCHAR(100),
partner VARCHAR(100),
card_name VARCHAR(255))
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://yourdata/raw/table' ;
CREATE EXTERNAL TABLE parquet_hive (
offer_sid VARCHAR(36),
created TIMESTAMP,
offer_id VARCHAR(100),
offer_position VARCHAR(100),
page_position VARCHAR(100),
vertical VARCHAR(100),
partner VARCHAR(100),
card_name VARCHAR(255))
STORED AS PARQUET
LOCATION 's3://yourdata/parquet/table/';
INSERT OVERWRITE TABLE parquet_hive SELECT * FROM offer;
Here, the first line adds the Hive HCatalog core JAR to the session; the version in the JAR path (1.0.0-amzn-5 in this example) needs to match the Hive version installed on your cluster. It is very important to note that the version of Hive you choose here will determine which data types are supported. Check here to see the different Hive versions offered by Amazon.
The next block of lines creates the table as specified by the DDL statements. The location of the raw data is indicated with "LOCATION" and more importantly, the format of the data (determined in step (1)) is indicated with "ROW FORMAT." Here, the SerDe (Serializer/Deserializer) interface OpenCSVSerde nicely handles data values in rows that are stored in double quotes. By default, OpenCSVSerde uses double quotes as the quote character and commas as the delimiter. Note that not all raw data formats will require this SerDe interface; the double quote situation is a bit of a special case.
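One caveat for the pipe-delimited files in the original question: OpenCSVSerde's defaults expect commas, so you would override the separator (and, if needed, the quote character) with SERDEPROPERTIES. Here is a sketch of how the ROW FORMAT clause of the first table might change (these are the standard OpenCSVSerde property names; adjust them to match your actual files):
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = '|',
    'quoteChar'     = '"'
)
If the raw files contain no quoting at all, plain ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' is an even simpler alternative.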
The final block of lines defines the schema of the output Parquet table (indicated with "STORED AS PARQUET") and the S3 location where the Parquet files will end up, and then the INSERT OVERWRITE statement populates that table with the data from the first table.
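If you want to squeeze out more performance, two optional tweaks to the output table are worth knowing about: Parquet compression can be set through TBLPROPERTIES, and the output can be partitioned so Athena scans less data. Below is a sketch assuming a partition key derived from the created column; the partition column and output path are illustrative, not part of the original example:
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

CREATE EXTERNAL TABLE parquet_hive_partitioned (
    offer_sid VARCHAR(36),
    created TIMESTAMP,
    offer_id VARCHAR(100),
    offer_position VARCHAR(100),
    page_position VARCHAR(100),
    vertical VARCHAR(100),
    partner VARCHAR(100),
    card_name VARCHAR(255))
PARTITIONED BY (created_date STRING)
STORED AS PARQUET
LOCATION 's3://yourdata/parquet/table_partitioned/'
TBLPROPERTIES ('parquet.compression' = 'SNAPPY');

INSERT OVERWRITE TABLE parquet_hive_partitioned PARTITION (created_date)
SELECT offer_sid, created, offer_id, offer_position, page_position,
       vertical, partner, card_name, to_date(created) AS created_date
FROM offer;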
3. Execute Parquet Script on EMR Cluster
Once you have written the table-specific Parquet conversion script, upload it to the appropriate S3 bucket in a separate folder (in s3://yourdata/parquet_scripts as write-parquet-table.q in this example). The script will be executed on an EMR cluster with Hive installed. Here is an example command to start up an EMR cluster and execute the script using the AWS CLI (command line interface):
aws emr create-cluster --enable-debugging --log-uri s3://yourdata/parquetLogs \
  --applications Name=Hadoop Name=Hive Name=HCatalog \
  --ec2-attributes KeyName=your_pemkey,InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-888777ff \
  --service-role EMR_DefaultRole --release-label emr-5.3.0 \
  --instance-type m4.xlarge --instance-count 1 \
  --steps Type=HIVE,Name="Convert offer to Parquet",ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,s3://yourdata/parquet_scripts/write-parquet-table.q,-hiveconf,INPUT=s3://yourdata/raw/table/,-hiveconf,OUTPUT=s3://yourdata/parquet/table,-hiveconf,REGION=us-east-1] \
  --region us-east-1 --auto-terminate --profile your_profile
Along with supplying your S3 bucket details, pem key name, and AWS CLI profile, you will need to change SubnetId=subnet-888777ff to a valid subnet: identify an available VPC subnet in your AWS console (or with the CLI, as shown below) and replace subnet-888777ff in the command above with its ID.
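If you prefer the CLI to the console for this, the following lists the subnets in your account so you can pick an ID (the --query expression just trims the output):
aws ec2 describe-subnets \
  --query 'Subnets[].{SubnetId:SubnetId,VpcId:VpcId,AZ:AvailabilityZone}' \
  --output table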
This particular command spins up an EMR cluster running on an m4.xlarge EC2 instance, writes debugging logs to s3://yourdata/parquetLogs, executes the Parquet conversion script "write-parquet-table.q," and uses the AWS CLI credentials profile "your_profile". Depending on how much raw data there is and the size of instance you use, the amount of time it takes to complete the Parquet conversion process will vary considerably. After the script finishes, the EMR cluster will terminate itself and the Parquet-formatted data will be ready to be imported into an interface such as Amazon Athena.
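One small mismatch worth noting: the step above passes -hiveconf INPUT, OUTPUT, and REGION variables, but the sample script hardcodes its S3 locations. If you want a single script to serve several tables, you can reference those variables inside the script instead; the two lines below show the substitution (everything else in the script stays the same):
LOCATION '${hiveconf:INPUT}'    -- raw table, instead of 's3://yourdata/raw/table'
LOCATION '${hiveconf:OUTPUT}'   -- Parquet table, instead of 's3://yourdata/parquet/table/'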