
Is it possible to have a Glue job re-classify a JSON table as Parquet instead of needing another crawler to crawl the Parquet files?

Current set up:

  1. JSON files in partitioned S3 bucket are crawled once a day
  2. Glue Job creates Parquet files in specified folder
  3. Run ANOTHER crawler to RECREATE the same table that was made in step 1

I have to believe that there is a way to convert the table classification without another crawler (but I've been burned by AWS before). Any help is much appreciated!

Why do you crawl (1st crawler) the json files every day? Are there schema changes? – ya2410

Yes, I'm tracking page views, so there will be days when certain pages aren't viewed/no actions associated with the page are tracked. – eleanore

Can you tell me why you want to use the same crawler. If it is pricing, then the billing is calculated based on crawler run time and not on the number of crawlers. – Prabhakar Reddy

It's not so much about using the same crawler as it is the redundancy. I already have a table; why wouldn't the Glue job just convert the file type of the table instead of creating new, unpartitioned files that would then need to be recrawled? (It should be noted, I'm not a developer and am fairly new to AWS/Glue, so this might be a basic misunderstanding on my part.) – eleanore

Hi @eleanore, in step 2, the parquet files are created in a different folder than the JSON, correct? – Saiful Rizal MDRamli

1 Answer


For convenience considerations, two crawlers are the way to go.

For cost considerations, a hacky solution would be:

  1. Get the JSON table's CREATE TABLE DDL from Athena using the SHOW CREATE TABLE <json_table>; command;

  2. In the CREATE TABLE DDL, replace the table name and switch the SerDe (and the STORED AS input/output formats) from JSON to Parquet. You don't need the other table properties from the original CREATE TABLE DDL except LOCATION.

  3. Execute the new CREATE TABLE DDL in Athena.

For example:

SHOW CREATE TABLE json_table;

Original DDL:

CREATE EXTERNAL TABLE `json_table`(
  `id` int, 
  `name` string)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
...
LOCATION
  's3://bucket_name/table_data'
...

New DDL:

CREATE EXTERNAL TABLE `parquet_table`(
  `id` int, 
  `name` string)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://bucket_name/table_data'

You can also do it the same way with the Glue API methods: get_table() > replace > create_table().
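A minimal sketch of that get_table() > replace > create_table() route, with the catalog edit pulled out as a pure function. The database/table names in the usage comment (mydb, json_table, parquet_table) are placeholders; the SerDe/format class names are the standard Hive Parquet ones, and the "classification" table parameter mirrors what a Glue crawler would set:

```python
def to_parquet_table(table, new_name):
    """Take the dict under the "Table" key of a Glue get_table() response
    and build a TableInput dict describing the same data as Parquet.

    get_table() returns extra read-only fields (CreateTime, DatabaseName,
    etc.) that create_table() rejects, so we copy only what we need.
    """
    sd = dict(table["StorageDescriptor"])  # shallow copy; columns/location kept as-is
    sd["InputFormat"] = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
    sd["OutputFormat"] = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"
    sd["SerdeInfo"] = {
        "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
    }
    return {
        "Name": new_name,
        "StorageDescriptor": sd,
        "PartitionKeys": table.get("PartitionKeys", []),
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},  # what a crawler would set
    }

# Usage against the real Glue API (assumes boto3 and AWS credentials):
# glue = boto3.client("glue")
# src = glue.get_table(DatabaseName="mydb", Name="json_table")["Table"]
# glue.create_table(DatabaseName="mydb",
#                   TableInput=to_parquet_table(src, "parquet_table"))
```

Note this copies the original LOCATION; if your Glue job writes the Parquet files to a different folder (see the comment thread above), point sd["Location"] at that folder instead.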

Notice: if you want to run it periodically, you would need to wrap it in a script and schedule it with another scheduler (crontab, etc.) to run after the first crawler finishes.