I've set up a small EMR cluster with Hive/Presto installed, I want to query files on S3 and import them to Postgres on RDS.
To run queries on S3 and save the results in a table in postgres I've done the following:
- Started a 3 node EMR cluster from the AWS console.
- Manually SSH into the Master node to create an EXTERNAL table in hive, looking at an S3 bucket.
Manually SSH into each of the 3 nodes and add a new catalog file:
/etc/presto/conf.dist/catalog/postgres.propertieswith the following contents
connector.name=postgresql connection-url=jdbc:postgresql://ip-to-postgres:5432/database connection-user=<user> connection-password=<pass>and edited this file
/etc/presto/conf.dist/config.propertiesadding
datasources=postgresql,hiveRestart presto by running the following manually on all 3 nodes
sudo restart presto-server
This setup seems to work well.
In my application, there are multiple databases created dynamically. It seems that those configuration/catalog changes need to be made for each database and the server needs to be restarted to see the new config changes.
Is there a proper way for my application (using boto or other methods) to update configurations by
- Adding a new catalog file in all nodes /etc/presto/conf.dist/catalog/ for each new database
- Adding a new entry in all nodes in /etc/presto/conf.dist/config.properties
- Gracefully restarting presto across the whole cluster (ideally when it becomes idle, but that's not a major concern.
