How to add connectors to presto on Amazon EMR

Question

I've set up a small EMR cluster with Hive/Presto installed, I want to query files on S3 and import them to Postgres on RDS.

To run queries on S3 and save the results in a table in postgres I've done the following:

Started a 3 node EMR cluster from the AWS console.
Manually SSH into the Master node to create an EXTERNAL table in hive, looking at an S3 bucket.

Manually SSH into each of the 3 nodes and add a new catalog file:

/etc/presto/conf.dist/catalog/postgres.properties

with the following contents

connector.name=postgresql
connection-url=jdbc:postgresql://ip-to-postgres:5432/database
connection-user=<user>
connection-password=<pass>

and edited this file

/etc/presto/conf.dist/config.properties

adding

datasources=postgresql,hive

Restart presto by running the following manually on all 3 nodes
```
sudo restart presto-server
```

This setup seems to work well.

In my application, there are multiple databases created dynamically. It seems that those configuration/catalog changes need to be made for each database and the server needs to be restarted to see the new config changes.

Is there a proper way for my application (using boto or other methods) to update configurations by

Adding a new catalog file in all nodes /etc/presto/conf.dist/catalog/ for each new database
Adding a new entry in all nodes in /etc/presto/conf.dist/config.properties
Gracefully restarting presto across the whole cluster (ideally when it becomes idle, but that's not a major concern.

if my answer has worked for you, please mark it as the "correct answer" i.stack.imgur.com/QpogP.png — mostafazh

Root G Root G · Accepted Answer · 2018-09-28T06:47:40

I believe you can run a simple bash script to achieve what you want. There is no other way except creating a new cluster with --configurations parameter where you provide the desired configurations. You can run below script from the master node.

#!/bin/sh

# "cluster_nodes.txt" with private IP address of each node.
aws emr list-instances --cluster-id <cluster-id> --instance-states RUNNING | grep PrivateIpAddress | sed 's/"PrivateIpAddress"://' | sed 's/\"//g' | awk '{gsub(/^[ \t]+|[ \t]+$/,""); print;}' > cluster_nodes.txt

# For each IP connect with ssh and configure.
while IFS='' read -r line || [[ -n "$line" ]]; do
    echo "Connecting $line"
    scp -i <PEM file> postgres.properties hadoop@$line:/tmp;
    ssh -i <PEM file> hadoop@$line "sudo mv /tmp/postgres.properties /etc/presto/conf/catalog;sudo chown presto:presto /etc/presto/conf/catalog/postgres.properties;sudo chmod 644 /etc/presto/conf/catalog/postgres.properties; sudo restart presto-server";
done < cluster_nodes.txt

How to add connectors to presto on Amazon EMR

3 Answers