0
votes

I am using the Glue console, not a dev endpoint. The Glue job is able to access the Glue catalog and table using the code below:

datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="glue-db", table_name="countries")
print("Table Schema: {}".format(datasource0.schema()))
# show() prints the rows itself and returns None, so call it directly
datasource0.show()

Now I want to get the metadata for all tables in the Glue database glue-db. I could not find a function for this in the awsglue.context API, so I am using boto3.

import boto3

client = boto3.client('glue', 'eu-central-1')
responseGetDatabases = client.get_databases()
databaseList = responseGetDatabases['DatabaseList']
for databaseDict in databaseList:
    databaseName = databaseDict['Name']
    print("databaseName:{}".format(databaseName))
    # A single get_tables call returns at most one page of results
    responseGetTables = client.get_tables(DatabaseName=databaseName, MaxResults=123)
    print("responseGetTables:{}".format(responseGetTables))
    tableList = responseGetTables['TableList']
    for tableDict in tableList:
        tableName = tableDict['Name']
        print("-- tableName:{}".format(tableName))
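
A single get_tables call returns only one page of results, so a database with many tables may need pagination. The following is a minimal sketch, not part of the original job: list_table_names is a hypothetical helper name, and it assumes the client passed in is a boto3 Glue client (boto3 exposes a get_tables paginator through get_paginator).

```python
def list_table_names(glue_client, database_name):
    """Return the names of all tables in one Glue database, following pagination."""
    # get_paginator("get_tables") handles the NextToken bookkeeping for us
    paginator = glue_client.get_paginator("get_tables")
    names = []
    for page in paginator.paginate(DatabaseName=database_name):
        for table in page["TableList"]:
            names.append(table["Name"])
    return names
```

It could be called as list_table_names(boto3.client('glue', 'eu-central-1'), 'glue-db'). It needs the same network access to the Glue API as the loop above, so it does not by itself avoid the timeout described below.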

The boto3 code runs fine in a Lambda function, but fails inside the Glue ETL job with the following error:

botocore.vendored.requests.exceptions.ConnectTimeout: HTTPSConnectionPool(host='glue.eu-central-1.amazonaws.com', port=443): Max retries exceeded with url: / (Caused by ConnectTimeoutError(, 'Connection to glue.eu-central-1.amazonaws.com timed out. (connect timeout=60)'))

The problem seems to be in the environment configuration. The Glue VPC has two subnets: a private subnet with an S3 endpoint for Glue, which allows inbound traffic from the RDS security group, and a public subnet with a NAT gateway. The private subnet is reachable through the NAT gateway. I am not sure what I am missing here.
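
One way to narrow this down is to test plain TCP reachability of the Glue endpoint from inside the job, independently of boto3. This is only a diagnostic sketch under assumptions: can_connect is a hypothetical helper, it assumes a Python 3 runtime, and the host name is the one from the error message above.

```python
import socket


def can_connect(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures
        return False
```

For example, print(can_connect('glue.eu-central-1.amazonaws.com')) at the start of the job script. A False result would suggest that the subnet the job runs in has no working route to the Glue API, rather than a problem in the boto3 call itself.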

3
Can you verify that port 443 is open to the internet, since the call needs to reach other services to work? Also try passing the region when creating the client with client = boto3.client('glue'). - Prabhakar Reddy
Yes, port 443 is open and I have added the region, but it still times out after 15 minutes and the job fails. The security group of the Glue VPC looks like this. I have allowed almost all traffic for testing purposes, but still cannot connect to Glue using boto3:
All TCP | TCP | 0 - 65535 | 0.0.0.0/0
All TCP | TCP | 0 - 65535 | self reference
PostgreSQL | TCP | 5432 | Sg of the peered VPC
All traffic | All | All | Self referencing group
All traffic | All | All | Sg of the peered VPC
- Uraish
Hi @Uraish, did you find a solution for this? I'm facing the same problem and would very much appreciate some help. Thanks. - crojassoto
Same issue here. @Uraish, if you found a solution, please update. Thanks! - Ryan Fisher

3 Answers

2
votes

Try using a proxy while creating the boto3 client:

import boto3
from botocore.config import Config
from pyhocon import ConfigFactory

service_name = 'glue'
region = 'eu-central-1'  # assumption: the region used in the question

# Proxy settings that Glue deploys alongside the job
default = ConfigFactory.parse_file('glue-default.conf')
override = ConfigFactory.parse_file('glue-override.conf')

# Values in the override file take precedence over the defaults
host = override.get('proxy.host', default.get('proxy.host'))
port = override.get('proxy.port', default.get('proxy.port'))

config = Config()

if host and port:
    config.proxies = {'https': '{}:{}'.format(host, port)}

client = boto3.Session(region_name=region).client(service_name=service_name, config=config)

glue-default.conf and glue-override.conf are deployed to the cluster by Glue, into the /tmp directory, during spark-submit.

I had a similar issue and solved it the same way, using the public library from Glue: s3://aws-glue-assets-eu-central-1/scripts/lib/utils.py

0
votes

Can you please try creating the boto3 client as below, specifying the region explicitly?

client = boto3.client('glue', region_name='eu-central-1')

0
votes

I had a similar problem when running this command from a Glue Python Shell job.

So I created a VPC endpoint (VPC -> Endpoints) for the Glue service (service name: "com.amazonaws.eu-west-1.glue") and assigned it to the same subnet and security group as the Glue connection used by the Glue Python Shell job.
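
The same endpoint can presumably be created from code instead of the console. This is a rough sketch under assumptions: create_glue_interface_endpoint is a hypothetical helper, the VPC, subnet and security group IDs are placeholders you would supply, and enabling private DNS assumes the VPC has DNS support and DNS hostnames turned on.

```python
def create_glue_interface_endpoint(ec2_client, vpc_id, subnet_ids,
                                   security_group_ids, region="eu-west-1"):
    """Create an interface VPC endpoint for the Glue API and return its ID."""
    response = ec2_client.create_vpc_endpoint(
        VpcEndpointType="Interface",
        VpcId=vpc_id,
        # Service name pattern taken from the answer above
        ServiceName="com.amazonaws.{}.glue".format(region),
        # Same subnet and security group as the Glue connection
        SubnetIds=list(subnet_ids),
        SecurityGroupIds=list(security_group_ids),
        # Assumption: private DNS lets the default Glue host name resolve to the endpoint
        PrivateDnsEnabled=True,
    )
    return response["VpcEndpoint"]["VpcEndpointId"]
```

Here ec2_client would be boto3.client('ec2', region_name=...) run from somewhere with permission to create endpoints; for the job in the question the region would be eu-central-1.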