I tried to run a Glue Python shell job with external dependencies (such as pyathena and pytest) added as Python egg/whl files in the job configuration, as described in the AWS documentation: https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html.

The Glue job is configured in a VPC with no internet access, and its execution resulted in the errors below.

WARNING: The directory '/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.

WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fd05d6a4f28>, 'Connection to pypi.org timed out. (connect timeout=15)')'

I also tried modifying my Python script with the code below:

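import importlib
import os
import site

from setuptools.command import easy_install

# Installation directory taken from the Glue job's environment
install_path = os.environ['GLUE_INSTALLATION']

libraries = ["pyathena"]

# Install each library from PyPI at runtime (this needs internet access)
for lib in libraries:
    easy_install.main(["--install-dir", install_path, lib])

# Reload site so the newly installed packages become importable
importlib.reload(site)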

Executing the above code produced the following error:

Download error on https://pypi.org/simple/pyathena/: [Errno 99] Cannot assign requested address -- Some packages may not be found! Couldn't find index page for 'pyathena' (maybe misspelled?)

Can I have a sample code snippet that generates an egg/whl file for external Python packages and adds them to a Glue Python shell job?

Can you try the steps at helicaltech.com/external-python-libraries-aws-glue-job, and also make sure that the VPC you are using has an S3 endpoint? - Prabhakar Reddy
@PrabhakarReddy - thank you, the steps in the above link helped resolve the issue. - Praful
I have added an answer. Please mark it as answered if it helped. - Prabhakar Reddy

1 Answer

Refer to this doc, which has detailed steps for packaging a Python library. Also make sure that your VPC has an S3 endpoint, as traffic will not leave the AWS network when you run a Glue job inside a VPC.
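
Since the question asks for a sample snippet, here is a minimal sketch of one way to prepare the wheel files. It assumes you run it on a machine that does have internet access (not inside the Glue job), that each package publishes a wheel compatible with the Glue runtime, and that you upload the resulting files to S3 yourself. The helper names, the bucket and the prefix are hypothetical and only for illustration.

```python
import subprocess
import sys
from pathlib import Path


def build_pip_download_cmd(packages, dest_dir):
    """Return the pip command that downloads wheels for `packages` into `dest_dir`."""
    return [
        sys.executable, "-m", "pip", "download",
        "--dest", str(dest_dir),
        "--only-binary=:all:",  # accept wheels only, no source distributions
        *packages,
    ]


def download_wheels(packages, dest_dir):
    """Run pip download; call this on a machine WITH internet access."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_pip_download_cmd(packages, dest_dir), check=True)


def extra_py_files_value(wheel_dir, bucket, prefix):
    """Build a comma-separated list of S3 URIs, one per wheel in `wheel_dir`.

    `bucket` and `prefix` are placeholders for wherever you upload the wheels.
    """
    wheels = sorted(Path(wheel_dir).glob("*.whl"))
    return ",".join(f"s3://{bucket}/{prefix}/{w.name}" for w in wheels)
```

For example, you could call download_wheels(["pyathena"], "wheels"), upload the contents of the wheels directory to your bucket, and then use the string returned by extra_py_files_value("wheels", "my-glue-bucket", "libs") as the job's Python library path (the --extra-py-files argument, if I read the AWS docs correctly). Note that pip download also fetches the package's dependencies, so you may end up with more wheels than you strictly need.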