
I have written a Python job that uses SQLAlchemy to query a SQL Server database. However, when using external libraries with AWS Glue you are required to wrap those libraries in an egg file. This causes an issue with the SQLAlchemy package, as it relies on pyodbc, which to my understanding cannot be packaged in an egg because it has native dependencies of its own.
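For reference, the connection code looks roughly like this (a minimal sketch; the server, database, credentials and table name are placeholders):

# Minimal sketch of the current approach; requires pyodbc and a Microsoft ODBC driver on the host.
from sqlalchemy import create_engine, text

engine = create_engine(
    "mssql+pyodbc://user:password@myserver.example.com/MyDatabase"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

with engine.connect() as conn:
    rows = conn.execute(text("SELECT TOP 10 * FROM some_table")).fetchall()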

I have tried to find a way of connecting to a SQL Server database from a Python Glue job, but the closest advice I have found so far suggests writing a Spark job instead, which isn't appropriate here.

Does anyone have experience connecting to SQL Server from a Python 3 Glue job? If so, could you share an example snippet of code and the packages used?


1 Answer


Yes, I managed to do something similar by bundling the dependencies, including transitive ones.

Follow these steps:

1 - Create a script that zips all of the code and its dependencies into a zip file and uploads it to S3:

python3 -m pip install -r requirements.txt --target custom_directory
python3 -m zipapp custom_directory/
mv custom_directory.pyz custom_directory.zip

Upload this zip file to S3 instead of an egg or wheel.
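The upload itself can be done with boto3, for example (the bucket and key names here are placeholders):

# Upload the bundled zip to S3 so the Glue job can reference it.
import boto3

s3 = boto3.client("s3")
s3.upload_file("custom_directory.zip", "my-glue-artifacts-bucket", "libs/custom_directory.zip")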

2 - Create a driver program that executes the Python source program we just zipped in step 1.

import sys

# The first argument is the path to the bundled zip created in step 1.
if len(sys.argv) == 1:
    raise SyntaxError("Please provide a module to load.")

# Adding the zip to sys.path lets Python import the bundled modules directly from it.
sys.path.append(sys.argv[1])

from your_module import your_function

# Run the bundled entry point and use its return value as the exit code.
sys.exit(your_function())
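Here your_module and your_function are just stand-ins for whatever entry point lives inside the zip from step 1; a trivial placeholder would look like this, with the actual SQLAlchemy query logic going inside your_function:

# your_module.py -- bundled inside custom_directory.zip in step 1.
def your_function():
    # Put the real work here (e.g. the SQLAlchemy query against SQL Server).
    print("Running bundled job code")
    return 0  # exit code consumed by sys.exit() in the driver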

3 - You can then submit your job using:

spark-submit --py-files custom_directory.zip your_program.py

See:

How can you bundle all your python code into a single zip file?

I can't seem to get --py-files on Spark to work