3
votes

I'm trying to run a code that uses psycopg2 to manipulate a Redshift instance. I have tried by importing a wheel file as I see they are supported in Glue python jobs. I see the library is installed in the endpoint when running but then I get an error:

import boto3
import psycopg2
Aug 4, 2020, 1:24:06 PM Pending execution
Processing ./glue-python-libs-92ng4pcb/psycopg2-2.8.5-cp36-cp36m-win_amd64.whl
Installing collected packages: psycopg2
Successfully installed psycopg2-2.8.5
Considering file without prefix as a python extra file s3://gluelibraries/boto3.zip
WARNING: The directory '/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.

2020-08-04T13:24:44.831+02:00
Traceback (most recent call last):
  File "/tmp/runscript.py", line 123, in <module>
    runpy.run_path(temp_file_path, run_name='__main__')
  File "/usr/local/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/local/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/glue-python-scripts-1t08aq9n/postloading.py", line 6, in <module>
  File "/glue/lib/installation/psycopg2/__init__.py", line 51, in <module>
    from psycopg2._psycopg import (                     # noqa
ModuleNotFoundError: No module named 'psycopg2._psycopg'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/runscript.py", line 142, in <module>
    raise e_type(e_value).with_traceback(new_stack)
  File "/tmp/glue-python-scripts-1t08aq9n/postloading.py", line 6, in <module>
  File "/glue/lib/installation/psycopg2/__init__.py", line 51, in <module>
    from psycopg2._psycopg import (                     # noqa
ModuleNotFoundError: No module named 'psycopg2._psycopg'

Theoretically Glue jobs in python (contrary to pyspark jobs) should support non pure python libraries

6

6 Answers

1
votes

I have faced the similar issue with psycopg2 package. It is to do with the compatibility with Python runtime that is accessing the psycopg2 module.

Follow this thread . Hope you'll have your solution. Using psycopg2 with Lambda to Update Redshift (Python)

1
votes

based on https://stackoverflow.com/a/58305654/4725074

Install psycopg2-binary into a directory and zip up the contents of that directory:

mkdir psycopg2-binary
cd psycopg2-binary
pip install psycopg2-binary -t  .
# in case using python3:
# python3 -m pip install --system psycopg2-binary -t  .
zip -r9 psycopg2.zip *

I then copied psycopg2.zip to an S3 bucket and add it as an extra Python library under "Python library path" in the Glue Spark job.

I then launched the job with the following script to verify if psycopg2 is present (the zip file will be downloaded by Glue into the directory in which the Job script is located)

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
import sys
import os
import zipfile

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

zip_ref = zipfile.ZipFile('./psycopg2.zip', 'r')
print os.listdir('.')
zip_ref.extractall('/tmp/packages')
zip_ref.close()
sys.path.insert(0, '/tmp/packages')

import psycopg2
print(psycopg2.__version__)

job.commit()

This worked for me.

0
votes

Instead of psycopg2, try using pg8000 which is easy to install and it doesn't have c dependencies. Also, it is used by amazon in most of their internal projects.

0
votes

After having tried with pg8000 with a Python endpoint I got the following error:

Traceback (most recent call last):
  File "/tmp/runscript.py", line 123, in <module>
    runpy.run_path(temp_file_path, run_name='__main__')
  File "/usr/local/lib/python3.6/runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "/usr/local/lib/python3.6/runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/tmp/glue-python-scripts-j7khvbvv/postloading.py", line 7, in <module>
ModuleNotFoundError: No module named 'pg8000'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/runscript.py", line 142, in <module>
    raise e_type(e_value).with_traceback(new_stack)
  File "/tmp/glue-python-scripts-j7khvbvv/postloading.py", line 7, in <module>
ModuleNotFoundError: No module named 'pg8000'

when using a pyspark endpoint I don't have this problem with the pg8000

0
votes

Now with Glue Version 2 you can pass in python libraries as parameters to Glue Jobs. I used pyscopg2-binary instead of pyscopg2 and it worked for me. Then in the code I did import psycopg2.

--additional-python-modules

0
votes

I download wheel from this link with name psycopg2-2.9.1-cp36-cp36m-linux_x86_64.whl and problem was solved. Thanks