13
votes

I'm encountering a difficulty when using NLTK corpora (in particular stop words) in AWS Lambda. I'm aware that the corpora need to be downloaded; I have done so with nltk.download('stopwords') and included them in the zip file used to upload the Lambda modules, under nltk_data/corpora/stopwords.

The usage in the code is as follows:

from nltk.corpus import stopwords
stopwords = stopwords.words('english')
nltk.data.path.append("/nltk_data")

This returns the following error from the Lambda log output

module initialization error: 
**********************************************************************
  Resource u'corpora/stopwords' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/home/sbx_user1062/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/nltk_data'
**********************************************************************

I have also tried to load the data directly by including

nltk.data.load("/nltk_data/corpora/stopwords/english")

Which yields a different error below

module initialization error: Could not determine format for file:///stopwords/english based on its file
extension; use the "format" argument to specify the format explicitly.

It's possible that NLTK has a problem loading the data from the Lambda zip and needs it stored externally, say on S3, but that seems a bit strange.

Does anyone know where I could be going wrong?

4
Try stopwords = nltk.corpus.stopwords.words('english'). In your block of code it looks like NLTK searches the nltk_data folder for corpora/stopwords, but the intervening / is missing, so it might just be a directory-path issue. Not 100% sure this will work, because I cannot see your system or the files, but it otherwise looks OK. – sconfluentus
Use the full path, e.g. /home/sbx_user1062/nltk_data, and try: stackoverflow.com/a/22987374/610569 – alvas
If nothing works, see magically_find_nltk_data() from stackoverflow.com/questions/36382937/… – alvas
Thanks, I will try those suggestions and report back. One problem is that the user name, e.g. 'sbx_user1062', is different every time the AWS Lambda script is run, which may mean I need to locate the files at a static source on S3 unless I can find another way to specify the execution directory. – Praxis
Move the directory into a static asset and fix the nltk_data directory. A simple AWS Lambda service might not be sufficient; you would need some "AWS Simple Storage". – alvas

4 Answers

15
votes

I had the same problem before, but I solved it using an environment variable.

  1. Run nltk.download() locally and copy the resulting nltk_data folder into the root folder of your AWS Lambda application. (The folder should be called "nltk_data".)
  2. In the user interface of your Lambda function (in the AWS console), add an environment variable NLTK_DATA with the value ./nltk_data. (Screenshot: configuring the NLTK_DATA environment variable in the Lambda console.)
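The same variable can also be set from code, as long as that happens before nltk.data is first imported, since NLTK reads NLTK_DATA at import time. A minimal sketch, assuming the corpus was packaged under ./nltk_data as in step 1 (LAMBDA_TASK_ROOT is the variable the Lambda runtime sets to the unzipped bundle root, /var/task):

```python
import os

# Point NLTK at the bundled corpora. LAMBDA_TASK_ROOT is set by the
# Lambda runtime to the directory the deployment zip was unpacked into
# (/var/task); fall back to the working directory for local testing.
task_root = os.environ.get("LAMBDA_TASK_ROOT", os.getcwd())
os.environ["NLTK_DATA"] = os.path.join(task_root, "nltk_data")

# Only import nltk AFTER NLTK_DATA is set, since nltk.data reads the
# variable at import time. Uncomment once the corpus is bundled:
# from nltk.corpus import stopwords
# STOPWORDS = set(stopwords.words("english"))
```
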
13
votes

Another solution is to use Lambda's ephemeral storage at /tmp

So, you would have something like this:

import nltk
import json
from nltk.tokenize import word_tokenize

nltk.data.path.append("/tmp")

nltk.download("punkt", download_dir = "/tmp")

At runtime punkt will download to the /tmp directory, which is writable. However, this likely isn't a great solution if you have huge concurrency.
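To avoid re-downloading on every invocation, the download can be guarded so it only runs on a cold start. A sketch of that pattern (the handler name and event shape are hypothetical):

```python
import os
import nltk

NLTK_DIR = "/tmp/nltk_data"
nltk.data.path.append(NLTK_DIR)

def ensure_punkt():
    # /tmp persists across invocations within the same warm execution
    # environment, so the download only happens on a cold start.
    if not os.path.exists(os.path.join(NLTK_DIR, "tokenizers", "punkt")):
        nltk.download("punkt", download_dir=NLTK_DIR)

def handler(event, context):
    ensure_punkt()
    from nltk.tokenize import word_tokenize
    return {"tokens": word_tokenize(event.get("text", ""))}
```
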

2
votes

On AWS Lambda, you need to include the nltk Python package in your Lambda bundle and modify nltk's data.py:

path += [
    str('/usr/share/nltk_data'),
    str('/usr/local/share/nltk_data'),
    str('/usr/lib/nltk_data'),
    str('/usr/local/lib/nltk_data')
]

to

path += [
    str('/var/task/nltk_data')
    #str('/usr/share/nltk_data'),
    #str('/usr/local/share/nltk_data'),
    #str('/usr/lib/nltk_data'),
    #str('/usr/local/lib/nltk_data')
]

You can't include the entire nltk_data directory, so delete all the zip files; if you only need stopwords, keep nltk_data/corpora/stopwords and drop the rest, and if you need tokenizers, keep nltk_data/tokenizers/punkt. To download the nltk_data folder, use an Anaconda Jupyter notebook and run

nltk.download()

or

https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/stopwords.zip

or

python -m nltk.downloader all
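Whichever way you download it, a quick stdlib check can confirm the pruned bundle still contains the files described above before you zip it (check_bundle is a hypothetical helper, not part of nltk):

```python
import os

def check_bundle(root="nltk_data"):
    # Expected pruned layout, as described above:
    #   nltk_data/corpora/stopwords/english
    #   nltk_data/tokenizers/punkt   (only if you need tokenizers)
    required = {
        "stopwords": os.path.join(root, "corpora", "stopwords", "english"),
        "punkt": os.path.join(root, "tokenizers", "punkt"),
    }
    return {name: os.path.exists(p) for name, p in required.items()}
```
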
1
votes

If your stopwords corpus is under /nltk_data (relative to the filesystem root, not your home directory), you need to tell NLTK where to look before you try to access the corpus:

import nltk
from nltk.corpus import stopwords

nltk.data.path.append("/nltk_data")
stopwords = stopwords.words('english')