Paths in AWS lambda with Python NLTK

Question

I'm encountering problems with the NLTK package in AWS Lambda. However I believe the issue is related more to path configurations in Lambda being incorrect. NLTK is having trouble finding data libraries that are stored locally and not part of the module install. Many of the solutions listed on SO are simple path configs as can be found here but I think this issue related to pathing in Lambda:

How to config nltk data directory from code?

What to download in order to make nltk.tokenize.word_tokenize work?

Should also mention this also relates to a previous question I posted here Using NLTK corpora with AWS Lambda functions in Python

but the issue seems more general and so I have elected to redefine the question as it relates how to correctly configure path environments in Lambda to work with modules that require external libraries like NLTK. NLTK stores a lot of it's data in a nltk_data folder locally, however including this folder within the lambda zip for upload, it doesn't seem to find it.

Also included in the Lambda func zip file are the following files and dirs:

\nltk_data\taggers\averaged_perceptron_tagger\averaged_perceptron_tagger.pickle
\nltk_data\tokenizers\punkt\english.pickle
\nltk_data\tokenizers\punkt\PY3\english.pickle

From the following site, it seems that var/task/ is the folder in which the lambda function executes and I have tried including this path to no avail. https://alestic.com/2014/11/aws-lambda-environment/

From the docs it also seems that there are a number of environment variables that can be used however I'm not sure how to include them in a python script (coming from windows, not linux) http://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html

Hoping to throw this up here incase anyone has experience in configuring Lambda paths. I haven't seen a lot of questions relating to this specific issue despite searching, so hoping it could be useful to resolve this

Code is here

import nltk
import pymysql.cursors
import re
import rds_config
import logging
from boto_conn import botoConn
from warnings import filterwarnings
from nltk import word_tokenize

nltk.data.path.append("/nltk_data/tokenizers/punkt")
nltk.data.path.append("/nltk_data/taggers/averaged_perceptron_tagger")

logger = logging.getLogger()

logger.setLevel(logging.INFO)

rds_host = "nodexrd2.cw7jbiq3uokf.ap-southeast-2.rds.amazonaws.com"
name = rds_config.db_username
password = rds_config.db_password
db_name = rds_config.db_name

filterwarnings("ignore", category=pymysql.Warning)


def parse():

    tknzr = word_tokenize

    stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself','yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself',
                 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that','these','those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do',
                 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of','at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above',
                 'below','to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then','once', 'here','there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
                 'some', 'such','no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will','just', 'don', 'should','now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', 'couldn', 'didn', 'doesn', 'hadn', 'hasn',
                 'haven', 'isn', 'ma','mightn', 'mustn', 'needn', 'shan', 'shouldn', 'wasn', 'weren', 'won', 'wouldn']

    s3file = botoConn(None, 1).getvalue()
    db = pymysql.connect(rds_host, user=name, passwd=password, db=db_name, connect_timeout=5, charset='utf8mb4', cursorclass=pymysql.cursors.DictCursor)
    lines = s3file.split('\n')

    for line in lines:

        tkn = tknzr(line)
        tagged = nltk.pos_tag(tkn)

        excl = ['the', 'and', 'of', 'at', 'what', 'to', 'it', 'a', 'of', 'i', 's', 't', 'is', 'I\'m', 'Im', 'U', 'RT', 'RTs', 'its']  # Arg

        x = [i for i in tagged if i[0] not in stopwords]
        x = [i for i in x if i[0] not in excl]
        x = [i for i in x if len(i[0]) > 1]
        x = [i for i in x if 'https' not in i[0]]
        x = [i for i in x if i[1] == 'NNP' or i[1] == 'VB' or i[1] == 'NN']
        x = [(re.sub(r'[^A-Za-z0-9]+' + '()', r'', i[0])) for i in x]
        sql_dat_a, sql_dat = [], []

Output log is here:

   **********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/sbx_user1067/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/nltk_data/tokenizers/punkt'
    - '/nltk_data/taggers/averaged_perceptron_tagger'
    - u''
**********************************************************************: LookupError
Traceback (most recent call last):
  File "/var/task/Tweetscrape_Timer.py", line 27, in schedule
    server()
  File "/var/task/Tweetscrape_Timer.py", line 14, in server
    parse()
  File "/var/task/parse_to_SQL.py", line 91, in parse
    tkn = tknzr(line)
  File "/var/task/nltk/tokenize/__init__.py", line 109, in word_tokenize
    return [token for sent in sent_tokenize(text, language)
  File "/var/task/nltk/tokenize/__init__.py", line 93, in sent_tokenize
    tokenizer = load('tokenizers/punkt/{0}.pickle'.format(language))
  File "/var/task/nltk/data.py", line 808, in load
    opened_resource = _open(resource_url)
  File "/var/task/nltk/data.py", line 926, in _open
    return find(path_, path + ['']).open()
  File "/var/task/nltk/data.py", line 648, in find
    raise LookupError(resource_not_found)
LookupError: 
**********************************************************************
  Resource u'tokenizers/punkt/english.pickle' not found.  Please
  use the NLTK Downloader to obtain the resource:  >>>
  nltk.download()
  Searched in:
    - '/home/sbx_user1067/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/nltk_data/tokenizers/punkt'
    - '/nltk_data/taggers/averaged_perceptron_tagger'
    - u''
**********************************************************************

Question to you, why are you using lambda instances with Windows? Wouldn't it be easier to deploy a linux server for the lambda instances? — alvas
BTW, does amazon lambda allow you to deploy a windows instance? — alvas
If your amazon server instance is linux, simply use export ... in your .bashrc file per user or /etc/profile for all users, see serverfault.com/questions/491585/… — alvas
Did you try the magically_find_nltk_data()? What is the output for that function for different users? Maybe that'll give us clues on how to set the path correctly. — alvas

Matt Fortier Matt Fortier · Accepted Answer · 2017-02-22T14:40:55

Seems your current Python code runs from /var/task. I would suggest trying (haven't tried myself):

nltk.data.path.append("/var/task/nltk_data")

Paths in AWS lambda with Python NLTK

3 Answers