4
votes

Under Linux, I have set the env var $NLTK_DATA ('/home/user/data/nltk'), and the test below works as expected:

>>> from nltk.corpus import brown
>>> brown.words()
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

But when running another Python script, I get:

LookupError: 
**********************************************************************
Resource u'tokenizers/punkt/english.pickle' not found.  Please
use the NLTK Downloader to obtain the resource:  >>>
nltk.download()
Searched in:
- '/home/user/nltk_data'
- '/usr/share/nltk_data'
- '/usr/local/share/nltk_data'
- '/usr/lib/nltk_data'
- '/usr/local/lib/nltk_data'
- u''

As we can see, NLTK doesn't add $NLTK_DATA to its search path. After appending the NLTK_DATA dir manually:

nltk.data.path.append("/NLTK_DATA_DIR")

the script runs as expected. The question is:

How can I make NLTK add $NLTK_DATA to its search path automatically?

2
Since the nltk_data directory is static, why do you need to find the path automatically? – alvas
By default, NLTK automatically finds the nltk_data directory in these directories: '/home/user/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data' – alvas
I have told nltk.download() to download the data to $NLTK_DATA; if I don't add $NLTK_DATA to the search path, it seems the downloaded data can't be used by scripts (though the simple interactive test above works). – Alex Luya
As long as nltk.path.append() is in your script, you don't really need to worry about the data directory in os.environ. See the updated answer. – alvas
NLTK does add the paths from NLTK_DATA to the data search path. The problem must be caused by something else: the second Python script does not inherit the environment? The path is incorrect (or possibly a relative path, which only works in some directories)? Who knows. But the solution (for anyone having the same issue) is to examine and fix the setting of the variable. NLTK Does The Right Thing when it sees the variable. – alexis
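
A quick way to check this from inside the failing script is to print the environment variable and NLTK's search path before loading any resource. A minimal diagnostic sketch (adjust it to whatever your script actually imports):

import os
import nltk

# If this prints None, the variable was not exported to the process
# running the script, which would explain the LookupError above.
print(os.environ.get('NLTK_DATA'))

# When NLTK_DATA is set and visible, its directory should appear here.
print(nltk.data.path)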

2 Answers

7
votes

If you don't want to set $NLTK_DATA before running your scripts, you can do it within the Python scripts with:

import nltk
nltk.path.append('/home/alvas/some_path/nltk_data/')

E.g. let's move nltk_data to a non-standard path so that NLTK won't find it automatically:

alvas@ubi:~$ ls nltk_data/
chunkers  corpora  grammars  help  misc  models  stemmers  taggers  tokenizers
alvas@ubi:~$ mkdir some_path
alvas@ubi:~$ mv nltk_data/ some_path/
alvas@ubi:~$ ls nltk_data/
ls: cannot access nltk_data/: No such file or directory
alvas@ubi:~$ ls some_path/nltk_data/
chunkers  corpora  grammars  help  misc  models  stemmers  taggers  tokenizers

Now, we use the nltk.path.append() hack:

alvas@ubi:~$ python
>>> import os
>>> import nltk
>>> nltk.path.append('/home/alvas/some_path/nltk_data/')
>>> nltk.pos_tag('this is a foo bar'.split())
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]
>>> nltk.data
<module 'nltk.data' from '/usr/local/lib/python2.7/dist-packages/nltk/data.pyc'>
>>> nltk.data.path
['/home/alvas/some_path/nltk_data/', '/home/alvas/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']
>>> exit()

Let's move it back and see whether it works:

alvas@ubi:~$ ls nltk_data
ls: cannot access nltk_data: No such file or directory
alvas@ubi:~$ mv some_path/nltk_data/ .
alvas@ubi:~$ python
>>> import nltk
>>> nltk.data.path
['/home/alvas/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']
>>> nltk.pos_tag('this is a foo bar'.split())
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]

If you really really want to find nltk_data automagically, use something like:

import scandir  # third-party scandir: a faster drop-in replacement for os.walk
import os, sys
import time

import nltk

def find(name, path):
    # Walk the filesystem from `path` and return the first directory
    # whose path ends with `name`.
    for root, dirs, files in scandir.walk(path):
        if root.endswith(name):
            return root

def find_nltk_data():
    # Search the whole filesystem for an nltk_data directory and cache
    # the result so later runs can skip the slow walk.
    start = time.time()
    path_to_nltk_data = find('nltk_data', '/')
    print >> sys.stderr, 'Finding nltk_data took', time.time() - start
    print >> sys.stderr, 'nltk_data at', path_to_nltk_data
    with open('where_is_nltk_data.txt', 'w') as fout:
        fout.write(path_to_nltk_data)
    return path_to_nltk_data

def magically_find_nltk_data():
    # Reuse the cached location if it is still valid; otherwise search again.
    if os.path.exists('where_is_nltk_data.txt'):
        with open('where_is_nltk_data.txt') as fin:
            path_to_nltk_data = fin.read().strip()
        if os.path.exists(path_to_nltk_data):
            nltk.data.path.append(path_to_nltk_data)
        else:
            nltk.data.path.append(find_nltk_data())
    else:
        path_to_nltk_data = find_nltk_data()
        nltk.data.path.append(path_to_nltk_data)


magically_find_nltk_data()
print nltk.pos_tag('this is a foo bar'.split())

Let's call that Python script test.py:

alvas@ubi:~$ ls nltk_data/
chunkers  corpora  grammars  help  misc  models  stemmers  taggers  tokenizers
alvas@ubi:~$ python test.py
Finding nltk_data took 4.27330780029
nltk_data at /home/alvas/nltk_data
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]
alvas@ubi:~$ mv nltk_data/ some_path/
alvas@ubi:~$ python test.py
Finding nltk_data took 4.75850391388
nltk_data at /home/alvas/some_path/nltk_data
[('this', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'JJ'), ('bar', 'NN')]
0
votes

If you want to install the NLTK data in a conda environment and don't want to specify the data location in every script or export the environment variable, do the following (a scripted version is sketched after this list):

  1. Activate your desired conda environment.
  2. Print sys.prefix within your conda environment, and copy this path (let's say /home/dickens/envs/nltk_env).
  3. Run nltk.download() within the conda environment, select your desired packages, and use the path from above with /share/nltk_data appended as the download location. E.g. in our case, it becomes /home/dickens/envs/nltk_env/share/nltk_data.
  4. You are now good to go!
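
The same steps can be scripted. A minimal sketch, assuming a recent NLTK that searches sys.prefix-based locations; 'punkt' is only an example package id:

import os
import sys
import nltk

# Step 2: sys.prefix points at the active conda environment,
# e.g. /home/dickens/envs/nltk_env
# Step 3: download into <prefix>/share/nltk_data
download_dir = os.path.join(sys.prefix, 'share', 'nltk_data')
nltk.download('punkt', download_dir=download_dir)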