0
votes

After reading a lot of posts, I still have problems making a custom corpus in NLTK. I have a text file of tagged sentences, where each item in a sentence has the form word/tag. I want to train a tagger on this data. I'm trying to use an NLTK package called train-tagger, which trains various types of taggers. Two questions: 1) Can train-tagger use a text file as input, or only an NLTK corpus object? 2) If it only uses a corpus, how do I create one from a text file? I tried the following code to create a corpus...

import nltk
from nltk.corpus import PlaintextCorpusReader
corpus_root = './'
newcorpus = PlaintextCorpusReader(corpus_root, '.*')
print(newcorpus.raw('IOBHarrisonsTraining.txt'))  # this is my tagged text file

This seems to work, but I can't find the output. A corpus is supposed to be created either in the folder this code runs from or in nltk_data/corpora, but I found nothing in either place. Is there some method in the corpus module that is supposed to save the 'newcorpus' I created, so that it could then be used as input to train-tagger? Also, should I be using a tagged-sentence file as input to PlaintextCorpusReader, or just an untagged set of sentences?

1

1 Answer

3
votes

NLTK corpora are stored as collections of text files. The NLTK corpus functionality is organized as a number of reader classes for various file formats. You'll find them in nltk.corpus.reader. The nltk.corpus module also provides shortcuts to the corpora in nltk_data; they just launch the appropriate reader class with the path to the corpus files. But new corpora don't magically appear as objects in nltk.corpus; to read your own, instantiate the appropriate reader class. For example, in nltk/corpus/__init__.py you'll find the following:

gutenberg = LazyCorpusLoader(
    'gutenberg', PlaintextCorpusReader, r'(?!\.).*\.txt')

PlaintextCorpusReader is imported from nltk.corpus.reader, where all the other reader classes can be found. You can use it directly without relying on LazyCorpusLoader; check the documentation.
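To make the "reader, not writer" point concrete, here is a minimal sketch of instantiating PlaintextCorpusReader directly. The temporary directory and the file name notes.txt are made up for illustration; the reader only gives a view over files that already exist and does not create or save anything itself.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical corpus directory with a single plain-text file.
corpus_root = tempfile.mkdtemp()
text = "A tiny corpus."
with open(os.path.join(corpus_root, "notes.txt"), "w") as f:
    f.write(text)

# The reader is just a view over the matching files under corpus_root.
reader = PlaintextCorpusReader(corpus_root, r".*\.txt")

fileids = reader.fileids()                # files matched by the pattern
raw_text = reader.raw("notes.txt")        # file contents as one string
words = list(reader.words("notes.txt"))   # tokens from the default word tokenizer

print(fileids)
print(words)
```

So the "corpus" is the directory of text files itself; the reader object lives only in memory.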

But indeed, there's no support for writing corpora in the various supported formats. To do that, find a corpus that's similar to yours and emulate its format. You can then use the same reader class to read your corpus. (For example, inspecting the Brown corpus files reveals that they consist of space-separated tokens in the format word/tag.)
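For a word/tag file like the one in the question, a reader that fits is TaggedCorpusReader from nltk.corpus.reader. The following is a rough sketch, not a definitive recipe: the sample sentences and the file name sample_tagged.txt are invented, it assumes one sentence per line with "/" as the separator (the reader's defaults), and it uses a UnigramTagger only as a simple stand-in for whichever tagger you actually want to train.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader
from nltk.tag import UnigramTagger

# Hypothetical sample data: word/tag tokens, one sentence per line.
sample = (
    "The/DT patient/NN has/VBZ fever/NN ./.\n"
    "The/DT fever/NN persists/VBZ ./.\n"
)

corpus_root = tempfile.mkdtemp()
with open(os.path.join(corpus_root, "sample_tagged.txt"), "w") as f:
    f.write(sample)

# By default the reader splits sentences on newlines, tokens on
# whitespace, and word from tag on "/".
reader = TaggedCorpusReader(corpus_root, r".*\.txt")
tagged_sents = list(reader.tagged_sents())

# Train a simple tagger on the (word, tag) sentences and try it out.
tagger = UnigramTagger(tagged_sents)
result = tagger.tag(["The", "patient", "persists"])

print(tagged_sents[0])
print(result)
```

The list returned by tagged_sents() is the usual training input for NLTK's own tagger classes, so no separate "saved corpus" is needed for that step.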