5
votes

I am new to NLTK (http://www.nltk.org/), and to Python for that matter. I want to use the NLTK Python library with the BNC as the corpus. I do not believe this corpus is distributed through the NLTK Data download. Is there a way to import the BNC so that NLTK can use it? If so, how? I did find a class called BNCCorpusReader but have no idea how to use it. I was also able to download the corpus from the BNC site (http://ota.ox.ac.uk/desc/2554).

http://www.nltk.org/api/nltk.corpus.reader.html?highlight=bnc#nltk.corpus.reader.BNCCorpusReader.word

Update

I have tried entrophy's suggestion, but get the following error:

raise IOError('No such file or directory: %r' % _path)
OSError: No such file or directory: 'C:\\Users\\jason\\Documents\\NetBeansProjects\\DemoCollocations\\src\\Corpora\\bnc\\A\\A0\\A00.xml'

My code to read in the corpora:

bnc_reader = BNCCorpusReader(root="Corpora/bnc", fileids=r'[A-K]/\w*/\w*\.xml')

And my corpus is located in: C:\Users\jason\Documents\NetBeansProjects\DemoCollocations\src\Corpora\bnc\

1
What is your purpose? Do you have to use NLTK? I don't know Python very well and have never used NLTK, but I processed the BNC in Java using Stanford CoreNLP. My goal was to build a clean corpus to parse in order to get dependencies between pairs of words. So, starting from the BNC's XML files, I reconstructed every sentence with an XML parser, then processed each sentence with CoreNLP. If your goal is simply to import the corpus, honestly I can't help you, but as a last resort you can still convert your XML corpus to plain text, pass it to Python, and process it string by string. - s.dallapalma
@s.dallapalma Hello. I am not required to use NLTK, but I do need some library that can find collocations of words. I looked at Stanford CoreNLP but was told it does not have collocation functionality. - jason

1 Answer

7
votes

For examples of using NLTK for collocation extraction, take a look at the following guide: A how-to guide by NLTK on collocation extraction

As far as the BNC corpus reader is concerned, all the information you need is in the documentation.

from nltk.corpus.reader.bnc import BNCCorpusReader
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Instantiate the reader like this
bnc_reader = BNCCorpusReader(root="/path/to/BNC/Texts", fileids=r'[A-K]/\w*/\w*\.xml')

# Say you want to extract all bigram collocations and rank them by raw frequency;
# this is what you would do (score_ngrams returns the pairs ordered by score).
# Again, see the NLTK guide on collocations for more examples.

list_of_fileids = ['A/A0/A00.xml', 'A/A0/A01.xml']
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(bnc_reader.words(fileids=list_of_fileids))
scored = finder.score_ngrams(bigram_measures.raw_freq)

print(scored)

The output of that will look something like this:

[(('of', 'the'), 0.004902261167963723), (('in', 'the'), 0.003554139346773699), 
 (('.', 'The'), 0.0034315828175746064), (('Gift', 'Aid'), 0.0019609044671854894), 
 ((',', 'and'), 0.0018996262025859428), (('for', 'the'), 0.0018383479379863962), ... ]
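Notice that punctuation pairs such as ('.', 'The') show up near the top. If you only care about word-word bigrams, here is a minimal sketch in plain Python that filters a scored list and keeps the top n. The helper name top_word_bigrams and the str.isalpha() test are my own assumptions, not part of NLTK, and the sample list is just the first few entries from the output above.

```python
def top_word_bigrams(scored, n=10):
    """Keep only bigrams whose two tokens are purely alphabetic,
    then return the n highest-scoring (bigram, score) pairs.

    Hypothetical helper for illustration; not an NLTK function.
    """
    word_only = [
        (bigram, score)
        for bigram, score in scored
        if all(token.isalpha() for token in bigram)
    ]
    # Sort explicitly by score, highest first, so the helper also
    # works on lists that are not already ordered.
    return sorted(word_only, key=lambda pair: pair[1], reverse=True)[:n]


# Sample data copied from the output shown above.
sample_scored = [
    (('of', 'the'), 0.004902261167963723),
    (('in', 'the'), 0.003554139346773699),
    (('.', 'The'), 0.0034315828175746064),
    (('Gift', 'Aid'), 0.0019609044671854894),
    ((',', 'and'), 0.0018996262025859428),
    (('for', 'the'), 0.0018383479379863962),
]

for bigram, score in top_word_bigrams(sample_scored, n=3):
    print(bigram, score)
```

With real data you would pass in the list returned by finder.score_ngrams(...) instead of sample_scored.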

And if you wanted just the bigrams, without their scores, sorted alphabetically, you could try something like this:

sorted_bigrams = sorted(bigram for bigram, score in scored)

print(sorted_bigrams)

Resulting:

[('!', 'If'), ('!', 'Of'), ('!', 'Once'), ('!', 'Particularly'), ('!', 'Raising'), 
 ('!', 'YOU'), ('!', '‘'), ('&', 'Ealing'), ('&', 'Public'), ('&', 'Surrey'), 
 ('&', 'TRAINING'), ("'", 'SPONSORED'), ("'S", 'HOME'), ("'S", 'SERVICE'), ... ]
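Regarding the "No such file or directory" error in the update: I can't tell from here what is wrong, but it may help to check, before building the reader, that the root really contains the files you expect. The sketch below uses only the standard library. The function name check_bnc_root and the choice of A/A0/A00.xml as the probe file are assumptions for illustration. Note that my example points root at the Texts directory of the BNC download, so one possibility (a guess, not confirmed) is that your XML files sit one level deeper than Corpora/bnc.

```python
from pathlib import Path


def check_bnc_root(root, sample_fileid="A/A0/A00.xml"):
    """Return the absolute form of root if sample_fileid exists under it.

    Hypothetical helper: raises FileNotFoundError with the full path
    that was tried, so a wrong root is easy to spot.
    """
    root_path = Path(root).expanduser().resolve()
    sample = root_path / sample_fileid
    if not sample.is_file():
        raise FileNotFoundError(
            "Expected a BNC file at %s; check that root points at the "
            "directory that directly contains the A, B, C, ... folders" % sample
        )
    return str(root_path)


# Usage (assumes NLTK is installed and the path below is adjusted):
# root = check_bnc_root("/path/to/BNC/Texts")
# bnc_reader = BNCCorpusReader(root=root, fileids=r'[A-K]/\w*/\w*\.xml')
```

Passing an absolute root also avoids depending on whatever working directory the IDE happens to run the script from.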