
I am trying to build an NLTK corpus using information from Pubmed.

In my first attempt, I successfully built a small function to retrieve the data using the Entrez package, wrote each retrieved article title (a list of strings) to its own file, and created a corpus using each 'fileid' (i.e. the filename) as the category of the document.

Now I have to step up the game: each document of the corpus needs to have a title, an abstract and the respective MeSH terms, and the MeSH terms need to define the categories of the corpus, instead of the categories being defined by the name of the document.

So now I have a few problems that I don't really see how to resolve. I will start backwards, as it may be easier to understand:

1) My corpus reader goes as follows:

from nltk.corpus.reader import CategorizedPlaintextCorpusReader

corpus = CategorizedPlaintextCorpusReader(corpus_root, file_pattern,
                                          cat_pattern=r'(\w+)_.*\.txt')

where 'cat_pattern' is a regular expression for extracting the category names from the fileids arguments, i.e. the names of the files. But now I need to get these categories from the MeSH terms within the file, which leads to the next problem:

2) the Pubmed query retrieves a batch of information, from which I first took only the titles (the ones I used to generate the corpus), but now I need to retrieve the titles, the abstracts, and the MeSH terms.

The pseudo-code would be something as follows:

papers = [] 

'Papers' is a list containing all the articles retrieved, as well as all the information related to the articles. Let's say I then have:

out = []
for paper in papers:
    out.append(paper['TI'])
    out.append(paper['AB'])
    out.append(paper['MH'])

That last part of the list, the ['MH'] (the list of MeSH terms), is what I need to use to define the categories of the corpus.
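A minimal sketch of how each record could be written out as a corpus file while keeping the MeSH terms aside for categorization. The `papers` records below are made up for illustration; real ones would come from the Entrez query, and the `pubmed_corpus` directory name and `paper_N.txt` naming scheme are assumptions:

```python
import os

# Hypothetical sample records; real ones come from the Pubmed query.
papers = [
    {'TI': 'A title', 'AB': 'An abstract.', 'MH': ['Humans', 'Neoplasms']},
    {'TI': 'Another title', 'AB': 'Another abstract.', 'MH': ['Mice']},
]

corpus_root = 'pubmed_corpus'
os.makedirs(corpus_root, exist_ok=True)

mesh_terms = {}  # fileid -> list of MeSH terms, kept for categorization
for i, paper in enumerate(papers):
    fileid = 'paper_%d.txt' % i
    with open(os.path.join(corpus_root, fileid), 'w') as f:
        # The title and abstract become the text of the document.
        f.write(paper['TI'] + '\n\n' + paper['AB'] + '\n')
    mesh_terms[fileid] = paper['MH']
```

The MeSH terms deliberately stay out of the document text here, since they are labels rather than content to classify on.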

3) After I build the corpus with these 3 pieces of information, to be able to use my classifier, I also need to somehow transform all this batch of information into this:

# X: a list or iterable of raw strings, each representing a document.
X = [corpus.raw(fileid) for fileid in corpus.fileids()]

Remember that "fileid" is each of the documents of the corpus. This is the code from the first prototype, where each document was composed of a single string (the title); now each "document" must have the title (['TI']), the abstract (['AB']), and the MeSH terms (['MH'] - though I'm not sure about this last one, because of the next code:)

# y: a list or iterable of labels, which will be label encoded.
y = [corpus.categories(fileid)[0] for fileid in corpus.fileids()]

Here, the y represents the labels, which were the filenames, and now I need the labels to be the MeSH terms.

I don't know how to make this happen, or even whether it is possible as far as my knowledge goes. Yes, I did search and read the NLTK book tutorials and many pages on how to build NLTK corpora, but nothing seems to fit what I intend to do.

This may be very confusing, but let me know if you need me to rephrase anything. Any help would be appreciated :)

I had already seen that, and unfortunately it doesn’t help, but thank you - tanmald

1 Answer


The cat_pattern argument is convenient when the category can be determined from the filename, but in your case it is not enough. Fortunately there are other ways to specify file categories. Write an ad hoc program to figure out the categories of each file in your corpus, and store the results in a file corpus_categories (or whatever; just make sure the name doesn't match the corpus filename pattern, so that you can place it in the corpus folder). Then initialize your reader with cat_file="corpus_categories" instead of cat_pattern.

corpus = CategorizedPlaintextCorpusReader(
                           corpus_root, 
                           file_pattern,
                           cat_file="corpus_categories")

Each line in the category file should have a filename and its category or categories, separated by spaces. Here's a snippet from cats.txt for the reuters corpus:

training/196 earn
training/197 oat corn grain
training/198 money-supply
training/199 acq
training/200 soy-meal soy-oil soybean meal-feed oilseed veg-oil
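Generating such a file from the MeSH terms could look like the sketch below. Note one catch: the category file is space-separated, but many MeSH terms contain spaces (e.g. "Middle Aged"), so spaces inside a term have to be replaced with some other character; underscores are an assumption here, and the `mesh_terms` mapping is hypothetical:

```python
# Hypothetical mapping from fileid to MeSH terms; in practice this would
# be collected while writing out the corpus files.
mesh_terms = {
    'paper_0.txt': ['Humans', 'Middle Aged', 'Neoplasms/genetics'],
    'paper_1.txt': ['Mice'],
}

lines = []
for fileid, terms in sorted(mesh_terms.items()):
    # Spaces separate categories in the file, so spaces inside a
    # MeSH term are replaced with underscores (by assumption).
    cats = ' '.join(term.replace(' ', '_') for term in terms)
    lines.append('%s %s' % (fileid, cats))

# Write the category file into the corpus folder's parent or the corpus
# folder itself, as long as the name doesn't match the file pattern.
with open('corpus_categories', 'w') as f:
    f.write('\n'.join(lines) + '\n')
```

With this file in place, `corpus.categories(fileid)` returns the full list of MeSH-derived categories for each document.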

I've no idea what you're trying to accomplish in your question 3, but it seems pretty clear that it's unrelated to creating the categorized corpus (and hence you should ask it as a separate question).