R {tm} package: large PCorpus causing "Error in readSingleKey..."

Question

I am having a strange issue that I can't make heads or tails of. Any help is greatly appreciated. I'm running RStudio version 0.99.879 on a MacBook with OS X El Capitan 10.11.3.

Basically, I'm trying to make a DocumentTermMatrix for the massive Wikipedia corpus. I parsed the Wikipedia XML dump with https://github.com/idio/json-wikipedia and then wrote an R script to write each wiki article into its own text file, where the name of the file is the name of the wiki article and the content of the file is the content of the article.

So, I have 13 million relatively small .txt files now and I've been trying to incrementally see how many I can make into a DTM.

When I run this with 13,000 articles in the folder called wiki-test it works great:

start <- Sys.time()
w13k <- PCorpus(DirSource("json-wikipedia/wiki-test"),
                         dbControl = list(dbName = "wiki-test.db", dbType = "DB1"))
Sys.time() - start
#Time difference of 28.637 secs for 13k files

start <- Sys.time()
dtm13k <- DocumentTermMatrix(w13k,
                        control = list(
                          # keep only terms that appear in at least 1 and at most 50% of the documents
                          bounds = list(global = c(1, length(w13k) * 0.5)),
                          weighting = function(x) weightTfIdf(x, normalize = TRUE),
                          removePunctuation = TRUE,
                          removeNumbers = TRUE,
                          stopwords = TRUE,
                          stemming = TRUE
                        ))
Sys.time() - start
#Time difference of 1.288023 mins for 13k files

However, when I try to run it with a larger folder of about 210,000 files, it takes 10 mins and then creates a corpus that doesn't seem to work:

start <- Sys.time()
w210k <- PCorpus(DirSource("json-wikipedia/wiki-articles"),
            dbControl = list(dbName = "wiki-test-210k.db", dbType = "DB1"))
Sys.time() - start
#Time difference 10 mins

As you can see, basically the same code, except wiki-articles contains a lot more files.

Then, when I try to run the same DocumentTermMatrix call, I get:

Error in readSingleKey(con, map, key) : 
  unable to obtain value for key '-_-_-album-.txt'
In addition: Warning message:
In readKeyMap(filecon) : NAs introduced by coercion to integer range

(-_-_-album.txt is the first article in the corpus)

I was perplexed, so I tried to look at the offending document with w210k[[1]] and got the same error. Then I tried looking at other documents in the corpus (which I know worked in the previous corpus, so it's not the input documents) like w210k[[100000]] and I get the same error:

Error in readSingleKey(con, map, key) : 
  unable to obtain value for key 'John_Arthur_Roebuck.txt'
In addition: Warning messages:
1: In readKeyMap(filecon) : NAs introduced by coercion to integer range
2: In readKeyMap(filecon) : NAs introduced by coercion to integer range
3: In readKeyMap(filecon) : NAs introduced by coercion to integer range

Notice that it picks up the different (and correct) id for each article, but for some reason can't "obtain a value" for that key. I can't seem to find anything about this in the documentation and when I google that error message, nothing seems to come up.
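
As a hedged aside on the warning text: "NAs introduced by coercion to integer range" is the message base R emits when a number too large for a 32-bit integer is coerced with as.integer(). The sketch below only reproduces that message in isolation; the offset value is hypothetical, and the idea that something similar happens inside readKeyMap once the single DB1 file grows large is an assumption, not something the post confirms.

```r
# Largest value an R integer can hold (2^31 - 1).
max_int <- .Machine$integer.max

# Hypothetical byte offset just past that limit; adding the double 1
# promotes the result to a double, so no overflow happens here.
offset <- max_int + 1

# Coercing the double back to integer cannot succeed, so R warns and
# would return NA. tryCatch captures the warning message instead.
msg <- tryCatch(as.integer(offset),
                warning = function(w) conditionMessage(w))
print(msg)
```

If that reading is right, comparing file.size() of the .db file against .Machine$integer.max would be one way to check whether the database has crossed the limit.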

Any idea what might be causing this? Is there some kind of limit on size? I feel like I've seen people make much bigger DTMs than this with no problem. If size is the problem, any ideas on how to make this huge DTM? I'm going to end up with about 13,000,000 documents and, hopefully, something like 500,000 terms.

Thanks in advance.

We would need some reproducible data to test. Can you provide a few of the sample text files that you created from the Wikipedia XML dump, or share the R parser script that creates the text files? — amitkb3

seth127 · Accepted Answer · 2016-03-14T19:13:59

Well, I sort of figured it out. If I use dbType = "RDS" in the PCorpus call instead of "DB1", it works. This is nice, but now I'm a bit confused as to what the different dbTypes really mean. I have a separate question about that, if anyone has an answer: https://stackguides.com/questions/35995536/filehashoption-in-r-tm-and-filehash-packages-what-are-the-different-types
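
Here is a minimal sketch of that change, assuming the tm and filehash packages are installed. The folder names and the two sample files are hypothetical stand-ins for the real wiki-articles folder. As far as I understand the filehash formats, "DB1" keeps all keys and values in one file, while "RDS" treats dbName as a directory and stores each object in its own file.

```r
library(tm)  # PCorpus() also needs the 'filehash' package installed

# Hypothetical stand-in data: a throwaway folder with two small .txt files.
src_dir <- file.path(tempdir(), "wiki-sketch")
dir.create(src_dir, showWarnings = FALSE)
writeLines("First small article about albums.", file.path(src_dir, "a.txt"))
writeLines("Second small article about bridges.", file.path(src_dir, "b.txt"))

# With dbType = "RDS", dbName names a directory rather than a single file.
db_dir <- file.path(tempdir(), "wiki-sketch-rds")

corp <- PCorpus(DirSource(src_dir),
                dbControl = list(dbName = db_dir, dbType = "RDS"))

# Fetching a document reads it back from the on-disk database.
doc <- corp[[1]]
print(content(doc))
```

The same dbControl list should drop into the w210k call from the question, with dbName pointing at a directory of your choosing.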

Thanks.
