I am having a strange issue that I can't make heads or tails of. Any help is greatly appreciated. I'm running RStudio Version 0.99.879 on a MacBook with OS X El Capitan 10.11.3.
Basically I'm trying to make a DocumentTermMatrix for the massive Wikipedia corpus. I've parsed the Wikipedia XML dump with https://github.com/idio/json-wikipedia and then wrote an R script that writes each article into its own text file, where the file name is the article title and the file content is the article text.
So I now have 13 million relatively small .txt files, and I've been incrementally testing how many of them I can turn into a DTM.
When I run this with 13,000 articles in the folder called wiki-test it works great:
library(tm)  # PCorpus also needs the filehash package to be installed
start <- Sys.time()
w13k <- PCorpus(DirSource("json-wikipedia/wiki-test"),
    dbControl = list(dbName = "wiki-test.db", dbType = "DB1"))
Sys.time() - start
#Time difference of 28.637 secs for 13k files
start <- Sys.time()
dtm13k <- DocumentTermMatrix(w13k,
    control = list(
        # keep only terms that appear in at least 1 and at most 50% of the documents
        bounds = list(global = c(1, length(w13k) * .5)),
        weighting = function(x) weightTfIdf(x, normalize = TRUE),
        removePunctuation = TRUE,
        removeNumbers = TRUE,
        stopwords = TRUE,
        stemming = TRUE
    ))
Sys.time() - start
#Time difference of 1.288023 mins for 13k files
However, when I try to run it with a larger folder of about 210,000 files, it takes 10 mins and then creates a corpus that doesn't seem to work:
start <- Sys.time()
w210k <- PCorpus(DirSource("json-wikipedia/wiki-articles"),
    dbControl = list(dbName = "wiki-test-210k.db", dbType = "DB1"))
Sys.time() - start
#Time difference 10 mins
As you can see, it's basically the same code, except that wiki-articles contains a lot more files.
Then when I try to run the same DocumentTermMatrix call, I get:
Error in readSingleKey(con, map, key) :
unable to obtain value for key '-_-_-album-.txt'
In addition: Warning message:
In readKeyMap(filecon) : NAs introduced by coercion to integer range
(-_-_-album.txt is the first article in the corpus)
I was perplexed, so I tried to look at the offending document with w210k[[1]] and got the same error. Then I tried looking at other documents in the corpus, such as w210k[[100000]] (these worked in the previous corpus, so the input documents are not the problem), and I got the same error:
Error in readSingleKey(con, map, key) :
unable to obtain value for key 'John_Arthur_Roebuck.txt'
In addition: Warning messages:
1: In readKeyMap(filecon) : NAs introduced by coercion to integer range
2: In readKeyMap(filecon) : NAs introduced by coercion to integer range
3: In readKeyMap(filecon) : NAs introduced by coercion to integer range
Notice that it picks up the different (and correct) id for each article, but for some reason can't "obtain a value" for that key. I can't seem to find anything about this in the documentation and when I google that error message, nothing seems to come up.
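One guess I can offer, and it is only an assumption on my part rather than something I found in the filehash documentation: the warning text is exactly what base R prints when a number larger than .Machine$integer.max is passed to as.integer(). If the DB1 file has grown past that many bytes, positions inside it might be coming back as NA. A minimal sketch to check this, where exceeds_int_range is a hypothetical helper and the path is the dbName from the PCorpus call above:

```r
# Base R only. exceeds_int_range() is a hypothetical helper,
# not part of tm or filehash.
int_max <- .Machine$integer.max        # largest value an R integer can hold

# Coercing a double above int_max yields NA, with the warning
# "NAs introduced by coercion to integer range".
overflowed <- suppressWarnings(as.integer(int_max + 1))

# TRUE only when a known (non-NA) byte count is above the integer range.
exceeds_int_range <- function(bytes) !is.na(bytes) && bytes > int_max

db_path <- "wiki-test-210k.db"         # dbName passed to PCorpus above
db_size <- file.info(db_path)$size     # NA if the file does not exist

if (exceeds_int_range(db_size)) {
  message("database file is larger than .Machine$integer.max bytes")
} else {
  message("database file is missing or within integer range")
}
```

If the file does turn out to be larger than that limit, it would at least be consistent with the warning, though it would not prove that this is what readKeyMap trips over.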
Any idea what might be causing this? Is there some kind of limit on size? I feel like I've seen people make much bigger DTMs than this with no problem. If size is the problem, any ideas on how to make this huge DTM? I'm going to end up with about 13,000,000 documents and, hopefully, something like 500,000 terms.
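For a sense of scale, here is a rough back-of-the-envelope calculation for the target matrix. The document and term counts are the ones above. The figure of 100 distinct terms per document is purely an assumption for illustration, as is the 16 bytes per non-zero entry (two integer indices plus one double, as in a triplet-style sparse layout):

```r
n_docs  <- 13e6                     # documents, from the question
n_terms <- 5e5                      # terms, from the question

# Dense storage: one 8-byte double per cell.
cells    <- n_docs * n_terms        # 6.5e12 cells
dense_tb <- cells * 8 / 1e12        # terabytes

# Assumption for illustration only: about 100 distinct terms per document.
terms_per_doc <- 100
nonzero <- n_docs * terms_per_doc   # 1.3e9 non-zero entries

# Triplet-style storage: two 4-byte integer indices plus one 8-byte value.
sparse_gb <- nonzero * (4 + 4 + 8) / 1e9

cat("dense:", dense_tb, "TB; sparse (assumed density):", sparse_gb, "GB\n")
```

So a dense matrix is out of the question, and even a sparse one would be sizeable under that assumed density.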
Thanks in advance.