I have a set of texts that I am processing for the Johns Hopkins Capstone project, using quanteda as my core text-handling library. I work on a MacBook Pro at home and a Windows 7 64-bit machine at work. My R script runs correctly on the Mac but fails on the Win7 system. I cannot share the source text due to course restrictions, but I hope the information below is enough to get some assistance.

My current approach is to create a corpus from the text file, tokenize it without ngrams, and then run ngrams on the tokenized result. My code snippets are below.
I pull the data from the text file with the following:
con <- file(src_file, open = "rb")
tmp <- scan(con,
            what = "complex",
            nlines = maxlines,
            # I break the large file into portions via the stp variable,
            # which is a number from 1 to 10
            skip = maxlines * stp,
            fileEncoding = "UTF-16LE",
            encoding = "ASCII",
            blank.lines.skip = TRUE,
            na.strings = "",
            skipNul = TRUE)
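Since the same data behaves differently on the two platforms, I suspect encoding. A diagnostic check I could run on the scanned lines (base R only, so it works on 3.2.3; `tmp` is the object from the scan() call above) would be:

```r
# iconv() returns NA for any element that is not valid in the source
# encoding; such strings tend to surface later as NAs in stringi-based
# operations on Windows
bad <- which(is.na(iconv(tmp, from = "UTF-8", to = "UTF-8")))
length(bad)   # 0 if every scanned line is valid UTF-8
```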
The tmp object is saved to an Rds file.
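For completeness, the save/reload step looks like this (the file name here is a placeholder, not my actual file):

```r
# Persist the scanned lines and reload them in a later session;
# "smp_all.rds" is a hypothetical name for the Rds file mentioned above
saveRDS(tmp, "smp_all.rds")
smp_all <- readRDS("smp_all.rds")
```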
The following functions wrap the quanteda elements:
make_corpus <- function(lines) {
    lines <- toLower(lines)
    cat("Making corpus\n")
    t <- corpus(lines)
}

tok_corpus <- function(lines) {
    lines <- toLower(lines)
    cat("Making vocabulary\n")
    t <- tokenize(lines,
                  what = "word",
                  verbose = TRUE,
                  simplify = FALSE,
                  removeSeparators = TRUE,
                  removeNumbers = TRUE,
                  removePunct = TRUE,
                  removeTwitter = TRUE)
}

make_ngrams <- function(lines) {
    lines <- toLower(lines)
    cat("Making ngrams\n")
    t <- ngrams(lines, n = c(1, 4))
}
The following code runs the pipeline from file to ngrams:
cat("...creating corpus\n")
# smp_all has been read from previously mentioned Rds file
voc_corpus <- make_corpus(smp_all)
cat("...going to make vocabulary\n")
vocab <- tok_corpus(voc_corpus)
cat("...going to make n_grams\n")
n_grams <- make_ngrams(vocab)
The following is the output from the script.
Removing working files...
Loading text files...
Read 37716 items
Read 28848 items
Read 12265 items
...Building smp_all
...creating corpus
Making corpus
Non-UTF-8 encoding (possibly) detected :ISO-8859-2.
...going to make vocabulary
Making vocabulary
Starting tokenization...
...tokenizing texts...total elapsed: 0.48 seconds.
...replacing names...total elapsed: 0 seconds.
Finished tokenizing and cleaning 66,565 texts.
...going to make n_grams
Making ngrams
Error in if (any(stringi::stri_detect_fixed(x, " ")) & concatenator != :
missing value where TRUE/FALSE needed
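If it helps, my reading of that error is that an NA token is reaching stri_detect_fixed(). A minimal base-R sketch of that failure mode (the flags vector stands in for what stri_detect_fixed() would return when one token is NA and none contains a space):

```r
# stri_detect_fixed() returns NA for an NA input; with no TRUE present,
# any() propagates the NA, and the if() condition then errors out
flags <- c(FALSE, FALSE, NA)
any(flags)   # NA
msg <- tryCatch(if (any(flags)) "has spaces" else "clean",
                error = function(e) conditionMessage(e))
msg          # "missing value where TRUE/FALSE needed"
```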
On my Mac, the "Making ngrams" step prints statistics on what was produced; on the Win7 machine I get the error above instead.
I am running this in the R console.
System information:
R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
quanteda version: 0.9.0-1 (2015-11-26)
Thanks in advance.