
I have a set of texts that I am processing for the Johns Hopkins Capstone project, using quanteda as my core text-handling library. I work on a MacBook Pro at home and a 64-bit Windows 7 machine at work. My R script appears to run correctly on the Mac but fails on the Win7 system. I cannot share the source text material due to course restrictions, but I hope the information below is enough to get some assistance. My current approach is to create a corpus from the text file, tokenize it without ngrams, and then run ngrams on the tokenized object. My code snippets are below.

I pull the data from the text file with the following:

con <- file(src_file, open="rb")
tmp <- scan(con,
            what = "complex",
            nlines = maxlines,
            # I break the large file into portions via the stp variable
            # which is a number from 1 to 10
            skip = maxlines * stp, 
            fileEncoding = "UTF-16LE",
            encoding = "ASCII",
            blank.lines.skip = TRUE,
            na.strings = "",
            skipNul = TRUE)

The tmp object is saved to an Rds file.
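The save/restore step is not shown in the snippets; it would look roughly like the following (the file name here is hypothetical):

```r
# Save the scanned chunk for later reuse (hypothetical file name)
saveRDS(tmp, file = "sample_chunk.Rds")

# ...and in the processing script, read it back
smp_all <- readRDS("sample_chunk.Rds")
```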

The following functions wrap the quanteda calls:

make_corpus <- function(lines) {
    lines <- toLower(lines)
    cat("Making corpus\n")
    t <- corpus(lines)
}

tok_corpus <- function(lines) {
    lines <- toLower(lines)
    cat("Making vocabulary\n")
    t <- tokenize(lines,
                   what = "word",
                   verbose = TRUE,
                   simplify = FALSE,
                   removeSeparators = TRUE,
                   removeNumbers = TRUE,
                   removePunct = TRUE,
                   removeTwitter = TRUE
              )
    }

make_ngrams <- function(lines) {
    lines <- toLower(lines)
    cat("Making ngrams\n")
    t <- ngrams(lines, n = c(1,4) )
}

The following code proceeds from the file to the ngrams:

cat("...creating corpus\n")
# smp_all has been read from previously mentioned Rds file
voc_corpus <- make_corpus(smp_all)

cat("...going to make vocabulary\n")
vocab <- tok_corpus(voc_corpus)

cat("...going to make n_grams\n")
n_grams <- make_ngrams(vocab)

The following is the output from the script.

Removing working files...
Loading text files...
Read 37716 items
Read 28848 items
Read 12265 items
...Building smp_all
...creating corpus
Making corpus
Non-UTF-8 encoding (possibly) detected  :ISO-8859-2.
...going to make vocabulary
Making vocabulary
Starting tokenization...
...tokenizing texts...total elapsed:  0.48 seconds.
...replacing names...total elapsed:  0 seconds.
Finished tokenizing and cleaning 66,565 texts.
...going to make n_grams
Making ngrams
Error in if (any(stringi::stri_detect_fixed(x, " ")) & concatenator !=  : 
  missing value where TRUE/FALSE needed

On my Mac, the "Making ngrams" step prints statistics on what was produced, but on Win7 I get the error above.

I am running this in the R console.

System information:

R version 3.2.3 (2015-12-10) -- "Wooden Christmas-Tree"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

quanteda version: 0.9.0-1 (2015-11-26)

Thanks in advance.

I suspect you have some illegal characters that are encoding OK in OS X but not on Windows 7. This could be something like the escaped hex codes for a smart quote or a dash. Please file an issue at github.com/kbenoit/quanteda/issues and provide data so I can reproduce the problem. SO is not for bug reports, and your problem needs to be reproducible for us to answer it here. See stackoverflow.com/help/how-to-ask for guidelines. (Ken Benoit)

Thank you, @KenBenoit. I am new here. I'll put the information on GitHub. (Harold Trammel)

No problem! First time for everyone. (Ken Benoit)
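The kind of encoding problem suspected in the comments can be checked with stringi, which quanteda already depends on. This is only a diagnostic sketch; it assumes `smp_all` is the character vector read from the Rds file:

```r
library(stringi)

# Flag elements that are not valid UTF-8 -- a likely source of
# platform-dependent behavior and of NA values after cleaning
bad <- which(!stri_enc_isutf8(smp_all))
smp_all[bad]
```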

1 Answer


I got it. I just ran into the same problem and figured out the cause.

Some of the sentences in your corpus are only a few characters long and probably useless. After your pre-processing, their contents are eliminated entirely, so those texts become NA. That NA is what triggers this error when the ngrams are made.

Solution: clean your corpus again and remove the NAs.
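A minimal sketch of that cleanup, assuming `vocab` is the tokenized object returned by `tok_corpus()` above. Note that this treats the tokenized object as a plain list, so with quanteda 0.9.x you may need to reapply its class before calling `ngrams()`:

```r
# Drop NA and empty tokens within each text,
# then drop texts that have no tokens left at all
vocab <- lapply(vocab, function(toks) toks[!is.na(toks) & nzchar(toks)])
vocab <- vocab[lengths(vocab) > 0]

n_grams <- make_ngrams(vocab)
```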