
I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to construct a Naive Bayes classifier that predicts the category (e.g. business, entertainment) of an article based on its text.

I'm attempting this with Quanteda and have the following code:

library(quanteda)

bbc_data <- read.csv('bbc_articles_labels_all.csv')
text <- textfile('bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)


# 80/20 split for training and test data
trainclass <- factor(c(bbc_data$category[1:1780], rep(NA, 445)))
testclass <- factor(c(bbc_data$category[1781:2225]))

bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbc_pred <- predict(bbcNb, testclass)

It seems to work smoothly until predict(), which gives:

Error in newdata %*% log.lik : 
  requires numeric/complex matrix/vector arguments

Can anyone provide insight on how to resolve this? I'm still getting the hang of text analysis and quanteda. Thank you!

Here is a link to the dataset.

You should provide enough data to make your example reproducible. It likely has something to do with your data but since we can't see that it's impossible to say for sure. - MrFlick
@MrFlick I've edited the post to include a direct link to the .csv file. Is there any additional information I should be providing? New to this! - Matt
newdata, the second argument to predict(), cannot be a factor (which testclass is); it needs to be a dfm. See ?predict.textmodel_NB_fitted. With predict(bbcNb) as your final line it should work - but it doesn't: apparently there is a bug in the predict method when k > 2. Please file an issue at github.com/kbenoit/quanteda/issues. - Ken Benoit
Thanks @KenBenoit! If I wanted to keep the newdata argument for predict(), what would be the proper way to convert testclass? Would it be testclass_dfm <- dfm(as.matrix(testclass))? Doing so gives the following error using predict(): "Error in newdata %*% log.lik : Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90" - Matt

1 Answer


As a stylistic note, you don't need to load the labels/classes/categories separately; the corpus already carries them as one of its docvars:

library("quanteda")

text <- readtext::readtext('bbc_articles_labels_all.csv', text_field='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, remove = stopwords("english"), stem = TRUE)

all_classes <- docvars(bbc_corpus)$category
# mask the test docs (1781:2225) with NA so they are excluded from training
trainclass <- factor(replace(all_classes, 1781:length(all_classes), NA))
bbcNb <- textmodel_nb(bbc_dfm, trainclass)
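The replace()-with-NA trick is what splits training from test here: documents whose class is NA are excluded from fitting, and predict() can then score them. A minimal base-R sketch of what that call does, using toy labels rather than the BBC data:

```r
# Toy labels standing in for docvars(bbc_corpus)$category
all_classes <- c("business", "sport", "business", "sport", "tech")

# Mask the last two documents as the "test" portion
trainclass <- factor(replace(all_classes, 4:5, NA))

trainclass        # business sport business <NA> <NA>
is.na(trainclass) # FALSE FALSE FALSE TRUE TRUE
```

Note that the factor levels are built only from the non-NA training labels, so a class that appears only in the test slice (here "tech") drops out of the level set.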

You don't even need to specify a second argument to predict. If you don't, it will use the whole original dfm:

bbc_pred <- predict(bbcNb)

Finally, you may want to assess the predictive accuracy. This will give you a summary of the model's performance on the test set:

library(caret)

confusionMatrix(
    bbc_pred$docs$predicted[1781:2225],
    all_classes[1781:2225]
)
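If you'd rather not pull in caret, overall accuracy is just the proportion of matching labels. A self-contained sketch with hypothetical predicted/actual vectors standing in for bbc_pred$docs$predicted[1781:2225] and all_classes[1781:2225]:

```r
# Hypothetical test-set predictions and true labels
predicted <- factor(c("business", "sport", "sport", "business"))
actual    <- factor(c("business", "sport", "business", "business"))

# Proportion of test documents classified correctly
accuracy <- mean(predicted == actual)
accuracy  # 0.75
```

confusionMatrix() gives you this plus per-class sensitivity/specificity, which is usually worth having for a multi-class problem.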

However, as @ken-benoit noted, there is a bug in quanteda which prevents prediction from working with more than two classes. Until that's fixed, you could binarize the classes with something like:

docvars(bbc_corpus)$category <- factor(
    ifelse(docvars(bbc_corpus)$category=='sport', 'sport', 'other')
)

(note that this must be done before you extract all_classes from bbc_corpus above).
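To see what the collapse does, here is the same ifelse() pattern on a toy vector (labels are illustrative, not from the dataset):

```r
categories <- c("sport", "business", "tech", "sport", "politics")

# Collapse everything that isn't "sport" into a single "other" class
binary <- factor(ifelse(categories == "sport", "sport", "other"))

binary          # sport other other sport other
levels(binary)  # "other" "sport"
```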