I have a dataset of BBC articles with two columns: 'category' and 'text'. I need to build a Naive Bayes classifier that predicts the category (e.g. business, entertainment) of an article based on its text.
I'm attempting this with Quanteda and have the following code:
library(quanteda)
bbc_data <- read.csv('bbc_articles_labels_all.csv')
text <- textfile('bbc_articles_labels_all.csv', textField='text')
bbc_corpus <- corpus(text)
bbc_dfm <- dfm(bbc_corpus, ignoredFeatures = stopwords("english"), stem=TRUE)
# 80/20 split for training and test data
trainclass <- factor(c(bbc_data$category[1:1780], rep(NA, 445)))
testclass <- factor(c(bbc_data$category[1781:2225]))
bbcNb <- textmodel_NB(bbc_dfm, trainclass)
bbc_pred <- predict(bbcNb, testclass)
It seems to work smoothly until predict(), which gives:
Error in newdata %*% log.lik :
requires numeric/complex matrix/vector arguments
Can anyone provide insight on how to resolve this? I'm still getting the hang of text analysis and quanteda. Thank you!
The second argument to predict(), newdata, cannot be a factor, which testclass is; instead it needs to be a dfm. See ??predict.textmodel_NB_fitted. If your final line is predict(bbcNb), it should work, but it doesn't. Apparently there is a bug in the predict method when k > 2. Please file an issue at github.com/kbenoit/quanteda/issues. - Ken Benoit

Regarding the newdata argument for predict(), what would be the proper way to convert testclass? Would it be testclass_dfm <- dfm(as.matrix(testclass))? Doing so gives the following error using predict(): "Error in newdata %*% log.lik : Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90" - Matt
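As the first comment says, newdata has to be a dfm of the test documents, not the vector of test labels. Below is a minimal sketch of that idea. It rests on assumptions that are not in the question: it uses a more recent quanteda API (tokens(), dfm_match(), and textmodel_nb() from the separate quanteda.textmodels package) instead of the older textfile() / textmodel_NB() calls, and it swaps the CSV for a tiny made-up data frame with the same two columns so that it runs on its own. With the real data, the index ranges would be 1:1780 and 1781:2225.

```r
library(quanteda)
library(quanteda.textmodels)  # assumption: textmodel_nb() lives here in recent releases

# Hypothetical stand-in for bbc_articles_labels_all.csv (same columns, invented rows)
bbc_data <- data.frame(
  category = c("business", "business", "sport", "sport", "business", "sport"),
  text = c("shares rose as the bank reported profits",
           "the market fell after the company cut profits",
           "the striker scored twice in the cup match",
           "the team won the league match",
           "the bank and the market reported profits",
           "the striker and the team won the match"),
  stringsAsFactors = FALSE
)

# Build one dfm for all documents; 'category' is carried along as a docvar
bbc_corpus <- corpus(bbc_data, text_field = "text")
toks <- tokens(bbc_corpus, remove_punct = TRUE)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks)
bbc_dfm <- dfm(toks)

# Split the dfm itself by rows, rather than splitting only the labels
train_dfm <- bbc_dfm[1:4, ]
test_dfm  <- bbc_dfm[5:6, ]

# Fit on the training rows and their labels
nb <- textmodel_nb(train_dfm, y = docvars(train_dfm, "category"))

# Align test features with the training features (already identical here,
# but needed whenever the two dfms are built separately)
test_dfm <- dfm_match(test_dfm, features = featnames(train_dfm))

# newdata is a dfm, not a factor of class labels
pred <- predict(nb, newdata = test_dfm)

# Compare predictions with the held-out labels
print(table(predicted = pred, actual = docvars(test_dfm, "category")))
```

The held-out labels are used only in the final table() call to check the predictions; they are never passed to predict().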