
I’m trying to do sentiment analysis in quanteda using the 2015 Lexicoder Sentiment Dictionary and have run into a problem I can’t solve. The dictionary has four keys: negative, positive, neg_positive (a positive word preceded by a negation, used to convey negative sentiment), and neg_negative (a negative word preceded by a negation, used to convey positive sentiment).

I can’t get the final two categories to activate when I use the dictionary.

The script I’m using is shown further down.

The package LexisNexisTools converts the file into a quanteda corpus. While investigating the problem, I wasn’t getting any neg_positive or neg_negative hits, so I added the example sentence “This aggressive policy will not win friends” (taken from the reference on the quanteda page, and containing one neg_positive bigram, 'will not') to the first line of the first document. This is registered in the first dfm and can be seen in the toks_dict tokens list. However, there are more instances of the exact same bigram (will not) in the corpus that are not counted. Moreover, there are other neg_positive and neg_negative phrases in the corpus that are not registered at all.

I’m not sure how to resolve this. Curiously, in the third object, dfm_dict, the initial ‘will not’ is not registered as neg_positive at all. The overall counts for the negative and positive categories do not change, so this isn’t a case of the missing values being counted elsewhere. I’m really at a loss as to what I’m doing wrong; any help would be greatly appreciated!



rm(list=ls())

library(quanteda)
library(quanteda.corpora)
library(readtext)
library(LexisNexisTools)
library(tidyverse)
library(RColorBrewer)

# Use the LexisNexisTools package to read the file and build the corpus
LNToutput <- lnt_read("word_labour.docx")

corp <- lnt_convert(LNToutput, to = "quanteda")


dfm <- dfm(corp, dictionary = data_dictionary_LSD2015)
dfm

toks_dict <- tokens_lookup(tokens(corp), dictionary = data_dictionary_LSD2015, exclusive = FALSE)
toks_dict

dfm_dict <- dfm(toks_dict, dictionary = data_dictionary_LSD2015, exclusive = FALSE)
dfm_dict


https://www.dropbox.com/s/qdwetdn8bt9fdrd/word_labour.DOCX?dl=0

This is a link to the word document that forms the raw text for the corpus.


1 Answer


Works fine for me. By running kwic() on the compounded dictionary keys, you can see where the matches are occurring.

library("quanteda", warn.conflicts = FALSE)
## Package version: 2.1.0
## Parallel computing: 2 of 8 threads used.
## See https://quanteda.io for tutorials and examples.

corp <- readtext::readtext("https://www.dropbox.com/s/qdwetdn8bt9fdrd/word_labour.docx?dl=1") %>%
  corpus()

toks <- tokens(corp)

kwic(toks, pattern = data_dictionary_LSD2015["neg_positive"])
##        [word_labour.docx, 82:83] Body This aggressive policy will |  not win  | friends. Ed Miliband has
##    [word_labour.docx, 8468:8469]                manifesto as" as" | not worth | the paper it is written
##    [word_labour.docx, 9681:9682]       more high street services. | Not clear | - Labour has criticised the
##    [word_labour.docx, 9778:9779]     will get one-to-one tuition. | Not clear | - then shadow education secretary
##    [word_labour.docx, 9841:9842]      children free school meals. | Not clear | - Labour appeared to back
##  [word_labour.docx, 10338:10339]      western Balkans and Turkey. | Not clear | - this is not a
##  [word_labour.docx, 13463:13464]              in January. What is | not clear | is if it allows a
kwic(toks, pattern = data_dictionary_LSD2015["neg_negative"])
##  [word_labour.docx, 10772:10773] over again. It is | not unusual | for voters to trust the
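
To see the same check on something small, here is a minimal sketch that runs kwic() on just the example sentence from the question. The object names toks_ex and kw_ex are only illustrative. In the corpus output above, the phrase matched in that sentence is "not win" (a negation followed by a positive word), not "will not", and the sketch lets you confirm that in isolation.

```r
library("quanteda")

# The example sentence the question added to the first document
txt <- "This aggressive policy will not win friends."
toks_ex <- tokens(txt)

# Show which tokens match the neg_positive key of the dictionary
kw_ex <- kwic(toks_ex, pattern = data_dictionary_LSD2015["neg_positive"])
kw_ex

# The matched phrase is stored in the keyword column
kw_ex$keyword
```

If a phrase you expect to be counted does not show up in a kwic() call like this, it is probably not an entry under that dictionary key.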

The dfm reflects this:

tokens_lookup(toks, dictionary = data_dictionary_LSD2015) %>%
  dfm()
## Document-feature matrix of: 1 document, 4 features (0.0% sparse).
##                   features
## docs               negative positive neg_positive neg_negative
##   word_labour.docx      512      687            7            1
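
The same lookup can be tried on the single example sentence, which makes it easier to see what each step does. This is a rough sketch rather than a fix for the script in the question; the names toks_ex, toks_in_place and dfm_ex are made up for illustration. With exclusive = FALSE, tokens_lookup() keeps the unmatched tokens and replaces each match with the upper-cased key, which is handy for inspection. With the default exclusive = TRUE, only the keys are kept, which is the form passed to dfm() for counting.

```r
library("quanteda")

txt <- "This aggressive policy will not win friends."
toks_ex <- tokens(txt)

# Inspect the matches in place: matched tokens are replaced by the key name
toks_in_place <- tokens_lookup(toks_ex,
                               dictionary = data_dictionary_LSD2015,
                               exclusive = FALSE)
toks_in_place

# Count the keys: look up once on the tokens, then build the dfm
# without passing the dictionary a second time
dfm_ex <- dfm(tokens_lookup(toks_ex, dictionary = data_dictionary_LSD2015))
dfm_ex
```

Applying the dictionary once, at the tokens stage, keeps the multi-word entries intact, because the lookup still sees the original word sequence.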

P.S. I used the readtext package to read the file directly, skipping the LexisNexisTools steps, which were not necessary for this question.