
I am trying to remove typos from my text data before analysis, so I am using the dictionary feature of the quanteda package. It works fine for unigrams, but it gives unexpected output for bigrams. I am not sure how to handle typos so that they do not sneak into my bigrams and trigrams.

ZTestCorp1 <- c("The new law included a capital gains tax, and an inheritance tax.", 
                "New York City has raised a taxes: an income tax and a sales tax.")

ZcObj <- corpus(ZTestCorp1)

mydict <- dictionary(list("the"="the", "new"="new", "law"="law", 
                      "capital"="capital", "gains"="gains", "tax"="tax", 
                      "inheritance"="inheritance", "city"="city")) 

Zdfm1 <- dfm(ZcObj, ngrams=2, concatenator=" ", 
         what = "fastestword", 
         toLower=TRUE, removeNumbers=TRUE,
         removePunct=TRUE, removeSeparators=TRUE,
         removeTwitter=TRUE, stem=FALSE,
         ignoredFeatures=NULL,
         language="english", 
         dictionary=mydict, valuetype="fixed")

wordsFreq1 <- colSums(sort(Zdfm1))

Current output

> wordsFreq1
    the         new         law     capital       gains         tax inheritance        city 
      0           0           0           0           0           0           0           0 

Without using dictionary, the output is as follows:

> wordsFreq
    tax and         the new         new law    law included      included a       a capital 
          2               1               1               1               1               1 
capital gains       gains tax          and an  an inheritance inheritance tax        new york 
          1               1               1               1               1               1 
  york city        city has      has raised        raised a         a taxes        taxes an 
          1               1               1               1               1               1 
  an income      income tax           and a         a sales       sales tax 
          1               1               1               1               1

Expected Bigram

The new
new law
law capital
capital gains
gains tax
tax inheritance
inheritance city  

P.S. I was assuming that tokenizing is done after the dictionary match-up, but based on the results I see, that does not seem to be the case.

On another note, I tried to create my dictionary object as

mydict <- dictionary(list(mydict=c("the", "new", "law", "capital", "gains", 
                      "tax", "inheritance", "city"))) 

But it did not work, so I had to use the approach above, which I think is inefficient.

UPDATE Added output based on Ken's solution:

> (myDfm1a <- dfm(ZcObj, verbose = FALSE, ngrams=2, 
+                keptFeatures = c("the", "new", "law", "capital", "gains",  "tax", "inheritance", "city")))
Document-feature matrix of: 2 documents, 14 features.
2 x 14 sparse Matrix of class "dfmSparse" features
docs    the_new new_law law_included a_capital capital_gains gains_tax   tax_and an_inheritance
text1       1       1            1         1             1         1       1               1
text2       0       0            0         0             0         0       1              0
   features
docs    inheritance_tax new_york york_city city_has income_tax sales_tax
text1               1        0         0        0          0         0
text2               0        1         1        1          1         1

1 Answer


Updated 2017-12-21 for newer versions of quanteda

Glad to see you are working with the package! I think there are two issues in what you are struggling with. The first is how to apply feature selection before forming ngrams. The second is how to define feature selections generally (using quanteda).

The first issue: how to apply feature selection before forming ngrams. Here you have defined a dictionary to do this. (As I will show below, this is not necessary here.) You would like to remove all terms not in the selection list, and then form bigrams. quanteda does not do this by default because the result is not a standard form of "bigram": the words are no longer collocated within a window defined strictly by adjacency. In your expected result, for instance, law capital is not a pair of adjacent terms, which is the usual definition of a bigram.
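To make the distinction concrete, here is a minimal sketch in base R only (no quanteda involved). The helper make_bigrams() and the variables tokens and keep are hypothetical names introduced just for this illustration; the token list is the start of your first text.

```r
# Illustration only: base R, no quanteda.
# make_bigrams() is a hypothetical helper that pairs each token with its successor.
make_bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1))
}

tokens <- c("the", "new", "law", "included", "a", "capital", "gains", "tax")
keep   <- c("the", "new", "law", "capital", "gains", "tax")

# Standard bigrams: pairs of tokens that are adjacent in the original text
adjacent_bigrams <- make_bigrams(tokens)

# Bigrams formed after dropping the unwanted tokens: pairs that were
# separated in the original text can now end up next to each other
filtered_bigrams <- make_bigrams(tokens[tokens %in% keep])

print(adjacent_bigrams)
print(filtered_bigrams)
```

In the second result "law capital" appears even though "included a" sat between the two words in the original text, which is why this is not what a bigram function produces by default.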

However, we can override this behaviour by building up the document-feature matrix more "manually".

First, tokenise the texts.

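# tokenize the original texts, dropping punctuation and numbers, then lowercase
# (newer quanteda versions use snake_case argument names)
toks <- tokens(ZcObj, remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_tolower()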
toks
## tokens object from 2 documents.
## text1 :
##  [1] "the"         "new"         "law"         "included"    "a"           "capital"     "gains"       "tax"         "and"         "an"          "inheritance" "tax"        
## 
## text2 :
##  [1] "new"    "york"   "city"   "has"    "raised" "a"      "taxes"  "an"     "income" "tax"    "and"    "a"      "sales"  "tax"  

Now we apply your dictionary mydict to the tokenized texts using tokens_select():

(toksDict <- tokens_select(toks, mydict, selection = "keep"))
## tokens object from 2 documents.
## text1 :
##  [1] "the"         "new"         "law"         "capital"     "gains"       "tax"         "inheritance" "tax"        
## 
## text2 :
##  [1] "new"  "city" "tax"  "tax" 

From this selected set of tokens, we can now form the bigrams (or we could feed toksDict directly to dfm()):

(toks2 <- tokens_ngrams(toksDict, n = 2, concatenator = " "))
## tokens object from 2 documents.
## text1 :
##  [1] "the new"         "new law"         "law capital"     "capital gains"   "gains tax"       "tax inheritance" "inheritance tax"
## 
## text2 :
##  [1] "new city" "city tax" "tax tax" 

# now create the dfm
(myDfm2 <- dfm(toks2))
## Document-feature matrix of: 2 documents, 10 features.
## 2 x 10 sparse Matrix of class "dfm"
##        features
## docs    the new new law law capital capital gains gains tax tax inheritance inheritance tax new city city tax tax tax
##   text1       1       1           1             1         1               1               1        0        0       0
##   text2       0       0           0             0         0               0               0        1        1       1
topfeatures(myDfm2)
#     the new         new law     law capital   capital gains       gains tax tax inheritance inheritance tax        new city        city tax         tax tax 
#           1               1               1               1               1               1               1               1               1               1 

The feature list is now very close to what you wanted.

The second issue is why your dictionary approach seems inefficient. This is because you are creating a dictionary to perform feature selection without really using it as a dictionary. A dictionary in which every key maps only to itself as its value does no grouping, so it is not really a dictionary. Simply feed dfm() a character vector of the tokens to keep instead, and it works fine, e.g.:

(myDfm1 <- dfm(ZcObj, verbose = FALSE, 
               keptFeatures = c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")))
## Document-feature matrix of: 2 documents, 8 features.
## 2 x 8 sparse Matrix of class "dfm"
##        features
## docs    the new law capital gains tax inheritance city
##   text1   1   1   1       1     1   2           1    0
##   text2   0   1   0       0     0   2           0    1
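
The same character vector can also replace the dictionary in the tokens-based workflow shown earlier. The following is a sketch rather than a definitive recipe: it assumes a recent quanteda version (roughly 1.0 or later) in which tokens() takes remove_punct and tokens_select() accepts a plain character vector as its second argument. The names txt, keep, kept and bigrams are introduced here only for illustration.

```r
library(quanteda)

# The two example texts from the question
txt <- c("The new law included a capital gains tax, and an inheritance tax.",
         "New York City has raised a taxes: an income tax and a sales tax.")

# Plain character vector of tokens to keep (assumption: no dictionary needed)
keep <- c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")

# Tokenize and lowercase
toks <- tokens_tolower(tokens(txt, remove_punct = TRUE))

# Keep only the listed tokens, then form bigrams from what remains
kept    <- tokens_select(toks, keep, selection = "keep")
bigrams <- tokens_ngrams(kept, n = 2, concatenator = " ")

print(as.list(bigrams))
```

Because the selection happens before the bigrams are formed, any token that is not in keep (a typo, for example) never makes it into an ngram.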