I am trying to keep typos out of my text analysis, so I am using the dictionary feature of the quanteda package. It works fine for unigrams, but it gives unexpected output for bigrams. I am not sure how to handle typos so that they do not sneak into my bigrams and trigrams.
ZTestCorp1 <- c("The new law included a capital gains tax, and an inheritance tax.",
"New York City has raised a taxes: an income tax and a sales tax.")
ZcObj <- corpus(ZTestCorp1)
mydict <- dictionary(list("the"="the", "new"="new", "law"="law",
"capital"="capital", "gains"="gains", "tax"="tax",
"inheritance"="inheritance", "city"="city"))
Zdfm1 <- dfm(ZcObj, ngrams=2, concatenator=" ",
what = "fastestword",
toLower=TRUE, removeNumbers=TRUE,
removePunct=TRUE, removeSeparators=TRUE,
removeTwitter=TRUE, stem=FALSE,
ignoredFeatures=NULL,
language="english",
dictionary=mydict, valuetype="fixed")
wordsFreq1 <- colSums(sort(Zdfm1))
Current output
> wordsFreq1
the new law capital gains tax inheritance city
0 0 0 0 0 0 0 0
Without using dictionary, the output is as follows:
> wordsFreq
tax and          the new          new law          law included     included a       a capital
2                1                1                1                1                1
capital gains    gains tax        and an           an inheritance   inheritance tax  new york
1                1                1                1                1                1
york city        city has         has raised       raised a         a taxes          taxes an
1                1                1                1                1                1
an income        income tax       and a            a sales          sales tax
1                1                1                1                1
Expected bigrams:
the new
new law
law capital
capital gains
gains tax
tax inheritance
inheritance city
P.S. I was assuming that tokenization into n-grams is done after the dictionary lookup. Based on the results I see, that does not appear to be the case.
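To make the order of operations I had in mind concrete, here is a rough base R sketch. It makes no quanteda calls, and the helper names `tokenize_simple` and `bigrams_after_filter` are made up for illustration. It drops every word that is not on the keep list first and only then pairs adjacent tokens, working on each document separately:

```r
# Illustration only: base R, hypothetical helper names, not quanteda's API.
texts <- c("The new law included a capital gains tax, and an inheritance tax.",
           "New York City has raised a taxes: an income tax and a sales tax.")
keep <- c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")

# Lowercase, strip punctuation, and split a single string on whitespace.
tokenize_simple <- function(x) {
  x <- tolower(gsub("[[:punct:]]", "", x))
  strsplit(x, "[[:space:]]+")[[1]]
}

# Filter tokens against the keep list first, then pair adjacent survivors.
bigrams_after_filter <- function(x, keep) {
  toks <- tokenize_simple(x)
  toks <- toks[toks %in% keep]
  if (length(toks) < 2) return(character(0))
  paste(head(toks, -1), tail(toks, -1))
}

bigrams <- lapply(texts, bigrams_after_filter, keep = keep)
```

For the first text this gives "the new", "new law", "law capital", "capital gains", "gains tax", "tax inheritance" and "inheritance tax". Because it works per document, it never produces a bigram that spans two texts.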
On another note, I tried to create my dictionary object as:
mydict <- dictionary(list(mydict=c("the", "new", "law", "capital", "gains",
"tax", "inheritance", "city")))
But that did not work, so I had to use the approach above, which I think is inefficient.
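One way to avoid typing each "word"="word" pair by hand might be to build the named list from a plain character vector. This is only a sketch in base R; `words` and `dict_list` are names I made up, and I am assuming the resulting list can be passed to `dictionary()` exactly like the hand-written list above:

```r
# Build a named list where every entry maps a word to itself.
words <- c("the", "new", "law", "capital", "gains",
           "tax", "inheritance", "city")
dict_list <- setNames(as.list(words), words)

# Assumption: with quanteda loaded, this would replace the hand-written list.
# mydict <- dictionary(dict_list)
```

This only shortens the construction of the dictionary; it does not change the bigram behaviour described above.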
UPDATE: Added output based on Ken's solution:
> (myDfm1a <- dfm(ZcObj, verbose = FALSE, ngrams=2,
+ keptFeatures = c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")))
Document-feature matrix of: 2 documents, 14 features.
2 x 14 sparse Matrix of class "dfmSparse"
       features
docs   the_new  new_law  law_included  a_capital  capital_gains  gains_tax  tax_and  an_inheritance
text1  1        1        1             1          1              1          1        1
text2  0        0        0             0          0              0          1        0
       features
docs   inheritance_tax  new_york  york_city  city_has  income_tax  sales_tax
text1  1                0         0          0         0           0
text2  0                1         1          1         1           1