2
votes

I am new to R and used the quanteda package in R to create a corpus of newspaper articles. From this I have created a dfm:

dfmatrix <- dfm(corpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE, remove_numbers = FALSE)

I am trying to extract bigrams (e.g. "climate change", "global warming"), but when I add ngrams = 2 to the call, as below, I keep getting an error message saying that the ngrams argument is not used.

dfmatrix <- dfm(corpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE, remove_numbers = FALSE, ngrams = 2)

I have installed the tokenizer, tidyverse, dplyr, ngram, readtext, quanteda and stm libraries. Below is a screenshot of my corpus. The Doc_iD column holds the article titles. I need the bigrams to be extracted from the "texts" column.

[Screenshot of the corpus not reproduced here]

Do I need to extract the ngrams from the corpus first or can I do it from the dfm? Am I missing some piece of code that allows me to extract the bigrams?


3 Answers

1
votes

Strictly speaking, if ngrams are what you want, then you can use tokens_ngrams() to form them. But it sounds like you would rather get more interesting multi-word expressions than "of the" etc. For that, I would use textstat_collocations(). You will want to do this on tokens, not on a dfm: the dfm will already have split your tokens into bag-of-words features, from which ngrams or MWEs can no longer be formed.

Here's an example from the built-in inaugural corpus. It removes stopwords but leaves a "pad" so that words that were not adjacent before the stopword removal will not appear as adjacent after their removal.

library("quanteda")
## Package version: 2.0.1

toks <- tokens(data_corpus_inaugural) %>%
  tokens_remove(stopwords("en"), padding = TRUE)

colls <- textstat_collocations(toks)
head(colls)
##          collocation count count_nested length   lambda        z
## 1      united states   157            0      2 7.893348 41.19480
## 2             let us    97            0      2 6.291169 36.15544
## 3    fellow citizens    78            0      2 7.963377 32.93830
## 4    american people    40            0      2 4.426593 23.45074
## 5          years ago    26            0      2 7.896667 23.26947
## 6 federal government    32            0      2 5.312744 21.80345

By default, the collocations are scored and returned in descending order of score (the z column in the output above).

To "extract" them, just take the collocation column:

head(colls$collocation, 50)
##  [1] "united states"         "let us"                "fellow citizens"      
##  [4] "american people"       "years ago"             "federal government"   
##  [7] "almighty god"          "general government"    "fellow americans"     
## [10] "go forward"            "every citizen"         "chief justice"        
## [13] "four years"            "god bless"             "one another"          
## [16] "state governments"     "political parties"     "foreign nations"      
## [19] "solemn oath"           "public debt"           "religious liberty"    
## [22] "public money"          "domestic concerns"     "national life"        
## [25] "future generations"    "two centuries"         "social order"         
## [28] "passed away"           "good faith"            "move forward"         
## [31] "earnest desire"        "naval force"           "executive department" 
## [34] "best interests"        "human dignity"         "public expenditures"  
## [37] "public officers"       "domestic institutions" "tariff bill"          
## [40] "first time"            "race feeling"          "western hemisphere"   
## [43] "upon us"               "civil service"         "nuclear weapons"      
## [46] "foreign affairs"       "executive branch"      "may well"             
## [49] "state authorities"     "highest degree"
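
To see what the padding described above does to bigram formation, here is a minimal sketch. The sentence and the object names (txt, bi_no_pad, bi_pad) are invented for illustration and do not come from the question's corpus.

```r
library(quanteda)

# Hypothetical one-sentence text, used only for illustration
txt <- "the future of the climate change debate"
toks <- tokens(txt)

# Without padding: removing "of" and "the" makes "future" and "climate"
# neighbours, so a bigram is formed that never occurred in the text
no_pad <- tokens_remove(toks, stopwords("en"))
bi_no_pad <- as.list(tokens_ngrams(no_pad, n = 2))[[1]]

# With padding: each removed stopword leaves an empty placeholder, so
# "future" and "climate" stay separated
pad <- tokens_remove(toks, stopwords("en"), padding = TRUE)
bi_pad <- as.list(tokens_ngrams(pad, n = 2))[[1]]

"future_climate" %in% bi_no_pad  # spurious bigram appears without padding
"future_climate" %in% bi_pad     # and is absent with padding
```

The same reasoning applies to textstat_collocations(): with padding = TRUE, only words that were adjacent in the original text can be scored as a collocation.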
0
votes

I think you need to create the ngrams from the tokenized corpus rather than from the dfm. This is an example adapted from the quanteda tutorial website:

library(quanteda)
corp <- corpus(data_corpus_inaugural)
toks <- tokens(corp)

tokens_ngrams(toks, n = 2)

Tokens consisting of 58 documents and 4 docvars.
1789-Washington :
 [1] "Fellow-Citizens_of" "of_the"             "the_Senate"         "Senate_and"         "and_of"             "of_the"             "the_House"         
 [8] "House_of"           "of_Representatives" "Representatives_:"  ":_Among"            "Among_the"         
[ ... and 1,524 more ]
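
To connect this to the dfm() call in the question, the sketch below applies the same options (punctuation removal, stopword removal, stemming) at the token level, forms the bigrams, and only then builds the dfm. The two example texts and the object names are invented for illustration; this is one possible ordering of the steps, not the only one.

```r
library(quanteda)

# Hypothetical stand-in for the newspaper corpus in the question
txts <- c(a1 = "Climate change is accelerating. Global warming drives climate change.",
          a2 = "The debate on global warming continues.")
corp <- corpus(txts)

# Token-level equivalents of the options passed to dfm() in the question
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = FALSE)
toks <- tokens_remove(toks, stopwords("english"))
toks <- tokens_wordstem(toks)

# Form bigrams from the tokens, then build the dfm from the bigrams
toks_bi <- tokens_ngrams(toks, n = 2)
dfmat_bi <- dfm(toks_bi)

featnames(dfmat_bi)
```

Because stemming happens before the bigrams are formed, the features are stemmed word pairs joined by an underscore rather than the original surface forms.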
0
votes

EDITED: Hi, this example from the dfm help page may be useful.

library(quanteda)


# You say you're already creating the corpus?
# Where it says "data_corpus_inaugural", put your corpus name

# Where it says "the_senate", put "climate_change"
# Where it says "the_house", put "global_warming"
# (bigrams are joined with an underscore, so use "_" rather than a space)

tokens(data_corpus_inaugural) %>%
  tokens_ngrams(n = 2) %>%
  dfm(stem = TRUE, select = c("the_senate", "the_house"))

#> Document-feature matrix of: 58 documents, 2 features (89.7% sparse) and 4 docvars.
#>                  features
#> docs              the_senat the_hous
#>   1789-Washington         1        2
#>   1793-Washington         0        0
#>   1797-Adams              0        0
#>   1801-Jefferson          0        0
#>   1805-Jefferson          0        0
#>   1809-Madison            0        0
#> [ reached max_ndoc ... 52 more documents ]
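
As far as I know, more recent quanteda releases deprecate passing stem and select directly to dfm(), so the selection step can instead be written with dfm_select(). The following is a sketch under that assumption; the three texts and the object names are invented for illustration, and stemming is left out so that the selected features keep their full spelling.

```r
library(quanteda)

# Hypothetical example texts standing in for the articles
txts <- c(d1 = "Climate change and global warming are linked.",
          d2 = "Global warming is one aspect of climate change.",
          d3 = "Nothing relevant here.")

toks <- tokens(txts, remove_punct = TRUE)
toks <- tokens_tolower(toks)

# Form bigrams, build the dfm, then keep only the bigrams of interest
dfmat <- dfm(tokens_ngrams(toks, n = 2))
dfmat_sel <- dfm_select(dfmat, pattern = c("climate_change", "global_warming"))

dfmat_sel
```

Documents that contain neither bigram stay in the matrix with zero counts, which keeps the rows aligned with the original corpus.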