how to extract ngrams from a text in R (newspaper articles)

Question

I am new to R and used the quanteda package in R to create a corpus of newspaper articles. From this I have created a dfm:

dfmatrix <- dfm(corpus, remove = stopwords("english"),stem = TRUE, remove_punct=TRUE, remove_numbers = FALSE)

I am trying to extract bigrams (e.g. "climate change", "global warming") but keep getting an error message when I type the following, saying the ngrams argument is not used.

dfmatrix <- dfm(corpus, remove = stopwords("english"),stem = TRUE, remove_punct=TRUE, remove_numbers = FALSE, ngrams = 2)

I have installed the tokenizer, tidyverse, dplyr, ngram, readtext, quanteda and stm libraries. Below is a screenshot of my corpus. Doc_iD is the article titles. I need the bigrams to be extracted from the "texts" column.

Do I need to extract the ngrams from the corpus first or can I do it from the dfm? Am I missing some piece of code that allows me to extract the bigrams?

Ken Benoit Ken Benoit · Accepted Answer · 2020-06-05T16:25:06

Strictly speaking, if ngrams are what you want, then you can use tokens_ngrams() to form them. But sounds like you rather get more interesting multi-word expressions than "of the" etc. For that, I would use textstat_collocations(). You will want to do this on tokens, not on a dfm - the dfm will have already split your tokens into bag of words features, from which ngrams or MWEs can no longer be formed.

Here's an example from the built-in inaugural corpus. It removes stopwords but leaves a "pad" so that words that were not adjacent before the stopword removal will not appear as adjacent after their removal.

library("quanteda")
## Package version: 2.0.1

toks <- tokens(data_corpus_inaugural) %>%
  tokens_remove(stopwords("en"), padding = TRUE)

colls <- textstat_collocations(toks)
head(colls)
##          collocation count count_nested length   lambda        z
## 1      united states   157            0      2 7.893348 41.19480
## 2             let us    97            0      2 6.291169 36.15544
## 3    fellow citizens    78            0      2 7.963377 32.93830
## 4    american people    40            0      2 4.426593 23.45074
## 5          years ago    26            0      2 7.896667 23.26947
## 6 federal government    32            0      2 5.312744 21.80345

These are by default scored and sorted in order of descending score.

To "extract" them, just take the collocation column:

head(colls$collocation, 50)
##  [1] "united states"         "let us"                "fellow citizens"      
##  [4] "american people"       "years ago"             "federal government"   
##  [7] "almighty god"          "general government"    "fellow americans"     
## [10] "go forward"            "every citizen"         "chief justice"        
## [13] "four years"            "god bless"             "one another"          
## [16] "state governments"     "political parties"     "foreign nations"      
## [19] "solemn oath"           "public debt"           "religious liberty"    
## [22] "public money"          "domestic concerns"     "national life"        
## [25] "future generations"    "two centuries"         "social order"         
## [28] "passed away"           "good faith"            "move forward"         
## [31] "earnest desire"        "naval force"           "executive department" 
## [34] "best interests"        "human dignity"         "public expenditures"  
## [37] "public officers"       "domestic institutions" "tariff bill"          
## [40] "first time"            "race feeling"          "western hemisphere"   
## [43] "upon us"               "civil service"         "nuclear weapons"      
## [46] "foreign affairs"       "executive branch"      "may well"             
## [49] "state authorities"     "highest degree"

how to extract ngrams from a text in R (newspaper articles)

3 Answers