
I am trying to use the fantastic quanteda package to look at the co-occurrence of terms in news articles.

I can find the features which co-occur with "美国" (the United States) as follows:

ch14_corp <- corpus(data_14)
ch14_toks <- tokens(ch14_corp, remove_punct = TRUE) %>%
  tokens_remove(ch_stop)
ch14_fcm <- fcm(ch14_toks, context = "window")

and then get the features that co-occur most frequently:

topfeatures(ch14_fcm["美国", ], n = 50)

朝鲜     美国     日本     中国     韩国     问题       马     政府     国家     报道 
     881      804      555      552      297      288      270      254      253      243 
      奥     总统       称     战略     表示       韩     关系     政策     认为     进行 
     238      238      234      227      214      174      173      169      162      160 
      中       核     亚太 国家安全     经济     安全       局     世界     发言   国务院 
     157      153      148      137      136      136      136      135      132      129 
      美       国     访问   俄罗斯     军事     国际     官员     媒体     公民     人权 
     126      122      121      120      120      118      118      114      114      114 
    联合     一个       名     地区     安倍     平衡     导弹     国防       斯     克里 
     112      112      112      111      110      110      107      105      104      102

Could anybody tell me how to convert this to a data.frame, or a table with the feature in column A and the number of times it co-occurs with "美国" in column B?

I guess the other way might be to skip topfeatures and instead extract the row (or column?) of the matrix containing all the terms that co-occur with "美国", then sort it by the number of co-occurrences?


2 Answers


That's more or less right. Here's how I'd do it using a built-in example; you can substitute your own text and parameters (e.g. n) as needed.

Note the use of padding = TRUE: this leaves a blank in the space where punctuation or stopwords were removed, so that the proximities are not inflated for words formerly separated by one of the removed tokens.
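To see what padding does concretely, here is a small sketch with a made-up sentence (the sentence is only an illustration, not from the question's data): removed punctuation leaves an empty-string token in place, so the window distance between the surrounding words is preserved.

```r
library("quanteda")

# without padding, "cat" and "a" become adjacent after the comma is dropped;
# with padding = TRUE, an empty string "" holds the comma's position
toks_pad <- tokens("A cat, a dog.", remove_punct = TRUE, padding = TRUE)
as.list(toks_pad)
# the token sequence contains "" where the comma and period were
```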

library("quanteda")
## Package version: 2.1.1

ch14_corp <- head(data_corpus_inaugural)
ch14_toks <- tokens(ch14_corp, remove_punct = TRUE, padding = TRUE) %>%
  tokens_remove(stopwords("en"), padding = TRUE) %>%
  tokens_tolower()
ch14_fcm <- fcm(ch14_toks, context = "window")

topf <- topfeatures(ch14_fcm["united", ], n = 6)

data.frame(Term = names(topf), Freq = topf, row.names = NULL) %>%
  dplyr::arrange(desc(Freq))
##           Term Freq
## 1       states    8
## 2   government    3
## 3 constitution    3
## 4   instituted    1
## 5       enable    1
## 6         step    1
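If you would rather avoid topfeatures entirely, as the question suggests, you can pull the whole row out of the fcm and sort it yourself. A sketch using the same built-in example (with "united" standing in for "美国"; the variable names are mine):

```r
library("quanteda")

ch14_corp <- head(data_corpus_inaugural)
ch14_toks <- tokens(ch14_corp, remove_punct = TRUE, padding = TRUE) %>%
  tokens_remove(stopwords("en"), padding = TRUE) %>%
  tokens_tolower()
ch14_fcm <- fcm(ch14_toks, context = "window")

# extract the single row as a named numeric vector, then sort descending
row_counts <- as.matrix(ch14_fcm["united", ])[1, ]
df <- data.frame(Term = names(row_counts), Freq = unname(row_counts))
df <- df[order(-df$Freq), ]
head(df)
```

This gives the full co-occurrence column for the term, whereas topfeatures only returns the top n.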

I think it works if I do it the following way:

mat_term <- ch14_fcm["美国", ]                 # the fcm row for "美国"
df <- as.data.frame(t(as.matrix(mat_term)))   # one column of counts, terms as rownames
df <- data.frame(Term = rownames(df), Freq = df[, 1], row.names = NULL)
us_co <- df[order(-df$Freq), ]
us_co[1:100, ]

Could somebody confirm that this is correct, and gives me a data frame of the top 100 features that co-occur with the term "美国" (the US)?