
I was reading the GloVe word-embedding paper (Pennington et al., 2014) and wanted to reproduce the co-occurrence probabilities (and their ratio) from the ice/steam example explained in the paper (screenshot below).

[GloVe example: co-occurrence probability table from the paper]

I understand that each entry is the probability of a particular context word (e.g., solid) appearing in the context of a given word (e.g., ice). But when I implement this logic in R (or Python) on the Wikipedia dataset, the corpus is so large that applying the formula exactly as described in the paper runs out of memory before I get any meaningful result. Is there any workaround to reproduce the example from the paper in R or Python (preferably R)?
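If I follow the paper's notation correctly, each cell of the table is

P(k | i) = X_ik / X_i, where X_ik is the number of times word k appears in the context of word i, and X_i is the sum of X_ik over all k,

and the last row is the ratio P(k | ice) / P(k | steam), e.g. P(solid | ice) / P(solid | steam).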

This is the code that I tried:

library(text2vec)
# download the text8 corpus (a cleaned Wikipedia sample) if it is not already present
text8_file = "~/text8"
if (!file.exists(text8_file)) {
  download.file("http://mattmahoney.net/dc/text8.zip", "~/text8.zip")
  unzip("~/text8.zip", files = "text8", exdir = "~/")
}
wiki = readLines(text8_file, n = 1, warn = FALSE)


# tokenise on whitespace and build the vocabulary, keeping terms that occur at least 10 times
tokens = space_tokenizer(wiki)

it = itoken(tokens, progressbar = FALSE)
vocab = create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 10)
vectorizer <- vocab_vectorizer(vocab)

# term co-occurrence matrix with a window of 5; uniform weights give raw counts
# instead of text2vec's default 1/distance weighting
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L, weights = c(1, 1, 1, 1, 1))

chk <- as.matrix(tcm) + t(as.matrix(tcm)) # symmetrise the TCM (text2vec stores the upper triangle). Issue is here -> cannot convert the sparse matrix into a regular matrix because the dense matrix is far too large to fit in memory.

num <- chk["ice", "solid"] / sum(chk["ice", ])     # P(solid | ice)
den <- chk["steam", "solid"] / sum(chk["steam", ]) # P(solid | steam)

num / den
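
One direction I have been considering (I am not sure it is the right approach) is to skip as.matrix() entirely and keep working with the sparse matrix, since it supports indexing by term name and row sums directly. A rough, untested sketch:

library(Matrix)

# symmetrise while staying sparse (text2vec returns the upper triangle)
chk_sparse <- tcm + t(tcm)

x_ice   <- sum(chk_sparse["ice", ])   # X_ice: total co-occurrence count for "ice"
x_steam <- sum(chk_sparse["steam", ]) # X_steam

p_solid_ice   <- chk_sparse["ice", "solid"] / x_ice     # P(solid | ice)
p_solid_steam <- chk_sparse["steam", "solid"] / x_steam # P(solid | steam)

p_solid_ice / p_solid_steam

But I do not know whether this is enough to reproduce the numbers from the paper, or whether the vocabulary pruning and window size above need to match the paper's settings.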