Calculate word co-occurrence matrix in R

Question

I would like to calculate a word co-occurrence matrix in R. I have the following data frame of sentences:

dat <- as.data.frame("The boy is tall.", header = F, stringsAsFactors = F)
dat[2,1] <- c("The girl is short.")
dat[3,1] <- c("The tall boy and the short girl are friends.")

Which gives me

The boy is tall.
The girl is short.
The tall boy and the short girl are friends.

What I want to do first is make a list of all of the unique words across all three sentences, namely

The
boy
is
tall
girl
short
and
are
friends

I would then like to create a word co-occurrence matrix that counts, in total, how many sentences each pair of words co-occurs in. It would look something like this

        The  boy  is  tall  girl  short  and  are  friends
The       0    2   2     2     2      2    1    1        1
boy       2    0   1     2     1      1    1    1        1
is        2    1   0     1     1      1    0    0        0
tall      2    2   1     0     1      1    1    1        1
etc.

for all of the words, where a word cannot co-occur with itself. Note that in sentence 3, where the word "the" appears twice, the solution should count the co-occurrences only once for that "the".

Does anyone have an idea how I could do this? I am working with a data frame of around 3000 sentences.

what have you tried, why didn't it work? you need to show some effort, here :-) — agenis
With base-R, try to split each sentence with strsplit and whitespace and remove dots, comma and such with gsub. For the unique list of words you can use then the unique command. — Daniel Fischer
Another option: lst <- strsplit(tolower(dat[,1]), "[^[:alnum:]]");labs <- unique(unlist(lst));m <- do.call(rbind, lapply(lst, is.element, el=labs));m <- crossprod(m);dimnames(m) <- rep(list(labs), 2);diag(m) <- 0;m. It's in the veins of @ira - dunno why he/she deleted his/her post. — lukeA
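
The base-R approach suggested in these comments can be sketched as follows. This is only an illustration under some assumptions: the sentences are held in a plain character vector rather than a data frame, words are lowercased, and the variable names (sentences, tokens, vocab, incidence, cooc) are made up for the example.

```r
# Sentences from the question, as a plain character vector
sentences <- c("The boy is tall.",
               "The girl is short.",
               "The tall boy and the short girl are friends.")

# Lowercase, split on runs of non-alphanumeric characters, drop empty tokens;
# unique() keeps a repeated word such as "the" only once per sentence
tokens <- strsplit(tolower(sentences), "[^[:alnum:]]+")
tokens <- lapply(tokens, function(w) unique(w[nzchar(w)]))

# Unique vocabulary in order of first appearance
vocab <- unique(unlist(tokens))

# Binary sentence-by-word incidence matrix: 1 if the word occurs in the sentence
incidence <- t(vapply(tokens, function(w) as.numeric(vocab %in% w),
                      numeric(length(vocab))))
colnames(incidence) <- vocab

# Word-by-word counts of shared sentences; a word does not co-occur with itself
cooc <- crossprod(incidence)
diag(cooc) <- 0
cooc
```

Because the incidence matrix is binary, each sentence contributes at most 1 to any pair of words, which is what the question asks for in sentence 3.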

Hack-R · Accepted Answer · 2016-11-07T12:11:04

library(tm)
library(dplyr)
dat      <- as.data.frame("The boy is tall.", header = F, stringsAsFactors = F)
dat[2,1] <- c("The girl is short.")
dat[3,1] <- c("The tall boy and the short girl are friends.")

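# VectorSource on the text column; newer tm releases expect doc_id/text columns for DataframeSource
ds  <- Corpus(VectorSource(dat[, 1]))
dtm <- DocumentTermMatrix(ds, control=list(wordLengths=c(1,Inf)))

X         <- as.matrix(dtm)  # full documents x terms matrix (inspect() may print only a sample)
out       <- crossprod(X)    # Same as: t(X) %*% X
diag(out) <- 0               # rm own-word occurrences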
out

        Terms
Terms    boy friend girl short tall the
  boy      0      1    1     1    2   2
  friend   1      0    1     1    1   1
  girl     1      1    0     2    1   2
  short    1      1    2     0    1   2
  tall     2      1    1     1    0   2
  the      2      1    2     2    2   0
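
One caveat, sketched below with a hand-built matrix rather than real tm output: a document-term matrix holds raw counts, so a word that appears twice in a sentence (like "the" in sentence 3) is counted twice by crossprod. Converting the counts to 0/1 before multiplying is one way to meet the "count it only once" requirement from the question. The matrix X here is typed in by hand for three of the terms, and the names raw, B and cooc are illustrative.

```r
# Term counts for the three sentences, restricted to three terms;
# "the" occurs twice in the third sentence
X <- matrix(c(1, 1, 1,
              1, 0, 1,
              2, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(Docs = c("1", "2", "3"),
                            Terms = c("the", "boy", "is")))

# Raw counts: the repeated "the" inflates its pairings in sentence 3
raw <- crossprod(X)
diag(raw) <- 0

# Binarize first: 1 if the term occurs in the sentence at all, 0 otherwise
B <- (X > 0) * 1
cooc <- crossprod(B)
diag(cooc) <- 0

raw["the", "boy"]   # 3: sentence 1 once, sentence 3 twice
cooc["the", "boy"]  # 2: one per sentence containing both words
```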

You may also want to remove stop words like "the", i.e.

ds <- tm_map(ds, stripWhitespace)
ds <- tm_map(ds, removePunctuation)
ds <- tm_map(ds, stemDocument)
ds <- tm_map(ds, removeWords, c("the", stopwords("english")))
ds <- tm_map(ds, removeWords, c("the", stopwords("spanish")))
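
If the co-occurrence matrix has already been built, stop words can also be dropped afterwards by removing their rows and columns, since each pairwise count depends only on the two words involved (this assumes no stemming has changed the terms). A minimal base-R sketch, where the small matrix and the stop list are hypothetical and typed in by hand for illustration:

```r
# Small symmetric co-occurrence matrix, entered by hand for illustration
terms <- c("the", "boy", "is", "tall")
cooc <- matrix(c(0, 2, 2, 2,
                 2, 0, 1, 2,
                 2, 1, 0, 1,
                 2, 2, 1, 0),
               nrow = 4, byrow = TRUE, dimnames = list(terms, terms))

# Hypothetical stop list; a fuller list such as tm's stopwords("english")
# could be substituted here
stop_words <- c("the", "is", "and", "are")

# Keep only the rows and columns whose term is not a stop word
keep <- !(rownames(cooc) %in% stop_words)
filtered <- cooc[keep, keep, drop = FALSE]
filtered
```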
