I would like to calculate a word co-occurrence matrix in R. I have the following data frame of sentences:
dat <- data.frame(V1 = "The boy is tall.", stringsAsFactors = FALSE)
dat[2, 1] <- "The girl is short."
dat[3, 1] <- "The tall boy and the short girl are friends."
This gives me:
The boy is tall.
The girl is short.
The tall boy and the short girl are friends.
What I want to do first is make a list of all the unique words across the three sentences, namely:
The
boy
is
tall
girl
short
and
are
friends
I would then like to create a word co-occurrence matrix that counts, in total, how many times each pair of words occurs together in a sentence. It would look something like this:
     The boy is tall girl short and are friends
The    0   2  2    2    2     2   1   1       1
boy    2   0  1    2    1     1   1   1       1
is     2   1  0    1    1     1   0   0       0
tall   2   2  1    0    1     1   1   1       1
etc.
for all of the words, where a word cannot co-occur with itself. Note that in sentence 3, where the word "the" appears twice, the solution should count the co-occurrences only once for that "the".
Does anyone have an idea how I could do this? I am working with a data frame of around 3000 sentences.
Use strsplit on whitespace, and remove dots, commas and such with gsub. For the unique list of words you can then use the unique command. - Daniel Fischer

lst <- strsplit(tolower(dat[,1]), "[^[:alnum:]]");
labs <- unique(unlist(lst));
m <- do.call(rbind, lapply(lst, is.element, el=labs));
m <- crossprod(m);
dimnames(m) <- rep(list(labs), 2);
diag(m) <- 0;
m
It's in the vein of @ira's answer - dunno why he/she deleted his/her post. - lukeA
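The two comments above can be combined into one small base-R function. What follows is only a sketch under a few assumptions, not a definitive implementation: the helper name cooccurrence is made up for illustration, words are lower-cased (so "The" and "the" count as the same word), and anything that is not a letter or digit is treated as a separator. It builds a sentence-by-word 0/1 matrix, so a word repeated within a sentence is counted only once, and then uses crossprod to count, for each pair of words, the sentences that contain both.

```r
# Example data from the question
sentences <- c("The boy is tall.",
               "The girl is short.",
               "The tall boy and the short girl are friends.")

# Hypothetical helper (name and details are assumptions, not from the post):
# returns a symmetric word-by-word matrix whose entries count the sentences
# in which both words appear.
cooccurrence <- function(x) {
  # Lower-case, then split on runs of non-alphanumeric characters
  tokens <- strsplit(tolower(x), "[^[:alnum:]]+")
  # Drop empty strings and keep each word once per sentence
  tokens <- lapply(tokens, function(w) unique(w[nzchar(w)]))
  vocab <- unique(unlist(tokens))

  # Sentence-by-word incidence matrix: 1 if the word occurs in the sentence
  inc <- matrix(0, nrow = length(tokens), ncol = length(vocab),
                dimnames = list(NULL, vocab))
  for (i in seq_along(tokens)) {
    inc[i, tokens[[i]]] <- 1
  }

  # t(inc) %*% inc counts shared sentences for every pair of words
  m <- crossprod(inc)
  diag(m) <- 0  # a word cannot co-occur with itself
  m
}

m <- cooccurrence(sentences)
m["the", "boy"]   # 2: both words appear in sentences 1 and 3
m["is", "tall"]   # 1: only sentence 1 contains both
```

With the real data this would be called as cooccurrence(dat[, 1]). The result is a dense matrix with one row and one column per distinct word, which should be manageable for a few thousand sentences; if the vocabulary grows much larger, a sparse representation may be worth considering.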