I am trying to build a predictive-text algorithm using a simple back-off model, but I am struggling to create the frequency table of words needed to compute the probabilities for selecting the next word. I need to create lists of n-grams with their corresponding frequencies.
The task is part of a course, so I cannot provide the data as it is sourced from the school. The sample I'm using is 10,000 sentences long, and the sentences vary in length.
I have a working solution, but I know it's bad form because I'm growing a data frame with rbind inside a loop, which obviously takes too long.
library(quanteda)
library(data.table)
ntimes2 <- ngrams(tokenize(sampNews, removePunct = TRUE, removeNumbers = TRUE,
                           removeTwitter = TRUE), n = 2)

# collect every bigram into a single data frame, one row at a time (slow)
listwords <- function(input) {
  words <- data.frame(x = 0)
  for (i in 1:10102) {
    words <- rbind(words, input[i])
  }
  words <<- words
}
listwords(ntimes2)
However, I don't know how to extract the sentences from the tokenised list in another way.
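To illustrate the kind of loop-free approach I am imagining (only a sketch, and I am not sure it is the right way; I'm assuming unlist() can flatten the tokenised bigram list into one character vector):

allNgrams <- unlist(ntimes2, use.names = FALSE)                           # flatten the list of bigrams into one vector
freqTable <- data.table(ngram = allNgrams)[, .N, by = ngram][order(-N)]   # count how often each bigram occurs

Is something along those lines the right direction, or is there a better quanteda-native way?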
I've also tried stylo's txt.to.words(), but I cannot control the splitting rule well enough to exclude all the variations of punctuation. In particular, I want to prevent apostrophes from creating a word split.
words<-txt.to.words(sampNews,splitting.rule = "[[:space]]|(?=[^,[:^punct:]])")
words<-txt.to.words(sampNews,splitting.rule = "(_| |,|?|#|@)")
The rule below works, but only covers a limited set of splitters.
words<-txt.to.words(sampNews,splitting.rule = "(_| )")
strsplit does split the words, but it preserves the nested list structure (one list element per sentence), which means I would still need to loop over the data to pull everything into a master list / data frame before I can create a frequency table.
words<- strsplit(sampNews, "[[:space:]]|(?=[^,'[:^punct:]])", perl=TRUE)
[[5]]
[1] "And" "when" "it's" "often" "difficult"
[6] "to" "predict" "a" "law's" "impact,"
[11] "legislators" "should" "think" "twice" "before"
[16] "carrying" "any" "bill" "." ""
[21] "Is" "it" "absolutely" "necessary" "?"
[26] "" "Is" "it" "an" "issue"
[31] "serious" "enough" "to" "merit" "their"
[[6]]
[1] "There" "was" "a" "certain" "amount"
[6] "of" "scoffing" "going" "around" "a"
I have tried sapply/lapply/rbindlist, but it's possible I've not used them correctly, so please do suggest solutions using those as well.
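For example, with rbindlist I imagined something like this, though I am not confident it is how the function is meant to be used (sketch only; words is the nested list returned by strsplit above):

wordsDT <- rbindlist(lapply(words, function(w) data.table(word = w)))   # stack one single-column data.table per sentence
freq <- wordsDT[, .N, by = word][order(-N)]                             # word frequency table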
Any advice is much appreciated.
J
Adding some of the data to give a feel for it:
sampNews[1:2]
[1] "He wasn't home alone, apparently."
[2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
> class(sampNews)
[1] "character"