
I am trying to build a next-word prediction algorithm using a simple back-off model, but I am struggling to create the word frequency table from which the next-word probabilities are calculated. I need to create lists of n-grams with their frequencies.

The task is part of a course, so I cannot provide the data, as it is sourced from the school. The sample I'm using is about 10,000 sentences long, and the sentences vary in length.

I have a solution, but I know it's bad form because I'm looping with rbind, which takes far too long.

library(quanteda)
library(data.table)
ntimes2<- ngrams(tokenize(sampNews,removePunct = TRUE,removeNumbers = TRUE,
                      removeTwitter=TRUE),n=2)
listwords<- function(input){
    words<-data.frame(x=0)
    for (i in 1:10102){
            words<- rbind(words, input[i])
    }
    words<<-words
}
listwords(ntimes2)

However, I don't know how to extract the sentences from the tokenised list in another way.

I've tried using txt.to.words from the stylo package, but I cannot control the splitting rule well enough to exclude all the variations of punctuation. In particular, I want to prevent apostrophes from creating a word split.

words<-txt.to.words(sampNews,splitting.rule = "[[:space]]|(?=[^,[:^punct:]])")
words<-txt.to.words(sampNews,splitting.rule = "(_| |,|?|#|@)")

The following works, but only for a limited number of separators:

words<-txt.to.words(sampNews,splitting.rule = "(_| )")

strsplit splits the words, but it keeps the structure of one list element per input string, which would still mean I need to loop over the data to pull it into a master list or data frame so that I can create a frequency table.

words<- strsplit(sampNews, "[[:space:]]|(?=[^,'[:^punct:]])", perl=TRUE)
[[5]]
 [1] "And"         "when"        "it's"        "often"       "difficult"  
 [6] "to"          "predict"     "a"           "law's"       "impact,"    
[11] "legislators" "should"      "think"       "twice"       "before"     
[16] "carrying"    "any"         "bill"        "."           ""           
[21] "Is"          "it"          "absolutely"  "necessary"   "?"          
[26] ""            "Is"          "it"          "an"          "issue"      
[31] "serious"     "enough"      "to"          "merit"       "their"      

[[6]]
 [1] "There"      "was"        "a"          "certain"    "amount"    
 [6] "of"         "scoffing"   "going"      "around"     "a"         
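
One way to flatten that nested structure without a loop is unlist. A minimal base R sketch, assuming two made-up sentences in place of sampNews and a simpler whitespace split than the pattern above (both are stand-ins for illustration, not the real data or rule):

```r
# Stand-in for sampNews; the real data is not available here.
samp <- c("He wasn't home alone, apparently.",
          "It would die of old age. He wasn't there.")

# strsplit returns a list with one character vector per input string.
tokens_by_sentence <- strsplit(tolower(samp), "[[:space:]]+")

# unlist collapses the list into a single character vector in one call.
all_tokens <- unlist(tokens_by_sentence)

# Strip punctuation from the ends of tokens, keeping inner apostrophes.
all_tokens <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", all_tokens)
all_tokens <- all_tokens[nzchar(all_tokens)]

# table counts each distinct token; sort puts the most frequent first.
word_freq <- sort(table(all_tokens), decreasing = TRUE)
```

Flattening like this discards the boundaries between input strings, which is fine for single-word counts but probably not what you want when building n-grams.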

I have tried sapply, lapply and rbindlist, but it's possible I've not used them correctly, so please do suggest solutions that include them.

Any advice is much appreciated.

J

Adding some of the data to give a feel for it:

sampNews[1:2]

[1] "He wasn't home alone, apparently."                                                                                                                        
[2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
> class(sampNews)
[1] "character"

1 Answer


It turns out that txt.to.words.ext works well enough when language = "English.contr" is set.

If the language is not specified, though, it defaults to English and behaves the same as the basic txt.to.words.
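
For the frequency table itself, here is a rough sketch of how the rbind loop might be avoided in base R. It does not use stylo or quanteda; tokenise and bigrams are hypothetical helpers written for this illustration, and the two sentences are made up to stand in for sampNews.

```r
# Stand-in for sampNews; the real data is not available here.
samp <- c("He wasn't home alone, apparently.",
          "He wasn't there. It would die of old age.")

# Hypothetical helper: lower-case, split on whitespace, and trim
# punctuation from token edges so "wasn't" stays one word.
tokenise <- function(sentence) {
  tok <- strsplit(tolower(sentence), "[[:space:]]+")[[1]]
  tok <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tok)
  tok[nzchar(tok)]
}

# Hypothetical helper: paste each token to its successor to form bigrams.
bigrams <- function(tok) {
  if (length(tok) < 2) return(character(0))
  paste(head(tok, -1), tail(tok, -1))
}

# lapply builds bigrams per element; unlist flattens them in one step,
# which avoids growing a data frame with rbind inside a loop.
all_bigrams <- unlist(lapply(lapply(samp, tokenise), bigrams))

# Count each bigram and convert to a data frame sorted by frequency.
freq <- as.data.frame(table(bigram = all_bigrams), stringsAsFactors = FALSE)
freq <- freq[order(-freq$Freq), ]
```

Note that bigrams are formed within each element of the input vector, so when an element holds several sentences (as some of the sample data does), a pair such as "there it" still spans a sentence boundary. Splitting on sentence-ending punctuation first would be needed to avoid that.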