2
votes

Rewrite of the original post. I'm looking to eliminate a plyr dependency.

I tried to splice the tapply into my code as well as the lapply. The tapply worked for one variable (sex) but not 2 (sex, adult). Slipping the lapply response in does not return a word list by grouping variable it just returns one big word list with the grouping variable at the top (so for person it returns one word list instead of one word list for each person).

I apologize for the length of this but without including the real function I'm working on it doesn't seem to give you guys the insight to help me.

I'm going to include my attempts to alter the function with your suggestions in an answer instead of here to reduce an already bloated post. Also please don't comment on the extra user defined funtions unless helpful to the main problem. They're works in progress and included just to show you what the problem is.

CORRECT OUTPUT WITH PLYR: http://pastebin.com/mr9FvjpF

Dataframe

DATA<-structure(list(person = structure(c(4L, 1L, 5L, 4L, 1L, 3L, 1L,  
4L, 3L, 2L, 1L), .Label = c("greg", "researcher", "sally", "sam",  
"teacher"), class = "factor"), sex = structure(c(2L, 2L, 2L,  
2L, 2L, 1L, 2L, 2L, 1L, 1L, 2L), .Label = c("f", "m"), class = "factor"),  
adult = c(0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L), state = structure(c(2L,  
7L, 9L, 11L, 5L, 4L, 8L, 3L, 10L, 1L, 6L), .Label = c("Shall we move on?  Good then.",  
"Computer is fun. Not too fun.", "I distrust you.",  
"How can we be certain?", "I am telling the truth!", "Im hungry.  Lets eat.  You already?",  
"No its not, its ****.", "There is no way.", "What should we do?",  
"What are you talking about?", "You liar, it stinks!" 
), class = "factor"), code = structure(c(1L, 4L, 5L, 6L,  
7L, 8L, 9L, 10L, 11L, 2L, 3L), .Label = c("K1", "K10", "K11",  
"K2", "K3", "K4", "K5", "K6", "K7", "K8", "K9"), class = "factor")), .Names = c("person",  
"sex", "adult", "state", "code"), row.names = c(NA, -11L), class = "data.frame") 
#=====================

DEPENDENT USER DEFINED TOOLS

Trim<-function (x) gsub("^\\s+|\\s+$", "", x)

bracketX<-function(text, bracket='all'){
    switch(bracket,
        square=sapply(text, function(x)gsub("\\[.+?\\]", "", x)),
        round=sapply(text, function(x)gsub("\\(.+?\\)", "", x)),
        curly=sapply(text, function(x)gsub("\\{.+?\\}", "", x)),
        all={P1<-sapply(text, function(x)gsub("\\[.+?\\]", "", x))
             P1<-sapply(P1, function(x)gsub("\\(.+?\\)", "", x))
             sapply(P1, function(x)gsub("\\{.+?\\}", "", x))})                                                                                                                                                           
}

words <- function(x){as.vector(unlist(strsplit(x, " ")))}

word.split <- function(x) lapply(x, words)

strip <- function(x){
         sentence <- gsub('[[:punct:]]', '', as.character(x))  
         sentence <- gsub('[[:cntrl:]]', '', sentence)  
         sentence <- gsub('\\d+', '', sentence)  
         Trim(tolower(sentence))
}
#=====================

FUNCTION OF INTEREST

textLISTER <- function(dataframe = DFwcweb, text.var = "dialogue", group.vars = "person") {
    require(plyr)
    DF <- dataframe
    DF$words <- Trim(as.character(bracketX(dataframe[, text.var])))
    DF$words <- as.vector(word.split(strip(DF$words)))

    #I'd like to get ride of the plyr dependency in the line below
    dlply(DF, c(group.vars), summarise, words = as.vector(unlist(DF$words)))
} 
#=====================

CURRENTLY THE CODE WORKS WITH ONE OR MORE GROUPING VARIABLES.

textLISTER(DATA, 'state', 'person')
textLISTER(DATA, 'state', c('sex','adult'))
4
Why don't you want to go beyond base functions? Plyr is great, freely available, installing your package from CRAN will get it as well, or if your package is private a simple install.packages will get it - you could even test for that in your package startup... - Spacedman
Yea I love plyr too. I only start searching for other options when I run into speed issues. - John Colby
@Spacedman It's better practice to code in ways that require the user to use less packages. Even Wickham himself seems to be moving this way in his talk he just gave about ggplot. He doesn't want the user to be required to use plyr or reshape (not sure which one) unless they explicitly call these packages (hopefully I'm nor misparaphrasing him). - Tyler Rinker
I appreciate the help everyone but it seems my attempt to provide minimal code makes the problem people solved not actually work for my problem. I've sliced it down as much as I think I can but wanted to include everything in case something I'm doing with user defined functions is messing up what I want to do. I edited the post to look like my actually post. Thank you for your help thus far and I apologize my minimal code didn't convey what was needed to solve the problem. - Tyler Rinker
Just to chime in regarding the quote "it's better practice to code...[with] less packages." In general practice, the number of dependencies isn't as big a deal as code maintainability. Clean and succinct code is more maintainable than the alternatives. When there are package dependencies, regression tests can help identify when changes will break the dependent code. Keeping code simple also makes it easier to evolve the code. Base R is good, but there are times when it is helpful to use other functionality, such as for code simplicity, speed (e.g. the .parallel flags), and reusability. - Iterator

4 Answers

3
votes

How about

d1 <- dlply(DF, .(sex, adult), summarise, words=as.vector(unlist(dia2word)))
d2 <- dlply(DF, .(person), summarise, words=as.vector(unlist(dia2word)))
ff <- function(x) {
    u <- unlist(x)
    data.frame(words=u,
             row.names=seq(length(u)),
             stringsAsFactors=FALSE)
}
d1B <- with(DF,lapply(split(dia2word,list(adult,sex)),ff))
all.equal(d1,d1B,check.attributes=FALSE) ## TRUE
d2B <- with(DF,lapply(split(dia2word,person),ff))
all.equal(d2,d2B,check.attributes=FALSE) ## TRUE

edit: I haven't looked at your code closely, but it seems that your issue might be with specifying the components to be isolated as strings. Here is a variant that might work better in code.

target <- "dia2word"
categ <- c("adult","sex")
d1C <- lapply(split(DF[[target]],lapply(categ,getElement,object=DF)),ff)
all.equal(d1,d1B,d1C,check.attributes=FALSE)
categ <- "person"
d2C <- lapply(split(DF[[target]],lapply(categ,getElement,object=DF)),ff)
all.equal(d2,d2B,d2C,check.attributes=FALSE)
2
votes

tapply should get you there.

> tapply(DF$dia2word, DF[, c('sex', 'adult')], function(x) as.vector(unlist(x)))
   adult
sex 0            1          
  f Character,10 Character,7
  m Character,35 Character,4

Then it'll just be a bit more formatting if you want to mimic the 1d named list as well...

0
votes

Not an answer but an attempt to include the suggestions into an answer

ATTEMPT WITH THE lapply suggestion

textLISTER<-function(dataframe, text.var, group.vars){
    #require(plyr)
    DF<-dataframe
    DF$dia2word<-Trim(as.character(bracketX(dataframe[,text.var])))
    DF$dia2word<-as.vector(word.split(strip(DF$dia2word)))
    #dlply(DF, c(group.vars), summarise, words=as.vector(unlist(dia2word)))

ff <- function(x) {
    u <- unlist(x)
    data.frame(words=u,
             row.names=seq(length(u)),
             stringsAsFactors=FALSE)
}
with(DF,lapply(split(dia2word,list(group.vars)),ff))
}
#================================================================
#THE TEST
textLISTER(DATA, 'state', 'person')
textLISTER(DATA, 'state', c('sex','adult'))

ATTEMPT WITH THE tapply suggestion

textLISTER <- function(dataframe, text.var, group.vars) {
    #require(plyr)
    DF <- dataframe
    DF$dia2word <- Trim(as.character(bracketX(dataframe[, text.var])))
    DF$dia2word <- as.vector(word.split(strip(DF$dia2word)))
    #dlply(DF, c(group.vars), summarise,
    #   words=as.vector(unlist(dia2word)))
    tapply(DF$dia2word, DF[, c(group.vars)], function(x) as.vector(unlist(x)))
} 
#================================================================
#THE TEST
textLISTER(DATA, 'state', 'person')
textLISTER(DATA, 'state', c('sex','adult'))
0
votes

This is what worked using Ben Bolker's suggestion. Posting this to complete the thread.

textLISTER <- function(dataframe, text.var, group.vars) {
    reducer <- function(x) gsub(" +", " ", x)
    DF <- dataframe
    DF$dia2word <- Trim(as.character(bracketX(dataframe[, text.var])))
    DF$dia2word <- as.vector(word.split(reducer(strip(DF$dia2word))))

    ff <- function(x) {
        u <- unlist(x)
        data.frame(words = u, row.names = seq(length(u)), stringsAsFactors = FALSE)
    }
    lapply(split(DF[["dia2word"]], lapply(group.vars, getElement, 
        object = DF)), ff)
} 

Thank you all for bearing with me through an obviously bloated post. I hated doing that but it seemed to me to be the only way to capture what was happening.