Using the Nested List Column Approach and Purrr Together with Tidytext::Unnest_Tokens

Question

I have a dataframe that contains survey responses, with each row representing a different person. One column, "Text", is an open-ended text question. I would like to use tidytext::unnest_tokens so that I can do text analysis row by row, including sentiment scores, word counts, etc.

Here is the simple dataframe for this example:

Satisfaction<-c ("Satisfied","Satisfied","Dissatisfied","Satisfied","Dissatisfied")
Text<-c("I'm very satisfied with the services", "Your service providers are always late which causes me a lot of frustration", "You should improve your staff training, service providers have bad customer service","Everything is great!","Service is bad")
Gender<-c("M","M","F","M","F")
df<-data.frame(Satisfaction,Text,Gender)

I then turned the Text column into character...

df$Text<-as.character(df$Text)

Next I added an id column, grouped by it, tokenized the Text column, and nested the dataframe.

df<-df%>%mutate(id=row_number())%>%group_by(id)%>%unnest_tokens(word,Text)%>%nest(-id)

Getting this far seems to have worked ok, but now how do I use purrr::map functions to work on the nested list column "word"? For example, if I want to create a new column using dplyr::mutate with word counts for each row?

Also, is there a better way to nest the dataframe so that only the "Text" column is a nested list?

It is not very clear what you want. You can do text analysis without having to use tidyr::nest; just stop after unnest_tokens. If you want to nest only the word column you can do nest(word), but for it to work you have to ungroup the data frame first (or not group by id in the first place). - FlorianGD

Julia Silge · Accepted Answer · 2017-04-07T19:43:13

I love using purrr::map to do modeling for different groups, but for what you are talking about doing, I think you can stick with just straight dplyr.

You can set up your dataframe like this:

library(dplyr)
library(tidytext)

Satisfaction <- c("Satisfied",
                  "Satisfied",
                  "Dissatisfied",
                  "Satisfied",
                  "Dissatisfied")

Text <- c("I'm very satisfied with the services",
          "Your service providers are always late which causes me a lot of frustration", 
          "You should improve your staff training, service providers have bad customer service",
          "Everything is great!",
          "Service is bad")

Gender <- c("M","M","F","M","F")

df <- tibble(Satisfaction, Text, Gender)  # data_frame() is deprecated in favor of tibble()

tidy_df <- df %>% 
    mutate(id = row_number()) %>% 
    unnest_tokens(word, Text)

Then to find, for example, the number of words per response, you can use group_by and mutate.

tidy_df %>%
    group_by(id) %>%
    mutate(num_words = n()) %>%
    ungroup()
#> # A tibble: 37 × 5
#>    Satisfaction Gender    id      word num_words
#>           <chr>  <chr> <int>     <chr>     <int>
#> 1     Satisfied      M     1       i'm         6
#> 2     Satisfied      M     1      very         6
#> 3     Satisfied      M     1 satisfied         6
#> 4     Satisfied      M     1      with         6
#> 5     Satisfied      M     1       the         6
#> 6     Satisfied      M     1  services         6
#> 7     Satisfied      M     2      your        13
#> 8     Satisfied      M     2   service        13
#> 9     Satisfied      M     2 providers        13
#> 10    Satisfied      M     2       are        13
#> # ... with 27 more rows
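
If you do want the nested list-column version the question asks about, here is a minimal sketch. It assumes tidyr 1.0 or later (for the nest(words = word) syntax) and purrr; the column name words and the object nested_df are just illustrative names, not anything from the original post.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(tidytext)

# Same survey text as above, trimmed to the columns needed here
df <- tibble(
  Satisfaction = c("Satisfied", "Satisfied", "Dissatisfied",
                   "Satisfied", "Dissatisfied"),
  Text = c("I'm very satisfied with the services",
           "Your service providers are always late which causes me a lot of frustration",
           "You should improve your staff training, service providers have bad customer service",
           "Everything is great!",
           "Service is bad")
)

nested_df <- df %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, Text) %>%
  # Nest only the token column; every other column stays as a regular column
  nest(words = word) %>%
  # Each element of `words` is a one-column tibble, so nrow() is its word count
  mutate(num_words = map_int(words, nrow))
```

This gives one row per respondent with a list-column of tokens and an integer word count, which is the shape to use if you later want to map a model or another per-respondent function over words.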

You can do sentiment analysis by using an inner join between the tokens and a sentiment lexicon; the tidytext documentation has worked examples.
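
As a rough sketch of that inner-join approach, assuming the Bing lexicon that get_sentiments("bing") returns (columns word and sentiment); the summary columns positive, negative, and score are names chosen for this example:

```r
library(dplyr)
library(tidytext)

# Same survey text as above
df <- tibble(
  Text = c("I'm very satisfied with the services",
           "Your service providers are always late which causes me a lot of frustration",
           "You should improve your staff training, service providers have bad customer service",
           "Everything is great!",
           "Service is bad")
)

tidy_df <- df %>%
  mutate(id = row_number()) %>%
  unnest_tokens(word, Text)

sentiment_by_id <- tidy_df %>%
  # Keep only tokens that appear in the lexicon, tagged positive or negative
  inner_join(get_sentiments("bing"), by = "word") %>%
  group_by(id) %>%
  summarise(
    positive = sum(sentiment == "positive"),
    negative = sum(sentiment == "negative"),
    score = positive - negative
  )
```

Note that a response with no words in the lexicon drops out of the result entirely because of the inner join, so you may want to left_join the scores back onto the original ids.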
