0
votes

I have a set of rows and columns of data of posts and comments on Reddit where a row represents a post and it's comment. Since one post could contain multiple comments, I have rows with the same id (post id) and different comment ids. I want to merge the rows with the same id for one row and have all the different comment ids in the column - 'comment id' separated by commas. But also since the post data (title, body etc.) are duplicated (as shown in the attached image), I don't need them merged as only one occurrence per row.

![duplicate rows][1] 1

I could merge the comment information for relevant columns separated by commas but I don't know how to get the one occurrence of the duplicated post information which does not need merging.

all_reddits <- all_posts_and_comments %>%
  group_by(id) %>%
  summarise(
    comment_id = paste(comment_id, collapse=","),
    comment_author = paste(comment_author, collapse = ","),
    comment_body = paste(comment_body, collapse = ","),
    comment_score = paste(comment_score, collapse = ","),
    comment_created_date = paste(comment_created_date, collapse = ","),
    comment_link = paste(comment_link, collapse=",")
  )

I have tried summarise_all() and summarise_at() of R: dplyr but I keep getting errors.

1

1 Answers

0
votes

I believe you can just use a summary statement like post_datapoint_xyz = first(post_datapoint_xyz) for each of the post-columns to address your problem.

What also works would be to simply include all the post columns into your group_by list :)
// update: you can do this quickly by writing it as

all_reddits <- all_posts_and_comments %>%
  group_by_at(vars(starts_with('post_'))) %>%
  summarise(...)

summarise_all or summarise_at won't get you far here, as you want to apply two different functions (paste & first) to two different groups of variables (comments & posts) and the two summarise_* functions don't support that.
There's a bunch of suggestions in this thread, but those might be overkill for your use-case...


However, on a side note: you might not be doing yourself a big favor by just concatenating each comment column individually, as you're losing the relation between them in doing so. (imagine one of the comments containing a comma, you wouldn't be able to reverse this summary.)

If you really want to retain the full comment data in your result, you could consider creating only one summary column of all the comments where you store them as a list or nested df.

// update: if you want to do this, you can use the nest function like so:

all_reddits <- all_posts_and_comments %>%
  nest(starts_with('comment_'), .key = 'comments')

If you desire to unnest these comments later, the unnest function can help.

all_posts_and_comments.2 <- all_reddits %>%
  unnest(comments)