
I'm trying to build a PRAW scraper that pulls the comments from a list of submission IDs (sub_ids), but it only returns the comment data for the last sub_id.

I'm guessing I must be overwriting something. I've looked through other questions, but PRAW has its own specific parameters, and I can't figure out what could or should be replaced.

sub_ids = ["2ypash", "7ocvlb", "7okxkf"]

for sub_id in sub_ids:

    submission = reddit.submission(id=sub_id)

    submission.comments.replace_more(limit=None, threshold=0)

comments = submission.comments.list()

commentlist = []
for comment in comments:

    commentsdata = {}
    commentsdata["id"] = comment.id
    commentsdata["subreddit"] = str(submission.subreddit)
    commentsdata["thread"] = str(submission.title)
    commentsdata["author"] = str(comment.author)
    commentsdata["body"] = str(comment.body)
    commentsdata["score"] = comment.score
    commentsdata["created_utc"] = datetime.datetime.fromtimestamp(comment.created_utc)
    commentsdata["parent_id"] = comment.parent_id

    commentlist.append(commentsdata)

1 Answer


Indentation was your downfall. Your code fails because comments = submission.comments.list() sits outside the sub_ids loop, so it runs only once, after the loop has finished. By then, submission refers to the last sub_id, so when you iterate through comments you only get that last submission's comments.
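The same effect can be reproduced without PRAW. The sketch below is only an illustration: fake_threads, buggy, and fixed are made-up names standing in for the submissions and the two loop layouts, not anything from your code or from PRAW.

```python
# Stand-in data: each fake "submission" id maps to a list of comment strings.
fake_threads = {
    "a": ["a1", "a2"],
    "b": ["b1"],
    "c": ["c1", "c2", "c3"],
}


def buggy(ids):
    for thread_id in ids:
        current = fake_threads[thread_id]  # reassigned on every pass
    # This line runs once, after the loop, so `current` is the last thread only.
    return list(current)


def fixed(ids):
    collected = []  # created once, before the loop
    for thread_id in ids:
        current = fake_threads[thread_id]
        collected.extend(current)  # runs once per thread
    return collected


print(buggy(["a", "b", "c"]))  # ['c1', 'c2', 'c3']
print(fixed(["a", "b", "c"]))  # ['a1', 'a2', 'b1', 'c1', 'c2', 'c3']
```

The only difference between the two functions is where the "use the data" step sits relative to the loop body.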

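First, move commentlist = [] above both for loops (so that it sits right after the sub_ids assignment on line 1 of your code). That way the list is created once and keeps accumulating across all submissions.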

Second, everything from comments = submission.comments.list() (inclusive) onward needs to be indented one level so that it runs inside the sub_ids loop.

Here is what the final code should look like:

import datetime

# Assumes `reddit` is an already-configured praw.Reddit instance.
sub_ids = ["2ypash", "7ocvlb", "7okxkf"]
commentlist = []  # created once, so it accumulates across all submissions

for sub_id in sub_ids:

    submission = reddit.submission(id=sub_id)
    # Expand every "load more comments" placeholder before flattening.
    submission.comments.replace_more(limit=None, threshold=0)
    comments = submission.comments.list()

    # Now inside the sub_ids loop, so this runs for every submission.
    for comment in comments:

        commentsdata = {}
        commentsdata["id"] = comment.id
        commentsdata["subreddit"] = str(submission.subreddit)
        commentsdata["thread"] = str(submission.title)
        commentsdata["author"] = str(comment.author)
        commentsdata["body"] = str(comment.body)
        commentsdata["score"] = comment.score
        commentsdata["created_utc"] = datetime.datetime.fromtimestamp(comment.created_utc)
        commentsdata["parent_id"] = comment.parent_id

        commentlist.append(commentsdata)
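If you want to check the loop structure without hitting the Reddit API, one option is to wrap it in a function and feed it stand-in objects. This is only a sketch under assumptions: collect_comments, FakeReddit, FakeComments, and make_comment are hypothetical names of mine, and the fakes mimic only the few attributes and methods used above. It also passes tz=datetime.timezone.utc, which is my own change: created_utc is a UTC epoch timestamp, and a bare fromtimestamp call converts it to local time.

```python
import datetime
from types import SimpleNamespace


def collect_comments(reddit, sub_ids):
    """Return one dict per comment across all submissions in sub_ids."""
    rows = []
    for sub_id in sub_ids:
        submission = reddit.submission(id=sub_id)
        submission.comments.replace_more(limit=None, threshold=0)
        for comment in submission.comments.list():
            rows.append({
                "id": comment.id,
                "subreddit": str(submission.subreddit),
                "thread": str(submission.title),
                "author": str(comment.author),
                "body": str(comment.body),
                "score": comment.score,
                "created_utc": datetime.datetime.fromtimestamp(
                    comment.created_utc, tz=datetime.timezone.utc
                ),
                "parent_id": comment.parent_id,
            })
    return rows


# --- Hypothetical stand-ins, only for exercising the loop offline ---
class FakeComments:
    def __init__(self, comments):
        self._comments = comments

    def replace_more(self, limit=None, threshold=0):
        return []  # nothing to expand in the fake

    def list(self):
        return list(self._comments)


class FakeReddit:
    def __init__(self, threads):
        self._threads = threads

    def submission(self, id):
        title, comments = self._threads[id]
        return SimpleNamespace(
            title=title, subreddit="test", comments=FakeComments(comments)
        )


def make_comment(comment_id, parent_id):
    return SimpleNamespace(
        id=comment_id, author="someone", body="text",
        score=1, created_utc=0, parent_id=parent_id,
    )


fake = FakeReddit({
    "x1": ("First thread", [make_comment("c1", "t3_x1"), make_comment("c2", "t1_c1")]),
    "x2": ("Second thread", [make_comment("c3", "t3_x2")]),
})
rows = collect_comments(fake, ["x1", "x2"])
print([row["id"] for row in rows])  # ['c1', 'c2', 'c3']
```

With a real praw.Reddit instance, the call would be collect_comments(reddit, sub_ids); the returned list plays the role of commentlist.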