【Question Title】: Using praw to scrape a list of subreddits: "TypeError: 'Subreddit' object is not iterable"
【Posted】: 2020-05-14 03:14:40
【Question】:

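I'm using praw with Python 3 to scrape posts and comments from a list of subreddits. The code previously worked for a single subreddit, and also for a list of [j] search terms within a list of [i] subreddits. I removed the search-term list and just want it to iterate over the list of subreddits, but I keep getting "TypeError: 'Subreddit' object is not iterable". I don't understand what is going on.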

subs= ["sub1","sub2", "sub3", "sub4"]

commentsDict = {"comment_user": [], "comment_text":[], "comment_score":[], "comment_date":[] }
postsDict = {"post_title" : [], "post_score" : [], "post_comments_num":[], "post_date":[], \
                "post_user":[], "post_text":[], "post_id":[]}

for i in range(len(subs)):
    for submission in reddit.subreddit(subs[i]):
        submission.comment_sort = 'new'
        comments = list(submission.comments)
        for comments in submission.comments:
            postsDict["post_title"].append(submission.title)#title of post with comment
            postsDict["post_score"].append(submission.score)#upvotes-downvotes
            postsDict["post_text"].append(submission.selftext)#get body of post
            postsDict["post_id"].append(submission.id)#unique id address for post
            postsDict["post_user"].append(submission.author)  #user name of poster
            postsDict["post_comments_num"].append(submission.num_comments) #number of comments on post
            date = submission.created_utc                                  #create variable for date
            timestamp = datetime.datetime.fromtimestamp(date)              #create variable to translate unix date 
            postsDict["post_date"].append(timestamp.strftime('%Y-%m-%D %H:%M:%S')) #extract date and add to dict
            for top_level_comment in submission.comments:                   #create loop for extracting comments
                if isinstance(top_level_comment, MoreComments):
                    continue
            submission.comments.replace_more(limit=None)                   #tell Praw to click more comments and get those too
            commentsDict["comment_user"].append(comments.author)              #get comment username
            commentsDict["comment_score"].append(comments.score)            #comment upvotes-downvotes
            date = comments.created                                         #same date as above but for comments
            timestamp = datetime.datetime.fromtimestamp(date)
            commentsDict["comment_date"].append(timestamp.strftime('%Y-%m-%D %H:%M:%S')) #add translated unix date to dict
            commentsDict["comment_text"].append(comments.body)      #get comment text 

Thanks in advance for your help.

【Comments】:

    Tags: python-3.x praw


    【Solution 1】:

    You need to use subreddit.stream.submissions() as a generator in the for loop. For example:

    sub = reddit.subreddit(subreddit_name)
    for submission in sub.stream.submissions():
        # Do stuff with submission
    

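    【Discussion】:

    • Hi @tiega, this is incorrect: streams are designed for processing newly submitted content, and they loop forever (unless you break). That means the code in the post would never get past the first subreddit.
    【Solution 2】:

    First (unrelated to your problem), this loop iterates over the indices of the list subs and then uses each index to fetch the item: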

    for i in range(len(subs)):
        for submission in reddit.subreddit(subs[i]):
    

    Change it to iterate over the subreddit names directly:

    for subreddit in subs:
        for submission in reddit.subreddit(subreddit):
    

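    Now to fix your PRAW error: you can't iterate over a subreddit directly (for submission in reddit.subreddit(subreddit)). You have to specify which listing to iterate over (for example new, hot, or top). You can see the available listings in the PRAW documentation for Subreddit. These listings correspond to the various tabs you see when viewing a subreddit on the web.

    For example, to use the hot listing: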

    for subreddit in subs:
        for submission in reddit.subreddit(subreddit).hot():
    

    If you want to limit the number of posts returned, you can use the limit parameter:

    for subreddit in subs:
        for submission in reddit.subreddit(subreddit).hot(limit=5):
    

    The code above will give you at most 5 submissions per subreddit.
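Putting the two loops above together, a minimal sketch might look like the following. The function name collect_posts and the dictionary keys are illustrative assumptions, not part of PRAW, and reddit is assumed to be a praw.Reddit instance created elsewhere. The date format uses %d (day of the month) rather than the %D directive that appears in the question.

```python
import datetime


def collect_posts(reddit, subs, limit=5):
    """Collect basic post data from the hot listing of each subreddit.

    `reddit` is assumed to be a praw.Reddit instance created elsewhere;
    the function name and dictionary keys are illustrative only.
    """
    posts = []
    for name in subs:
        # A Subreddit object is not iterable, but its listings are.
        for submission in reddit.subreddit(name).hot(limit=limit):
            created = datetime.datetime.fromtimestamp(
                submission.created_utc, tz=datetime.timezone.utc
            )
            posts.append({
                "subreddit": name,
                "post_id": submission.id,
                "post_title": submission.title,
                "post_score": submission.score,
                # %d is the zero-padded day of the month.
                "post_date": created.strftime("%Y-%m-%d %H:%M:%S"),
            })
    return posts
```

With real credentials this would be called as collect_posts(reddit, ["sub1", "sub2"], limit=5). Swapping .hot for .new or .top selects a different listing.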

    The rest of your code does a few unorthodox things. I commented on one of them on your previous post, namely this:

    comments = list(submission.comments)
    for comments in submission.comments:
    

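    You set comments to a value and then never use it, because it is redefined on the next line. I would remove the comments = line, since it does nothing.

    Also, for every comment in the post, you loop over all of the post's comments again without doing anything: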

    for top_level_comment in submission.comments:                   #create loop for extracting comments
        if isinstance(top_level_comment, MoreComments):
            continue
    

    I don't know what you intended this code to do, but right now it does nothing except waste time, so I would remove it as well.
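If the aim is to collect every comment on a post, one common approach is to expand the "more comments" placeholders once and then walk the flattened comment tree. The sketch below assumes submission is a PRAW Submission; collect_comments and the dictionary keys are hypothetical names. It relies on the comment forest's replace_more() and list() methods, and expanding everything with limit=None can require many requests on large threads.

```python
import datetime


def collect_comments(submission):
    """Return one dict per comment on `submission` (illustrative helper)."""
    # The sort order has to be set before the comments are first accessed.
    submission.comment_sort = "new"
    # Replace every "more comments" placeholder with the real comments.
    submission.comments.replace_more(limit=None)

    rows = []
    # list() flattens the tree, so replies are included as well.
    for comment in submission.comments.list():
        created = datetime.datetime.fromtimestamp(
            comment.created_utc, tz=datetime.timezone.utc
        )
        rows.append({
            "post_id": submission.id,
            # author is None when the account has been deleted
            "comment_user": comment.author.name if comment.author else None,
            "comment_score": comment.score,
            "comment_date": created.strftime("%Y-%m-%d %H:%M:%S"),
            "comment_text": comment.body,
        })
    return rows
```

Calling replace_more once per submission, outside the comment loop, avoids repeating the expansion for every comment.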

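    【Discussion】:

    • Thanks again for your help, Jarhill0. I added ```for top_level_comment in submission.comments: #create loop for extracting comments if isinstance(top_level_comment, MoreComments): continue``` as a result of my last question about comments. My goal is to scrape all posts and comments within a given time frame. My understanding is that when PRAW reaches a "more comments" option on reddit (the part I would click if I were browsing manually), it gets confused and errors out. I appreciate your patience; I'm very new to this.
    • In that case, you should add if isinstance(top_level_comment, MoreComments): continue directly below for comment in submission.comments: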
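The check suggested in the comment above can be sketched as a small generator. The helper name top_level_comments and its more_type parameter are assumptions made so that the snippet runs without PRAW installed; with PRAW, the class to pass would be praw.models.MoreComments.

```python
def top_level_comments(submission, more_type):
    """Yield the top-level comments of `submission`, skipping placeholders.

    `more_type` is the class to skip. With PRAW this would be
    praw.models.MoreComments; taking it as a parameter is an assumption
    of this sketch, not a PRAW convention.
    """
    for comment in submission.comments:
        if isinstance(comment, more_type):
            continue  # a placeholder for collapsed comments, not a real comment
        yield comment
```

The isinstance check uses the same loop variable that the for statement defines, so the skip happens before any comment attribute is read.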