Using praw to scrape a list of subreddits: "TypeError: 'Subreddit' object is not iterable"

【Question title】: Using praw to scrape a list of subreddits: "TypeError: 'Subreddit' object is not iterable"
【Posted】: 2020-05-14 03:14:40
【Question】:

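I am using praw with Python 3 to scrape posts and comments from a list of subreddits. The code previously worked for a single subreddit, and also for a list of [j] search terms across a list of [i] subreddits. I removed the search-term list and just want it to iterate over the list of subreddits, but I keep getting "TypeError: 'Subreddit' object is not iterable". I don't understand what is going on.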

subs= ["sub1","sub2", "sub3", "sub4"]

commentsDict = {"comment_user": [], "comment_text":[], "comment_score":[], "comment_date":[] }
postsDict = {"post_title" : [], "post_score" : [], "post_comments_num":[], "post_date":[], \
                "post_user":[], "post_text":[], "post_id":[]}

for i in range(len(subs)):
    for submission in reddit.subreddit(subs[i]):
        submission.comment_sort = 'new'
        comments = list(submission.comments)
        for comments in submission.comments:
            postsDict["post_title"].append(submission.title)#title of post with comment
            postsDict["post_score"].append(submission.score)#upvotes-downvotes
            postsDict["post_text"].append(submission.selftext)#get body of post
            postsDict["post_id"].append(submission.id)#unique id address for post
            postsDict["post_user"].append(submission.author)  #user name of poster
            postsDict["post_comments_num"].append(submission.num_comments) #number of comments on post
            date = submission.created_utc                                  #create variable for date
            timestamp = datetime.datetime.fromtimestamp(date)              #create variable to translate unix date 
            postsDict["post_date"].append(timestamp.strftime('%Y-%m-%D %H:%M:%S')) #extract date and add to dict
            for top_level_comment in submission.comments:                   #create loop for extracting comments
                if isinstance(top_level_comment, MoreComments):
                    continue
            submission.comments.replace_more(limit=None)                   #tell Praw to click more comments and get those too
            commentsDict["comment_user"].append(comments.author)              #get comment username
            commentsDict["comment_score"].append(comments.score)            #comment upvotes-downvotes
            date = comments.created                                         #same date as above but for comments
            timestamp = datetime.datetime.fromtimestamp(date)
            commentsDict["comment_date"].append(timestamp.strftime('%Y-%m-%D %H:%M:%S')) #add translated unix date to dict
            commentsDict["comment_text"].append(comments.body)      #get comment text

Thanks in advance for your help.

【Question comments】:

Tags: python-3.x praw

【Solution 1】:

You need to use subreddit.stream.submissions() as a generator in the for loop. For example:

sub = reddit.subreddit(subreddit_name)
for submission in sub.stream.submissions():
    # Do stuff with submission
    pass

【Comments】:

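Hi @tiega, this is incorrect: streams are designed for processing newly submitted content, and they loop forever unless you break out of them. This means the code in the post would never get past the first subreddit.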
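To make the point about endless streams concrete, here is a minimal sketch that does not touch PRAW or the network at all. `fake_stream` is a hypothetical stand-in for a stream that never ends, and `first_items` is an illustrative helper; neither is part of any library.

```python
import itertools


def fake_stream(name):
    """Hypothetical stand-in for an endless stream: it never stops yielding."""
    for n in itertools.count(1):
        yield f"{name}-post-{n}"


def first_items(subs, per_sub):
    """Take a bounded number of items per subreddit so the outer loop can advance."""
    seen = []
    for name in subs:
        # Without islice (or an explicit break) this inner loop would never
        # finish, and the second subreddit would never be reached.
        for item in itertools.islice(fake_stream(name), per_sub):
            seen.append(item)
    return seen


print(first_items(["sub1", "sub2"], 2))
```

With an unbounded inner loop, only "sub1" would ever be processed, which is the behavior the comment warns about.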

【Solution 2】:

First (unrelated to your problem), this loop iterates over the indices of the list subs and then uses each index to fetch the item:

for i in range(len(subs)):
    for submission in reddit.subreddit(subs[i]):

Change it to iterate over the subreddit names directly:

for subreddit in subs:
    for submission in reddit.subreddit(subreddit):

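Now to fix your PRAW error: you can't just iterate over a subreddit (for submission in reddit.subreddit(subreddit)). You have to specify which listing to iterate over (for example new, hot, or top). You can see the available listings in the PRAW documentation for Subreddit. These listings correspond to the various tabs you see when viewing a subreddit on the web.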

For example, using the hot listing:

for subreddit in subs:
    for submission in reddit.subreddit(subreddit).hot():

If you want to specify the number of posts returned, you can use the limit parameter:

for subreddit in subs:
    for submission in reddit.subreddit(subreddit).hot(limit=5):

The code above will give you at most 5 submissions per subreddit.
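A minimal sketch of how the listing loop might be wrapped up, assuming `reddit` is an already-configured `praw.Reddit` instance passed in by the caller. The function name `hot_titles` and its return shape are illustrative only and are not part of PRAW.

```python
def hot_titles(reddit, subs, limit=5):
    """Return {subreddit name: [post titles]} using the hot listing.

    `reddit` is assumed to be an authenticated praw.Reddit instance;
    the function name and return shape are illustrative.
    """
    titles = {}
    for name in subs:
        # .hot(limit=...) yields at most `limit` submissions for this subreddit.
        titles[name] = [s.title for s in reddit.subreddit(name).hot(limit=limit)]
    return titles
```

Calling `hot_titles(reddit, subs, limit=5)` with a configured client should return at most five titles per subreddit. Swapping `.hot` for `.new` or `.top` selects a different listing.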

The rest of your code does a few unorthodox things. I commented on one of them on your previous post, namely this:

comments = list(submission.comments)
for comments in submission.comments:

You set comments to a value and then never use it, because the name is rebound on the very next line. I would delete the comments = line, since it does nothing.
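The rebinding can be seen in a few lines of plain Python. This sketch uses a made-up list of strings in place of real comments.

```python
items = ["first", "second"]  # made-up stand-ins for comment objects

comments = list(items)   # this value is assigned...
for comments in items:   # ...and immediately rebound by the loop variable
    pass

# After the loop the name refers to the last element, not to the list.
print(type(comments).__name__)
```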

In addition, for every comment in the post, you iterate over all of the post's comments but do nothing with them:

for top_level_comment in submission.comments:                   #create loop for extracting comments
    if isinstance(top_level_comment, MoreComments):
        continue

I don't know what you want this code to do, but right now it does nothing except waste time, so I would delete it as well.

【Comments】:

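Thanks again for your help, Jarhill0. I added ```for top_level_comment in submission.comments: #create loop for extracting comments if isinstance(top_level_comment, MoreComments): continue``` as a result of my last comments question. My goal is to scrape all posts and comments within a given time frame. My understanding is that when PRAW reaches the "more comments" option on reddit (the part I would click on if I were browsing manually), it gets confused and raises an error. I appreciate your patience; I am very new to this.
In that case, you should add if isinstance(top_level_comment, MoreComments): continue directly below for comment in submission.comments: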
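Putting the thread's suggestions together, one possible shape for the scraping loop is sketched below. It rests on several assumptions: `reddit` is an authenticated `praw.Reddit` instance supplied by the caller, the hot listing with a small `limit` is only a placeholder choice, and the names `scrape` and `format_utc` are illustrative. Instead of skipping `MoreComments` objects with an isinstance check, it calls `replace_more(limit=None)` before iterating, so no placeholders remain. It also records each post once rather than once per comment, and uses `%d` where the question has `%D`, on the assumption that day-of-month was intended.

```python
from datetime import datetime, timezone


def format_utc(ts):
    """Format a Unix timestamp as 'YYYY-MM-DD HH:MM:SS' in UTC."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def scrape(reddit, subs, limit=5):
    """Collect posts and top-level comments for each subreddit in `subs`.

    `reddit` is assumed to be an authenticated praw.Reddit instance.
    """
    posts = {"post_title": [], "post_score": [], "post_comments_num": [],
             "post_date": [], "post_user": [], "post_text": [], "post_id": []}
    comments = {"comment_user": [], "comment_text": [],
                "comment_score": [], "comment_date": []}

    for name in subs:
        for submission in reddit.subreddit(name).hot(limit=limit):
            submission.comment_sort = "new"
            # Resolve every "more comments" placeholder before iterating.
            # This can be slow on large threads.
            submission.comments.replace_more(limit=None)

            # Record the post once (a design choice of this sketch).
            posts["post_title"].append(submission.title)
            posts["post_score"].append(submission.score)
            posts["post_comments_num"].append(submission.num_comments)
            posts["post_date"].append(format_utc(submission.created_utc))
            posts["post_user"].append(submission.author)
            posts["post_text"].append(submission.selftext)
            posts["post_id"].append(submission.id)

            # Iterating submission.comments visits top-level comments only.
            for comment in submission.comments:
                comments["comment_user"].append(comment.author)
                comments["comment_text"].append(comment.body)
                comments["comment_score"].append(comment.score)
                comments["comment_date"].append(format_utc(comment.created_utc))

    return posts, comments
```

If replies are wanted as well, `submission.comments.list()` returns a flattened list of the whole comment tree once `replace_more` has run.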