How do I collect multiple tweets from multiple users with Tweepy?

【Question title】: How do I collect multiple tweets from multiple users with Tweepy?
【Posted】: 2021-01-14 04:42:43
【Question description】:

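I know similar questions have been asked, but the project I'm working on uses Tweepy for Python, so this one is a bit more specific.

I'm collecting a thousand user IDs from the followers of Coca-Cola and Pepsi, then searching each user's 20 most recent statuses to collect the hashtags they use.

I'm using Tweepy's followers_ids and user_timeline APIs, but I keep getting 401s from Twitter. If I set the number of user IDs to search to just 10 rather than 1000, I sometimes get the results I want, but even then I sometimes get a 401. So it works... sort of. It seems to be the large collections that cause these errors, and I don't know how to get around them.

I know Twitter limits API calls, but if I can fetch 1000 user IDs almost instantly, why can't I fetch the statuses? I realize I'm trying to get 20,000 statuses, but I've also tried just 100*20 and even 50*20 and still get a 401. I've reset my system clock several times, but that only occasionally helps with the 10*20 setting. I'm hoping someone out there has a better, more efficient way of doing this than what I currently have. I'm new to the Twitter API and fairly new to Python as well, so hopefully it's just me.

The code is as follows: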

import tweepy
import pandas as pd

consumer_key = 'REDACTED'
consumer_secret = 'REDACTED'
access_token = 'REDACTED'
access_token_secret = 'REDACTED'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.secure = True
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

pepsiUsers = []
cokeUsers = []
cur_pepsiUsers = tweepy.Cursor(api.followers_ids, screen_name='pepsi')
cur_cokeUsers = tweepy.Cursor(api.followers_ids, screen_name='CocaCola')

for user in cur_pepsiUsers.items(1000):
    pepsiUsers.append({ 'userId': user, 'hTags': [], 'favSoda': 'Pepsi' })
    for status in tweepy.Cursor(api.user_timeline, user).items(20):
        status = status._json
        hashtags = status['entities']['hashtags']
        index = len(pepsiUsers) - 1
        if hashtags:  # also keep tweets that carry exactly one hashtag
            for ht in hashtags:
                pepsiUsers[index]['hTags'].append(ht['text'])

for user in cur_cokeUsers.items(1000):
    cokeUsers.append({ 'userId': user, 'hTags': [], 'favSoda': 'Coke' })
    for status in tweepy.Cursor(api.user_timeline, user).items(20):
        status = status._json
        hashtags = status['entities']['hashtags']
        index = len(cokeUsers) - 1
        if hashtags:  # also keep tweets that carry exactly one hashtag
            for ht in hashtags:
                cokeUsers[index]['hTags'].append(ht['text'])

"""create a master list of coke and pepsi users to write to CSV"""
mergedList = cokeUsers + pepsiUsers
# Turn each user's hashtag list into a single space-separated string
# (an empty list becomes an empty string) for easier searching with R later.
for i in mergedList:
    i['hTags'] = ' '.join(i['hTags'])

list_df = pd.DataFrame(mergedList, columns=['userId', 'favSoda', 'hTags'])
list_df.to_csv('test.csv', index=False)

This is the error I get when I try to run the blocks that call api.user_timeline:

---------------------------------------------------------------------------
TweepError                                Traceback (most recent call last)
<ipython-input-134-a7658ed899f3> in <module>()
      3 for user in cur_pepsiUsers.items(1000):
      4     pepsiUsers.append({ 'userId': user, 'hTags': [], 'favSoda': 'Pepsi' })
----> 5     for status in tweepy.Cursor(api.user_timeline, user).items(20):
      6         status = status._json
      7         hashtags = status['entities']['hashtags']

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/cursor.py in __next__(self)
     47 
     48     def __next__(self):
---> 49         return self.next()
     50 
     51     def next(self):

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/cursor.py in next(self)
    195         if self.current_page is None or self.page_index == len(self.current_page) - 1:
    196             # Reached end of current page, get the next page...
--> 197             self.current_page = self.page_iterator.next()
    198             self.page_index = -1
    199         self.page_index += 1

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/cursor.py in next(self)
    106 
    107         if self.index >= len(self.results) - 1:
--> 108             data = self.method(max_id=self.max_id, parser=RawParser(), *self.args, **self.kargs)
    109 
    110             if hasattr(self.method, '__self__'):

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/binder.py in _call(*args, **kwargs)
    243             return method
    244         else:
--> 245             return method.execute()
    246 
    247     # Set pagination mode

/Users/.../anaconda/lib/python3.5/site-packages/tweepy/binder.py in execute(self)
    227                     raise RateLimitError(error_msg, resp)
    228                 else:
--> 229                     raise TweepError(error_msg, resp, api_code=api_error_code)
    230 
    231             # Parse the response payload

TweepError: Twitter error response: status code = 401

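【Question comments】:

Start with what the 401 HTTP status means. There may be nothing wrong with your code; you just have to live within the limits Twitter imposes.
Yes, I understand what a 401 is, but I also know that the Twitter API can return a 401 for a number of reasons. What may be happening here (and this is my uneducated guess) is that when this many calls take a long time to make, clock creep occurs and the machine making the calls falls out of sync with Twitter. I know Tweepy offers pagination, but I don't fully understand how to use it here.
I'd suggest creating a list, adding the users to it, and then simply reading the tweets for the users in that list.
I did something similar: I first built a list of user IDs, then stepped through that list to get their tweets. But again, if I try to get more than, say, 20 tweets from 2-5 users (I think user_timeline always returns 20 tweets; I set the count argument on the cursor, but it still gives me 20), it fails with a 401.
@visarts I have a similar problem (link below), and I was wondering whether you were able to solve this? stackoverflow.com/questions/63747136/…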
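
The list-first approach suggested in the comments above could look roughly like the sketch below. It assumes the user IDs have already been collected into a list, and it takes the timeline call as a parameter (`fetch_timeline`, a hypothetical name, as are `collect_hashtags` and `errors`) so that a failure for one user is recorded and skipped instead of aborting the whole run. One commonly reported cause of a 401 on a single timeline is a protected account, but that is an assumption about this case, not something the traceback confirms.

```python
def collect_hashtags(user_ids, fetch_timeline, fav_soda, errors=(Exception,)):
    """Build one row per user; users whose timeline cannot be read are skipped.

    fetch_timeline(user_id) must yield status dicts shaped like Twitter's
    JSON, i.e. each has status['entities']['hashtags'] -> [{'text': ...}].
    Returns (rows, skipped_user_ids).
    """
    rows, skipped = [], []
    for user_id in user_ids:
        tags = []
        try:
            # The loop sits inside the try because a lazy cursor only
            # raises once iteration actually triggers the request.
            for status in fetch_timeline(user_id):
                tags.extend(ht['text'] for ht in status['entities']['hashtags'])
        except errors:
            skipped.append(user_id)
            continue
        rows.append({'userId': user_id, 'favSoda': fav_soda,
                     'hTags': ' '.join(tags)})
    return rows, skipped
```

With Tweepy 3.x this might be called with `fetch_timeline=lambda uid: (s._json for s in tweepy.Cursor(api.user_timeline, user_id=uid, count=20).items(20))` and `errors=(tweepy.TweepError,)`. In that call `count` sets the page size per request, while `items(20)` caps the total number of statuses returned.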

Tags: python twitter tweepy

【Answer 1】:

Do you only need the Twitter JSON? Given the scale of your collection, you might try twarc: https://github.com/edsu/twarc

【Comments】:

【Answer 2】:

Try enabling rate-limit handling when you create the API object.

api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True,
                 retry_count=5, retry_delay=15)
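
To get a feel for why rate limits matter at this scale, here is a rough back-of-the-envelope estimate. It assumes one user_timeline request per user and a limit of 900 such requests per 15-minute window; that figure is an assumption based on the v1.1 user-auth limits and should be checked against Twitter's current documentation. The function name `windows_needed` and its parameters are made up for this illustration.

```python
import math

def windows_needed(n_users, requests_per_user=1,
                   limit_per_window=900, window_minutes=15):
    """Estimate how many rate-limit windows a batch of requests spans.

    Returns (windows, minutes_spent_waiting). The default limit is an
    assumed value, not one read from the API.
    """
    total_requests = n_users * requests_per_user
    windows = math.ceil(total_requests / limit_per_window)
    # Waiting only happens between windows, not after the last one.
    return windows, (windows - 1) * window_minutes

# 1000 Pepsi followers plus 1000 Coca-Cola followers, one request each
print(windows_needed(2000))  # (3, 30)
```

Fetching the follower IDs is far cheaper because one followers_ids request returns a whole page of IDs, while each user's timeline costs at least one request of its own.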

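If enabling rate-limit handling does not fully solve the problem, use try/except in Python to catch the error and wait about 15 minutes before retrying.

【Comments】: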
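
The catch-and-wait advice could be sketched as below. This is only an illustration: `call_with_wait` and its parameters are hypothetical names, the 15-minute default mirrors the wait suggested above, and the sleep function is passed in so the helper can be tried out without actually pausing.

```python
import time

def call_with_wait(fetch, retries=3, wait_seconds=15 * 60,
                   retry_on=(Exception,), sleep=time.sleep):
    """Call fetch(); if it raises one of retry_on, wait and try again.

    After `retries` extra attempts the last error is re-raised, so a
    persistent failure is not silently swallowed.
    """
    for attempt in range(retries + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == retries:
                raise
            sleep(wait_seconds)
```

With Tweepy 3.x a call might look like `call_with_wait(lambda: api.user_timeline(user_id=uid, count=20), retry_on=(tweepy.TweepError,))`. Waiting only helps when the error is temporary; if the same user fails on every attempt, skipping that user is probably the better choice.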