Problem with Crawling Mechanism of Twitter Python Crawler: Answers

【Question title】: Problem with Crawling Mechanism of Twitter Python Crawler
【Posted】: 2012-04-16 12:25:37
【Question description】:

Below is a small piece of code from my Twitter crawler mechanism:

from BeautifulSoup import BeautifulSoup
import re
import urllib2

url = 'http://mobile.twitter.com/NYTimesKrugman'

def gettweets(soup):
    tags = soup.findAll('div', {'class' : "list-tweet"})#to obtain tweet of a follower
    for tag in tags: 
        print tag.renderContents()
        print ('\n\n')

def are_more_tweets(soup):#to check whether there is more than one page on mobile   twitter 
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    for link in links:
        b = link.renderContents()
        test_b = str(b)
        if test_b.find('more'):
            return True
        else:
            return False

def getnewlink(soup): #to get the link to go to the next page of tweets on twitter 
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' +c
            return d

def checkforstamp(soup): # the parser scans a webpage to check if any of the tweets are older than 3 months
    times = soup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        test_stamp = str(stamp)
        if test_stamp == '3 months ago':  
            print test_stamp
            return True
        else:
            return False


response = urllib2.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
gettweets(soup)
stamp = checkforstamp(soup)
tweets = are_more_tweets(soup)
print 'stamp' + str(stamp)
print 'tweets' +str (tweets)
while (stamp is False) and (tweets is True): 
    b = getnewlink(soup)
    print b
    red = urllib2.urlopen(b)
    html = red.read()
    soup = BeautifulSoup(html)
    gettweets(soup)
    stamp = checkforstamp(soup)
    tweets = are_more_tweets(soup)
print 'done'

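The problem is that I want my Twitter crawler to stop moving on to the user's next page once it reaches tweets that are about 3 months old. However, it does not seem to do that; it appears to keep searching for the next page of tweets indefinitely. I believe this is because checkforstamp keeps evaluating to False. Does anyone have suggestions for how to modify my code so that the crawler keeps looking for the next page of tweets only as long as there are more tweets (as verified by the are_more_tweets mechanism) and it has not yet reached tweets that are 3 months old? Thanks!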

Edit - please see below:

from BeautifulSoup import BeautifulSoup
import re
import urllib

url = 'http://mobile.twitter.com/cleversallie'
output = open(r'C:\Python28\testrecursion.txt', 'a') 

def gettweets(soup):
    tags = soup.findAll('div', {'class' : "list-tweet"})#to obtain tweet of a follower
    for tag in tags: 
        a = tag.renderContents()
        b = str (a)
        print(b)
        print('\n\n')

def are_more_tweets(soup):#to check whether there is more than one page on mobile twitter 
    links = soup.findAll('a', {'href': True}, {id: 'more_link'})
    for link in links:
        b = link.renderContents()
        test_b = str(b)
        if test_b.find('more'):
            return True
        else:
            return False

def getnewlink(soup): #to get the link to go to the next page of tweets on twitter 
    links = soup.findAll('a', {'href': True}, {id : 'more_link'})
    for link in links:
        b = link.renderContents()
        if str(b) == 'more':
            c = link['href']
            d = 'http://mobile.twitter.com' +c
            return d

def checkforstamp(soup): # the parser scans a webpage to check if any of the tweets are older than 3 months
    times = soup.findAll('a', {'href': True}, {'class': 'status_link'})
    for time in times:
        stamp = time.renderContents()
        test_stamp = str(stamp)
        if not (test_stamp[0]) in '0123456789':
            continue
        if test_stamp == '3 months ago':
            print test_stamp
            return True
        else:
            return False


response = urllib.urlopen(url)
html = response.read()
soup = BeautifulSoup(html)
gettweets(soup)
stamp = checkforstamp(soup)
tweets = are_more_tweets(soup)
while (not stamp) and (tweets): 
    b = getnewlink(soup)
    print b
    red = urllib.urlopen(b)
    html = red.read()
    soup = BeautifulSoup(html)
    gettweets(soup)
    stamp = checkforstamp(soup)
    tweets = are_more_tweets(soup)
print 'done'

【Comments on the question】:

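As I said on your previous question, what does print test_stamp show when placed before the if statement? Does it ever show 3 months ago or anything similar? Also, since I answered your last question correctly, please accept that answer.
Hi, I apologize. I thought the previous post had become a bit confusing, so I wanted to clarify things in a new question. Your last answer did answer my question. However, it also exposed another flaw in my code, so I thought it would help to create a new post.
Ugh. Not enough room in a comment. Are you suggesting that I move print test_stamp before the if statement to see whether it ever finds something that says 3 months ago? I assume so, so I am testing that now.
OK, so I did as you suggested and found this. Here is an example of what my program is using as the supposed test_stamp: a2.twimg.com/twitter-mobile/…" />
There is nothing wrong with starting a new question; that is what you should do. The answer is below.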

Tags: python html twitter web-crawler

【Solution 1】:

Your soup.findAll() call is picking up an image tag inside a link that matches your pattern (the link has an href attribute and the class status_link).

Instead of always returning on the first link, try:

for time in times:
    stamp = time.renderContents()
    test_stamp = str(stamp)
    print test_stamp
    if not test_stamp[0] in '0123456789':
        continue
    if test_stamp == '3 months ago':  
        return True
    else:
        return False

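This skips any link whose text does not start with a digit, so you might actually find the right link. Leave the print statement in so that you can see whether you hit some other kind of link that starts with a digit and that you also need to filter out.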
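To make the skipping behaviour easier to test without fetching a page, here is a minimal sketch of the same scan written in Python 3 as a plain function over link texts. The name reached_cutoff and its parameters are hypothetical, and the guard against empty text is an addition that the loop above does not have.

```python
def reached_cutoff(link_texts, cutoff="3 months ago"):
    """Mimic the scan above on a list of already-extracted link texts.

    Returns True or False based on the first text that starts with a digit,
    and None when no text starts with a digit.
    """
    for text in link_texts:
        text = text.strip()
        # Assumption: empty link text is skipped too; the original loop
        # would raise IndexError on test_stamp[0] in that case.
        if not text or text[0] not in "0123456789":
            continue
        return text == cutoff
    return None


# An image tag inside a link is skipped; the first numeric stamp decides.
print(reached_cutoff(['<img src="avatar.png" />', "2 hours ago"]))
print(reached_cutoff(['<img src="avatar.png" />', "3 months ago"]))
```

Note that, like the loop it mirrors, this still decides on the first numeric stamp it sees, so a page whose first stamp is newer than the cutoff reports False even if later stamps are older.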

Edit: What you were doing was always returning on the first item in times. I changed it so that it ignores any link that does not start with a digit.

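However, if it finds no link that starts with a digit, this causes it to return None. That would work fine, except that you changed while not stamp and tweets to while stamp is False and tweets is True. Change it back to while not stamp and tweets, and it will correctly treat None and False the same and should work.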
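The difference between the two loop conditions can be checked in isolation. The sketch below, in Python 3, wraps each condition in a small helper (both helper names are hypothetical) and evaluates it for the three values a checker like checkforstamp can return.

```python
def keep_going_truthy(stamp, tweets):
    """Loop test written as `while not stamp and tweets`."""
    return bool((not stamp) and tweets)


def keep_going_identity(stamp, tweets):
    """Loop test written as `while stamp is False and tweets is True`."""
    return (stamp is False) and (tweets is True)


# stamp can be True (cutoff found), False (a newer stamp was found),
# or None (no link text started with a digit).
for stamp in (True, False, None):
    print(stamp, keep_going_truthy(stamp, True), keep_going_identity(stamp, True))
```

The two tests agree for True and False and differ only for None: the truthiness test keeps crawling, while the identity test stops because None is not the object False.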

【Comments】:

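I added this code and found something. Because the timestamps on the tweets change to "about a month ago", I stop receiving tweets at 1 month instead of 3 months. I believe this is due to the (if not test_stamp[0]...) bit. At least I am now receiving more than one page of tweets... However, I am not getting enough data. To give you more background: what I ultimately want to do is, for a single user, collect roughly 3 months of tweets after February (any time) and roughly 3 months of tweets before February 2011. Can you suggest anything?
Hi. When I tested the program with the user NYTimesKrugman, it seemed to work perfectly. However, when I tested it with another Twitter user, cleversallie, I received the following error message:
Traceback (most recent call last): File "C:/Users/Public/Documents/Columbia Job/Python Crawler/Twitter Crawler/testingrecursionofsamefollower1.py", line 60, in <module> red = urllib2.urlopen(b) File "C:\Python28\lib\urllib2.py", line 126, in urlopen return _opener.open(url, data, timeout) File "C:\Python28\lib\urllib2.py", line 385, in open req.timeout = timeout AttributeError: 'NoneType' object has no attribute 'timeout'
Oops, I am new to Stack Overflow and did not realize I was supposed to accept answers... Sorry!
Thanks :). This is a known bug with your version of Python. The easiest thing is probably to try import urllib instead of import urllib2 and use urllib.urlopen. If that does not work, you will need to upgrade or downgrade to a different version of Python.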
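Two points from these comments can be sketched together in Python 3: stamps such as "about a month ago" do not start with a digit, and a page may have no next link at all. One reading of the traceback, offered as an assumption rather than a confirmed diagnosis, is that getnewlink returned None and that None was then passed to urlopen. The sketch replaces the exact string comparison with an approximate age and stops when no next link exists. All names (age_in_months, page_reaches_cutoff, crawl, pages) are hypothetical, the stamp formats are guessed from the strings quoted in this thread, and the network is replaced by an in-memory dictionary.

```python
import re

# Assumed stamp vocabulary; units map to an approximate length in months.
_UNIT_MONTHS = {
    "second": 0.0, "minute": 0.0, "hour": 0.0,
    "day": 1 / 30.0, "week": 7 / 30.0, "month": 1.0, "year": 12.0,
}
_STAMP_RE = re.compile(
    r"^(?:about\s+)?(an?|\d+)\s+(%s)s?\s+ago$" % "|".join(_UNIT_MONTHS)
)


def age_in_months(stamp):
    """Return the approximate age in months, or None if the text is not a stamp."""
    match = _STAMP_RE.match(stamp.strip().lower())
    if match is None:
        return None
    count, unit = match.groups()
    number = 1 if count in ("a", "an") else int(count)
    return number * _UNIT_MONTHS[unit]


def page_reaches_cutoff(stamps, cutoff_months=3):
    """True if any recognisable stamp on the page is at least cutoff_months old."""
    ages = [age_in_months(s) for s in stamps]
    return any(age is not None and age >= cutoff_months for age in ages)


def crawl(pages, start_url, cutoff_months=3):
    """Follow next-page links until the cutoff is reached or no link remains.

    `pages` stands in for the network: url -> (stamps, next_url or None).
    """
    visited = []
    url = start_url
    while url is not None:
        stamps, next_url = pages[url]
        visited.append(url)
        if page_reaches_cutoff(stamps, cutoff_months):
            break
        url = next_url  # None when the page has no "more" link
    return visited


pages = {
    "/p1": (["2 days ago", "about 1 month ago", '<img src="x" />'], "/p2"),
    "/p2": (["2 months ago", "4 months ago"], "/p3"),
    "/p3": (["5 months ago"], None),
}
print(crawl(pages, "/p1"))
```

Comparing ages with >= rather than testing for the exact text "3 months ago" means the loop still stops if a page jumps straight from 2 months to 4 months, and checking url against None before each fetch avoids requesting a page that does not exist.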