【Question Title】: Gevent link crawler
【Posted】: 2013-10-06 14:37:42
【Description】:

Here is code I wrote with Python and Beautiful Soup that parses all the links on a page into a link repository. Next, it fetches the contents of any URL from the repository it just built, parses the links from that new content into the repository, and continues this process for all the links in the repository until it stops or a given number of links has been fetched.

But this code is very slow. How can I improve it by using asynchronous programming with gevent in Python?


Code

import itertools
import random
import urllib2

import BeautifulSoup


class Crawler(object):

    def __init__(self):
        self.soup = None                                # Beautiful Soup object
        self.current_page = "http://www.python.org/"    # Current page's address
        self.links = set()                              # Every link fetched so far
        self.visited_links = set()

        self.counter = 0  # Simple counter for debugging

    def open(self):
        # Open the URL
        print self.counter, ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every link (guard against <a> tags without an href)
        self.soup = BeautifulSoup.BeautifulSoup(html_code)

        page_links = itertools.ifilter(  # Only deal with absolute links
            lambda href: href and 'http://' in href,
            (a.get('href') for a in self.soup.findAll('a')))

        # Update links
        self.links = self.links.union(set(page_links))

        # Choose a random URL from the non-visited set, if any remain
        unvisited = self.links.difference(self.visited_links)
        if unvisited:
            self.current_page = random.sample(unvisited, 1)[0]
        self.counter += 1

    def run(self):
        # Crawl 3 web pages (or stop once every fetched URL has been visited)
        while len(self.visited_links) < 3:
            self.open()
            if not self.links.difference(self.visited_links):
                break

        for link in self.links:
            print link


if __name__ == '__main__':
    C = Crawler()
    C.run()

Update 1


import itertools
import random
import sys
import urllib2
import urlparse

import gevent
from gevent import monkey; monkey.patch_all(thread=False)

from bs4 import BeautifulSoup


class Crawler(object):

    def __init__(self):
        self.soup = None                                # Beautiful Soup object
        self.current_page = "http://www.python.org/"    # Current page's address
        self.links = set()                              # Every link fetched so far
        self.visited_links = set()

        self.counter = 0  # Simple counter for debugging

    def open(self):
        # Open the URL
        print self.counter, ":", self.current_page
        res = urllib2.urlopen(self.current_page)
        html_code = res.read()
        self.visited_links.add(self.current_page)

        # Fetch every link
        self.soup = BeautifulSoup(html_code)

        page_links = []
        try:
            for link in [h.get('href') for h in self.soup.find_all('a')]:
                print "Found link: '" + link + "'"
                if link.startswith('http'):
                    page_links.append(link)
                    print "Adding link " + link + "\n"
                elif link.startswith('/'):
                    parts = urlparse.urlparse(self.current_page)
                    page_links.append(parts.scheme + '://' + parts.netloc + link)
                    print "Adding link " + parts.scheme + '://' + parts.netloc + link + "\n"
                else:
                    page_links.append(self.current_page + link)
                    print "Adding link " + self.current_page + link + "\n"

        except Exception as ex:  # Magnificent exception handling
            print ex

        # Update links
        self.links = self.links.union(set(page_links))

        # Choose a random URL from the non-visited set
        self.current_page = random.sample(self.links.difference(self.visited_links), 1)[0]
        self.counter += 1

    def run(self):
        # Crawl 3 web pages concurrently
        crawling_greenlets = []

        for i in range(3):
            crawling_greenlets.append(gevent.spawn(self.open))

        gevent.joinall(crawling_greenlets)

        # while len(self.visited_links) < 4 or (self.visited_links == self.links):
        #     self.open()

        for link in self.links:
            print link


if __name__ == '__main__':
    C = Crawler()
    C.run()
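As an aside, the three link-rewriting branches in `open` above can be collapsed into one call: `urlparse.urljoin` (moved to `urllib.parse.urljoin` in Python 3) already resolves absolute links, root-relative paths, and page-relative paths against a base URL. A minimal sketch, using the Python 3 module name:

```python
from urllib.parse import urljoin  # `from urlparse import urljoin` on Python 2

base = "http://www.python.org/download/"

# Absolute URLs pass through unchanged
print(urljoin(base, "http://docs.python.org/"))  # -> http://docs.python.org/
# Root-relative paths are resolved against scheme + netloc
print(urljoin(base, "/about/"))                  # -> http://www.python.org/about/
# Page-relative paths are resolved against the base's directory
print(urljoin(base, "releases/"))                # -> http://www.python.org/download/releases/
```

This also avoids the bug in the `else` branch above, where `self.current_page + link` produces a broken URL whenever the current page does not end in `/`.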

【Discussion】:

    Tags: python asynchronous gevent


    【Solution 1】:

    Import gevent and make sure the monkey patching is done, so that standard-library calls become non-blocking and aware of gevent:

    import gevent
    from gevent import monkey; monkey.patch_all()
    

    (You can selectively decide what to monkey-patch, but let's assume that is not your problem right now.)

    In your run, make your open function get called inside a greenlet. run can return the greenlet object, so you can wait for it whenever you need its result, for example with gevent.joinall. Something like this:

    def run(self):
        return gevent.spawn(self.open)
    
    c1 = Crawler()
    c2 = Crawler()
    c3 = Crawler()
    crawling_tasks = [c.run() for c in (c1,c2,c3)]
    gevent.joinall(crawling_tasks)
    
    print [c.links for c in (c1, c2, c3)]
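To see why this speeds things up, here is a network-free sketch of the same spawn/joinall pattern (a hypothetical `fake_open` using `gevent.sleep` stands in for the blocking `urlopen` call; assumes gevent is installed):

```python
import time
import gevent

def fake_open(name, delay):
    # Stand-in for Crawler.open: gevent.sleep yields to other greenlets,
    # just as a monkey-patched socket read would
    gevent.sleep(delay)
    return name

start = time.time()
tasks = [gevent.spawn(fake_open, n, 0.1) for n in ("c1", "c2", "c3")]
gevent.joinall(tasks)
elapsed = time.time() - start

print([t.value for t in tasks])  # -> ['c1', 'c2', 'c3']
print(elapsed < 0.25)            # -> True: roughly 0.1 s total, not 0.3 s
```

The three greenlets overlap their waits instead of running back to back, which is exactly what happens to the three `urlopen` calls once the standard library is monkey-patched.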
    

    【Comments】:

    • Getting this error: Exception KeyError: KeyError(15886544,) in ... ignored
    • Try not patching the thread module: monkey.patch_all(thread=False); apart from that error (which, by the way, you can probably ignore), does it work as expected?
    • Because you start the same open function 3 times, when execution reaches the print (the first thing it does) in the first greenlet you see the '0: python.org' output, and since greenlets execute asynchronously, execution passes to the next one inside the urllib code, and because it is also just starting you see exactly the same message, and then the same thing happens with the third greenlet.
    • Is this normal behavior, or do I have to make some changes to the code?
    • This is normal behavior. However, I doubt you want to do the same thing 3 times? Why do it 3 times? Just do it once... You can return the greenlet object in run and then put the joinall code somewhere else, wherever you want to wait for all the crawling tasks to finish.
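The duplicated work described in the comments can be avoided by having all greenlets share one visited set and check it before fetching: greenlets only switch at blocking calls, so the check-then-add sequence cannot be interrupted and needs no lock. A minimal sketch (assumes gevent is installed; `fetch_once` and the URLs are illustrative):

```python
import gevent

visited = set()

def fetch_once(url):
    # The membership test and the add happen with no blocking call in
    # between, so no other greenlet can run here: no lock required.
    if url in visited:
        return
    visited.add(url)
    gevent.sleep(0.01)  # stand-in for the real download/parse step

urls = ["http://a/", "http://b/", "http://a/"]  # note the duplicate
gevent.joinall([gevent.spawn(fetch_once, u) for u in urls])
print(sorted(visited))  # -> ['http://a/', 'http://b/']
```

Each URL is fetched exactly once even though it appears twice in the frontier.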