【Question Title】: Optimizing python script with multithreading [closed]
【Posted】: 2012-05-23 14:50:22
【Question】:

Hi everyone! I have written a small web crawler function, but I am new to multithreading and cannot optimize it. My code is:

alreadySeenURLs = dict() #the dictionary of already seen URLs
candidates = set() #the set of URL candidates to crawl

def initializeCandidates(url):

    #gets page with urllib2
    page = getPage(url)

    #parses page with BeautifulSoup
    parsedPage = getParsedPage(page)

    #function which return all links from parsed page as set
    initialURLsFromRoot = getLinksFromParsedPage(parsedPage)

    return initialURLsFromRoot 

def updateCandidates(oldCandidates, newCandidates):
    return oldCandidates.union(newCandidates)

candidates = initializeCandidates(rootURL)

for url in candidates:

    print len(candidates)

    #fingerprint of URL
    fp = hashlib.sha1(url).hexdigest()

    #checking whether url is in alreadySeenURLs
    if fp in alreadySeenURLs:
        continue

    alreadySeenURLs[fp] = url

    #do some processing
    print url

    page = getPage(url)
    parsedPage = getParsedPage(page, fix=True)
    newCandidates = getLinksFromParsedPage(parsedPage)

    candidates = updateCandidates(candidates, newCandidates)

As you can see, it fetches one URL from the candidates at a time. I would like to make this script multithreaded, in such a way that it can fetch at least N URLs from the candidates concurrently and do the work. Can anyone guide me? Any links or suggestions?

【Comments】:

Tags: python multithreading web-crawler web-scraping python-multithreading


【Solution 1】:

You can start with these two links:

  1. The basic reference for threading in Python: http://docs.python.org/library/threading.html

  2. A tutorial that actually implements a multithreaded URL crawler in Python: http://www.ibm.com/developerworks/aix/library/au-threadingpython/

Also, a ready-made Python crawler already exists: http://scrapy.org/
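The worker-pool pattern from those references can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the asker's code: `get_links` and the `LINK_GRAPH` are hypothetical stand-ins for the question's `getPage`/`getParsedPage`/`getLinksFromParsedPage` helpers, and it uses Python 3 naming (the 2012 original is Python 2, where the `queue` module is called `Queue`). N worker threads pull URLs from a shared queue, a lock guards the seen-fingerprint dictionary, and newly discovered links are pushed back onto the queue:

```python
import hashlib
import threading
from queue import Queue

# Hypothetical in-memory link graph standing in for the question's
# getPage/getLinksFromParsedPage helpers (no real network access here).
LINK_GRAPH = {
    "http://example.com/":  ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/c"],
    "http://example.com/c": [],
}

def get_links(url):
    return LINK_GRAPH.get(url, [])

def crawl(root_url, num_workers=4):
    seen = {}                        # fingerprint -> url, shared by all workers
    seen_lock = threading.Lock()     # guards reads/writes of `seen`
    work = Queue()
    work.put(root_url)

    def worker():
        while True:
            url = work.get()
            try:
                fp = hashlib.sha1(url.encode("utf-8")).hexdigest()
                with seen_lock:
                    if fp in seen:   # already crawled by some worker
                        continue
                    seen[fp] = url
                for link in get_links(url):
                    work.put(link)   # enqueue newly discovered candidates
            finally:
                work.task_done()     # always mark the item finished

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()

    work.join()                      # block until every queued URL is processed
    return sorted(seen.values())
```

The `Queue` handles the locking for handing out URLs, so only the shared `seen` dictionary needs an explicit lock; the daemon flag lets the program exit even though idle workers are still blocked on `work.get()`.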

【Discussion】:
