【Question Title】: Seeking help to improve a crawler
【Posted】: 2016-10-03 10:33:17
【Question Description】:

I'm a beginner with Scrapy/Python. I've built a crawler that finds expired domains and then checks each one against an SEO API.
My crawler works correctly, but I'm fairly sure it isn't fully optimized.

Could you give me some tips to improve the crawler?

expired.py:

import json
import urllib

import tldextract
from scrapy.linkextractors.lxmlhtml import LxmlLinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from twisted.internet.error import DNSLookupError


class HttpbinSpider(CrawlSpider):
    name = "expired"

    rules = (
        Rule(LxmlLinkExtractor(allow=('.com', '.fr', '.net', '.org', '.info', '.casino', '.eu'),
                               deny=('facebook', 'amazon', 'wordpress', 'blogspot', 'free', 'reddit')),
             callback='parse_obj',
             process_request='add_errback',
             follow=True),
    )

    def __init__(self, domains=None, **kwargs):
        self.start_urls = json.loads(domains)
        super(HttpbinSpider, self).__init__(**kwargs)

    def add_errback(self, request):
        return request.replace(errback=self.errback_httpbin)

    def errback_httpbin(self, failure):
        if failure.check(DNSLookupError):
            request = failure.request
            ext = tldextract.extract(request.url)
            domain = ext.registered_domain
            if domain != '':
                domain = domain.replace("%20", "")
                self.check_domain(domain)

    def check_domain(self, domain):
        if self.is_available(domain) == 'AVAILABLE':

            self.logger.info('## Domain Expired : %s', domain)

            url = 'http://api.majestic.com/api/json?app_api_key=API&cmd=GetIndexItemInfo&items=1&item0=' + domain + '&datasource=fresh'
            response = urllib.urlopen(url)
            data = json.loads(response.read())
            response.close()

            TrustFlow = data['DataTables']['Results']['Data'][0]['TrustFlow']
            CitationFlow = data['DataTables']['Results']['Data'][0]['CitationFlow']
            RefDomains = data['DataTables']['Results']['Data'][0]['RefDomains']
            ExtBackLinks = data['DataTables']['Results']['Data'][0]['ExtBackLinks']

            if (RefDomains > 20) and (TrustFlow > 4) and (CitationFlow > 4):
                insert_table(domain, TrustFlow, CitationFlow, RefDomains, ExtBackLinks)

    def is_available(self, domain):
        url = 'https://api.internet.bs/Domain/Check?ApiKey=KEY&Password=PSWD&responseformat=json&domain=' + domain
        response = urllib.urlopen(url)
        data = json.loads(response.read())
        response.close()
        return data['status']
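
The spider expects its start URLs as a JSON string in the domains argument, so it is presumably launched along these lines (the URL is just a placeholder):

scrapy crawl expired -a domains='["http://www.example.com"]'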

Thank you very much.

【Question Comments】:

    Tags: python python-2.7 web-scraping scrapy


    【Solution 1】:

    The biggest problem in your code is that the urllib requests block the entire asynchronous scrapy routine. You can easily replace them with a chain of scrapy requests by yielding scrapy.Request objects.

    Something like this:

    def errback_httpbin(self, failure):
        if not failure.check(DNSLookupError):
            return
        request = failure.request
        ext = tldextract.extract(request.url)
        domain = ext.registered_domain
        if domain == '':
            logging.debug('no domain: {}'.format(request.url))
            return
        domain = domain.replace("%20", "")
        url = 'https://api.internet.bs/Domain/Check?ApiKey=KEY&Password=PSWD&responseformat=json&domain=' + domain
        return Request(url, self.parse_checkdomain)
    
    def parse_checkdomain(self, response):
        """check whether domain is available"""
        data = json.loads(response.text)
        if data['status'] == 'AVAILABLE':
            self.logger.info('Domain Expired : {}'.format(data['domain']))
            url = 'http://api.majestic.com/api/json?app_api_key=API&cmd=GetIndexItemInfo&items=1&item0=' + data['domain']+ '&datasource=fresh'
            # pass the domain along in the request meta so parse_claim can use it
            return Request(url, self.parse_claim, meta={'domain': data['domain']})
    
    def parse_claim(self, response):
        """save available domain's details"""
        data = json.loads(response.text)
        # recover the domain carried over from parse_checkdomain
        domain = response.meta['domain']
        # eliminate redundancy
        results = data['DataTables']['Results']['Data'][0]
        # snake case is more pythonic
        trust_flow = results['TrustFlow']
        citation_flow = results['CitationFlow']
        ref_domains = results['RefDomains']
        ext_back_links = results['ExtBackLinks']
    
        # don't need to wrap everything in ()
        if ref_domains > 20 and trust_flow > 4 and citation_flow > 4:
            insert_table(domain, trust_flow, citation_flow, ref_domains, ext_back_links)
    

    This way your code is not blocked and is fully asynchronous. In general, when doing HTTP work in your scrapy spider, you don't want to use anything other than scrapy requests.

    【Comments】:

    • Thank you very much for your help and for improving the code. I'll give it a try!
    • I'm using a BloomFilter, and I'm getting a lot of errors in the log: raise IndexError("BloomFilter is at capacity"). Do you know why?
    • @Pixel Sorry, I'm not too familiar with it. AFAIK, according to the pybloom wiki, it should open a new, larger instance. You could try specifying the capacity when creating the filter (see the sketch after these comments). I don't think this is related to scrapy, so you may want to open a new question for it :)
    • @Pixel Sorry, I'm not on Skype, but if you have any questions you can find me as tinarg on irc.freenode.org in #python
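
    Regarding the "BloomFilter is at capacity" error mentioned above: a minimal sketch, assuming the pybloom package is what raises it (the capacity and error-rate numbers here are placeholders), of either sizing the fixed filter generously up front or switching to a ScalableBloomFilter that grows instead of raising:

    from pybloom import BloomFilter, ScalableBloomFilter

    # A fixed-size filter raises IndexError("BloomFilter is at capacity")
    # once more than `capacity` items have been added, so size it generously.
    seen = BloomFilter(capacity=1000000, error_rate=0.001)

    # Alternatively, a scalable filter chains new, larger filters as it
    # fills up instead of raising.
    seen = ScalableBloomFilter(initial_capacity=1000, error_rate=0.001,
                               mode=ScalableBloomFilter.SMALL_SET_GROWTH)

    url = 'http://www.example.com'
    if url not in seen:
        seen.add(url)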