Scrapy, how to limit time per domain? Answers

【Question title】: Scrapy, how to limit time per domain?
【Posted】: 2016-10-08 13:06:41
【Question description】:

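I have been searching for an answer, but although this has been asked several times on this forum, none of those questions has an answer. One answer says a spider can be stopped after a certain time, but that does not suit me, because I usually launch 10 websites per spider. So my challenge is that I have a spider with 10 websites and I want to limit the time per domain to 20 seconds, to avoid getting stuck in some web shop. How can I do that?

More generally, I crawl 2000 company websites. To finish within one day, I split the websites into 200 groups of 10 and launch 200 spiders in parallel. That may be amateurish, but it is the best I know. The computer almost froze because the spiders consumed all the CPU and memory, but the next day I had the results. What I am looking for are the employment pages on company websites. Does anyone have a better idea of how to crawl 2000 websites? If a website contains a web shop, the crawl can take days, which is why I want to limit the time per domain.

Thank you in advance.

Marko

My code: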

#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8  
import scrapy, urlparse, time, sys
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urlparse, urljoin
from vacancies.items import JobItem

#We need that in order to force Slovenian pages instead of English pages. It happened at "http://www.g-gmi.si/gmiweb/" that only English pages were found and no Slovenian.
from scrapy.conf import settings
settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
#Settings.set(name, value, priority='cmdline')
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl','en':q=0.8,}




#start_time = time.time()
# We run the programme in the command line with this command: 

#      scrapy crawl jobs -o urls.csv -t csv --logfile log.txt


# We get two output files
#  1) urls.csv
#  2) log.txt

# Url whitelist.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/url_whitelist.txt", "r+") as kw:
    url_whitelist = kw.read().replace('\n', '').split(",")
url_whitelist = map(str.strip, url_whitelist)

# Tab whitelist.
# We need to replace character the same way as in detector.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/tab_whitelist.txt", "r+") as kw:
    tab_whitelist = kw.read().decode(sys.stdin.encoding).encode('utf-8')
tab_whitelist = tab_whitelist.replace('Ŕ', 'č')
tab_whitelist = tab_whitelist.replace('L', 'č')
tab_whitelist = tab_whitelist.replace('Ő', 'š')
tab_whitelist = tab_whitelist.replace('Ü', 'š')
tab_whitelist = tab_whitelist.replace('Ä', 'ž')
tab_whitelist = tab_whitelist.replace('×', 'ž')
tab_whitelist = tab_whitelist.replace('\n', '').split(",")
tab_whitelist = map(str.strip, tab_whitelist)



# Look for occupations in url.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/occupations_url.txt", "r+") as occ_url:
    occupations_url = occ_url.read().replace('\n', '').split(",")
occupations_url = map(str.strip, occupations_url)

# Look for occupations in tab.
# We need to replace character the same way as in detector.
with open("Q:/Big_Data/Spletne_strani_podjetij/strganje/kljucne_besede/occupations_tab.txt", "r+") as occ_tab:
    occupations_tab = occ_tab.read().decode(sys.stdin.encoding).encode('utf-8')
occupations_tab = occupations_tab.replace('Ŕ', 'č')
occupations_tab = occupations_tab.replace('L', 'č')
occupations_tab = occupations_tab.replace('Ő', 'š')
occupations_tab = occupations_tab.replace('Ü', 'š')
occupations_tab = occupations_tab.replace('Ä', 'ž')
occupations_tab = occupations_tab.replace('×', 'ž')
occupations_tab = occupations_tab.replace('\n', '').split(",")
occupations_tab = map(str.strip, occupations_tab)

#Join url whitelist and occupations.
url_whitelist_occupations = url_whitelist + occupations_url

#Join tab whitelist and occupations.
tab_whitelist_occupations = tab_whitelist + occupations_tab


#base = open("G:/myVE/vacancies/bazni.txt", "w")
#non_base = open("G:/myVE/vacancies/ne_bazni.txt", "w")


class JobSpider(scrapy.Spider):

    #Name of spider
    name = "jobs"

    #start_urls = open("Q:\Big_Data\Utrip\spletne_strani.txt", "r+").readlines()[0]
    #print urls
    #start_urls = map(str.strip, urls)
    #Start urls
    start_urls = ["http://www.alius.si"]
    print "\nSpletna stran         ", start_urls, "\n"

    #Result of the programme is this list of job vacancies webpages.
    jobs_urls = []


    def parse(self, response):

        #Theoretically I could save the HTML of webpage to be able to check later and see how it looked like
        # at the time of downloading. That is important for validation, because it is easier to look at nice HTML webpage instead of naked text.
        # but I have to write a pipeline http://doc.scrapy.org/en/0.20/topics/item-pipeline.html

        response.selector.remove_namespaces()
        #print "response url" , str(response.url)

        #Take url of response, because we would like to stay on the same domain.
        parsed = urlparse(response.url)

        #Base url.        
        #base_url = get_base_url(response).strip()
        base_url = parsed.scheme+'://'+parsed.netloc
        #print "base url" , str(base_url)
        #If the urls grows from seeds, it's ok, otherwise not.
        if base_url in self.start_urls:
            #print "base url je v start"
            #base.write(response.url+"\n")



            #net1 = parsed.netloc

            #Take all urls, they are marked by "href" or "data-link". These are either webpages on our website or new websites.
            urls_href = response.xpath('//@href').extract()    
            urls_datalink = response.xpath('//@data-link').extract()
            urls = urls_href + urls_datalink
            #print "povezave na tej strani ", urls




            #Loop through all urls on the webpage.
            for url in urls:

                #Test all new urls. DOES NOT WORK

                #print "url ", str(url)

                #If url doesn't start with "http", it is relative url, and we add base url to get absolute url.       
                if not (url.startswith("http")):

                    #Join the partial url with the base url.
                    url = urljoin(base_url,url).strip()

                #print "new url ", str(url)

                new_parsed = urlparse(url)
                new_base_url = new_parsed.scheme+'://'+new_parsed.netloc
                #print "new base url ", str(new_base_url)

                if new_base_url in self.start_urls:
                    #print "yes"

                    url = url.replace("\r", "")
                    url = url.replace("\n", "")
                    url = url.replace("\t", "")
                    url = url.strip()

                    #Remove anchors '#', that point to a section on the same webpage, because this is the same webpage.
                    #But we keep question marks '?', which mean, that different content is pulled from database.
                    if '#' in url:
                        index = url.find('#')   
                        url = url[:index]
                        if url in self.jobs_urls:
                            continue




                    #Ignore ftp and sftp.
                    if url.startswith("ftp") or url.startswith("sftp"):

                        continue





                    #Compare each url on the webpage with original url, so that spider doesn't wander away on the net.
                    #net2 = urlparse(url).netloc
                    #test.write("lokacija novega url "+ str(net2)+"\n")

                    #if net2 != net1:
                    #    continue
                        #test.write("ni ista lokacija, nadaljujemo\n")

                    #If the last character is slash /, I remove it to avoid duplicates.
                    if url[len(url)-1] == '/':           
                        url = url[:(len(url)-1)]


                    #If url includes characters like %, ~ ... it is LIKELY NOT to be the one I am looking for and I ignore it. 
                    #However in this case I exclude good urls like http://www.mdm.si/company#employment
                    if any(x in url for x in ['%', '~',

                        #images
                        '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
                        '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',

                        #documents
                        '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd', 
                        '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD', 

                        #music and video
                        '.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
                        '.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',

                        #compression and other
                        #(the trailing commas matter: without them adjacent strings are silently concatenated)
                        '.zip', '.rar', '.css', '.flv', '.xml',
                        '.ZIP', '.RAR', '.CSS', '.FLV', '.XML',

                        #Twitter, Facebook, Youtube
                        '://twitter.com', '://mobile.twitter.com', 'www.twitter.com', 
                        'www.facebook.com', 'www.youtube.com',

                        #Feeds, RSS, archive
                        '/feed', '=feed', '&feed', 'rss.xml', 'arhiv'


                                ]):

                        continue


                    #We need to save original url for xpath, in case we change it later (join it with base_url)
                    #url_xpath = url                    


                    #We don't want to go to other websites. We want to stay on our website, so we keep only urls with domain (netloc) of the company we are investigating.         
                    #if (urlparse(url).netloc == urlparse(base_url).netloc):



                    #The main part. We look for webpages, whose urls include one of the employment words as strings.
                    #We will check the tab of the url as well. This is additional filter, suggested by Dan Wu, to improve accuracy. 
                    #tabs = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()
                    tabs = response.xpath('//a[@href="%s"]/text()' % url).extract()

                    # Sometimes tabs can be just empty spaces like '\t' and '\n' so in this case we replace it with [].
                    # That was the case when the spider didn't find this employment url: http://www.terme-krka.com/si/sl/o-termah-krka/o-podjetju-in-skupini-krka/zaposlitev/
                    tabs = [tab.encode('utf-8') for tab in tabs]
                    tabs = [tab.replace('\t', '') for tab in tabs]
                    tabs = [tab.replace('\n', '') for tab in tabs]
                    tab_empty = True
                    for tab in tabs:
                        if tab != '':
                            tab_empty = False
                    if tab_empty == True:
                        tabs = []


                    # -- Instruction. 
                    # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
                    # Starting keyword_url is zero, then we add keywords as we find them in url. 
                    keyword_url = ''
                    #for keyword in url_whitelist:
                    for keyword in url_whitelist_occupations:

                        if keyword in url:
                            keyword_url = keyword_url + keyword + ' '
                    # a) If we find at least one keyword in url, we continue.
                    if keyword_url != '':                

                        #1. Tabs are empty.
                        if tabs == []:



                            #We found url that includes one of the magic words and also the text includes a magic word. 
                            #We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
                            if url not in self.jobs_urls :


                                self.jobs_urls.append(url)
                                item = JobItem()
                                item["url"] = url
                                #item["keyword_url"] = keyword_url
                                #item["keyword_url_tab"] = ' '
                                #item["keyword_tab"] = ' '
                                print "Zaposlitvena podstran ", url

                                #We return the item.
                                yield item



                        #2. There are texts in tabs, one or more.
                        else:

                            #For the same partial url several texts are possible.
                            for tab in tabs:                            

                                #We search for keywords in tabs.
                                keyword_url_tab = ''
                                #for key in tab_whitelist:
                                for key in tab_whitelist_occupations:

                                    if key in tab:
                                        keyword_url_tab = keyword_url_tab + key + ' '

                                # If we find some keywords in tabs, then we have found keywords in both url and tab and we can save the url.
                                if keyword_url_tab != '':

                                    # keyword_url_tab starts with keyword_url from before, because we want to remember keywords from both url and tab. So we add initial keyword_url.
                                    keyword_url_tab = 'URL ' + keyword_url + ' TAB ' + keyword_url_tab

                                    #We found url that includes one of the magic words and also the tab includes a magic word. 
                                    #We check url, if we have found it before. If it is new, we add it to the list "jobs_urls".
                                    if url not in self.jobs_urls:                             

                                        self.jobs_urls.append(url)
                                        item = JobItem()
                                        item["url"] = url
                                        #item["keyword_url"] = ' '
                                        #item["keyword_url_tab"] = keyword_url_tab
                                        #item["keyword_tab"] = ' '
                                        print "Zaposlitvena podstran ", url

                                        #We return the item.
                                        yield item

                                #We haven't found any keywords in tabs, but url is still good, because it contains some keywords, so we save it.
                                else:

                                    if url not in self.jobs_urls:                             

                                        self.jobs_urls.append(url)
                                        item = JobItem()
                                        item["url"] = url
                                        #item["keyword_url"] = keyword_url
                                        #item["keyword_url_tab"] = ' '
                                        #item["keyword_tab"] = ' '
                                        print "Zaposlitvena podstran ", url

                                        #We return the item.
                                        yield item                            

                    # b) If keyword_url = empty, there are no keywords in url, but perhaps there are keywords in tabs. So we check tabs.
                    else:
                        for tab in tabs:


                            keyword_tab = ''
                            #for key in tab_whitelist:
                            for key in tab_whitelist_occupations:


                                if key in tab:
                                    keyword_tab = keyword_tab + key + ' '
                            if keyword_tab != '':                           

                                if url not in self.jobs_urls:                             

                                    self.jobs_urls.append(url)
                                    item = JobItem()
                                    item["url"] = url
                                    #item["keyword_url"] = ' '
                                    #item["keyword_url_tab"] = ' '
                                    #item["keyword_tab"] = keyword_tab
                                    print "Zaposlitvena podstran ", url

                                    #We return the item.
                                    yield item                  

                    #We don't put "else" sentence because we want to further explore the employment webpage to find possible new employment webpages.
                    #We keep looking for employment webpages, until we reach the DEPTH set in settings.py. 
                    yield Request(url, callback = self.parse)

            #else:
                #non_base.write(response.url+"\n")

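【Question discussion】:

How is stopping the whole spider at a specific "time" different from stopping it per domain?
@eLRuLL The difference is that the spider crawls ten websites, and if I stop it after 200 seconds, I cannot be sure that each website got 20 seconds. One website might consume all the time while the others are left behind. And I don't know how the processes and the machinery inside handle the requests.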
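
To make the per-domain idea in the comment above concrete, here is a minimal sketch that is not from the thread. It assumes a hypothetical helper class, `DomainBudget`, that remembers when each domain was first requested and reports whether its time budget (20 seconds here) is used up. The class name, the `allow` method and the injectable `clock` argument are illustrative assumptions, not part of Scrapy.

```python
import time

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2, as in the question's code


class DomainBudget(object):
    """Hypothetical helper: wall-clock time budget per domain."""

    def __init__(self, seconds=20.0, clock=time.time):
        self.seconds = seconds
        self.clock = clock  # injectable so the logic can be tested
        self.first_seen = {}  # netloc -> time of the first request

    def allow(self, url):
        """Return True while the URL's domain is still within its budget."""
        now = self.clock()
        domain = urlparse(url).netloc
        # The budget starts counting at the first request for a domain.
        started = self.first_seen.setdefault(domain, now)
        return now - started < self.seconds
```

One possible place to call `allow` would be a downloader middleware's `process_request`, raising `scrapy.exceptions.IgnoreRequest` once it returns False. Note that this limits wall-clock time since the first request to a domain; it does not guarantee that each of the ten sites gets an equal share of the downloads during that time.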

Tags: python time scrapy

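【Solution 1】:

Just use scrapyd to schedule 2000 single-website crawls. Set max_proc = 10 [1] to run 10 spiders in parallel. Set the spider's CLOSESPIDER_TIMEOUT [2] to 20 so that each spider runs for 20 seconds. Stop using Windows natively, because it's a pain. I have observed that Scrapy and scrapyd run faster inside a VM than natively on Windows. I might be wrong, so try to cross-check this yourself, but I have a strong feeling that it will be faster if you use an Ubuntu 14.04 VirtualBox image on Windows. Your crawl will take about 2000 * 20 / 10 = 4000 seconds, roughly 67 minutes.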
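
A rough sketch of what the scheduling could look like, assuming scrapyd is running on its default port 6800 with `max_proc = 10` in the `[scrapyd]` section of `scrapyd.conf`. It posts one job per website to scrapyd's `schedule.json` endpoint and passes `CLOSESPIDER_TIMEOUT=20` through the `setting` field. The project name `vacancies` is guessed from the question's import, and `start_url` is an assumed spider argument: the spider in the question hard-codes `start_urls`, so it would need to be changed to read this argument.

```python
try:
    from urllib.parse import urlencode  # Python 3
    from urllib.request import urlopen
except ImportError:
    from urllib import urlencode  # Python 2
    from urllib2 import urlopen

SCRAPYD_URL = "http://localhost:6800/schedule.json"  # scrapyd default port


def build_job(site, project="vacancies", spider="jobs", timeout=20):
    """Build the form fields for one schedule.json call."""
    return [
        ("project", project),  # assumption: deployed project name
        ("spider", spider),
        ("setting", "CLOSESPIDER_TIMEOUT=%d" % timeout),
        ("start_url", site),  # assumption: spider accepts this argument
    ]


def schedule(site):
    """POST one job to scrapyd and return the raw response body."""
    data = urlencode(build_job(site)).encode("ascii")
    return urlopen(SCRAPYD_URL, data).read()


def estimated_minutes(sites=2000, seconds_each=20, max_proc=10):
    """Rough wall-clock estimate if every spider uses its full timeout."""
    return sites * seconds_each / float(max_proc) / 60.0
```

Calling `schedule(site)` in a loop over the 2000 websites queues all jobs at once, and scrapyd then works through them ten at a time. With the numbers from the answer, `estimated_minutes()` gives 2000 * 20 / 10 = 4000 seconds, a little under 67 minutes, ignoring spider start-up overhead.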

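【Discussion】:

Thanks @neverlastn, it looks good. I hope I can manage it, since my skills are not that advanced. I just develop some applications at work, and I don't enjoy it much. If only it ran a bit faster. This is a project I started, and now I'm somehow stuck with it.
@Marko - You're welcome! Try it for a while and let me know if it doesn't work. I'm pretty sure this solution works :)
Yes, I work at the national statistical office, and we have a connection with the Faculty of Computer Science, which is helping us use Orange, a Python-based machine learning tool. Since they are Python programmers, I might ask them for help :)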