【Posted】: 2019-09-05 15:35:13
【Problem Description】:
I have a fairly simple spider that loads URLs from a file (which works) and is then supposed to start crawling and archiving the HTML responses.
It used to work fine; then, a few days ago, it stopped, and I can't figure out what I changed. Now the spider only fetches the first page of each URL and then stops:
'finish_reason': 'finished',
The spider:
import logging
from urllib.parse import urlparse

from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Sourcemanagement and write_html_file are the author's own helpers (not shown)


class TesterSpider(CrawlSpider):
    name = 'tester'
    allowed_domains = []  # filled dynamically in start_requests()

    rules = (
        # follow everything except payment/cookie/login/privacy/registration/contact pages
        Rule(LinkExtractor(allow=(),
                           deny=(r'.*Zahlung.*', r'.*Cookies.*', r'.*Login.*',
                                 r'.*Datenschutz.*', r'.*Registrieren.*',
                                 r'.*Kontaktformular.*')),
             callback='parse_item'),
    )

    def __init__(self, *a, **kw):
        # note: this skips CrawlSpider.__init__ (and with it its _compile_rules()
        # call), which is why start_requests() compiles the rules by hand
        super(CrawlSpider, self).__init__(*a, **kw)

    def start_requests(self):
        logging.log(logging.INFO, "======== Starting with start_requests")
        self._compile_rules()
        smgt = Sourcemanagement()
        rootdir = smgt.get_root_dir()
        file_list = smgt.list_all_files(rootdir + "/sources")
        links = smgt.get_all_domains()
        links = list(set(links))  # deduplicate
        request_list = []
        for link in links:
            o = urlparse(link)
            result = '{uri.netloc}'.format(uri=o)
            self.allowed_domains.append(result)
            request_list.append(Request(url=link, callback=self.parse_item))
        return request_list

    def parse_item(self, response):
        item = {}
        self.write_html_file(response)  # archive the raw HTML
        return item
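For anyone trying to run the snippet locally: Sourcemanagement and write_html_file are the author's own helpers and appear nowhere in the question. A minimal hypothetical stand-in (every name and body below is assumed, not the author's actual code) might look like:

import os

class Sourcemanagement:
    # hypothetical stub, only so the spider above can be exercised locally
    def get_root_dir(self):
        return os.getcwd()

    def list_all_files(self, path):
        # the real helper presumably scans the source files under <root>/sources
        return [os.path.join(path, f) for f in os.listdir(path)] if os.path.isdir(path) else []

    def get_all_domains(self):
        # the real helper presumably reads the start URLs out of those files
        return ['https://example.com/']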
And the settings:
BOT_NAME = 'crawlerscrapy'
SPIDER_MODULES = ['crawlerscrapy.spiders']
NEWSPIDER_MODULE = 'crawlerscrapy.spiders'
USER_AGENT_LIST = "useragents.txt"
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 150
DOWNLOAD_DELAY = 43
CONCURRENT_REQUESTS_PER_DOMAIN = 1
COOKIES_ENABLED = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Accept-Language': 'de',
}
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'random_useragent.RandomUserAgentMiddleware': 400,
}
AUTOTHROTTLE_ENABLED = False
SCHEDULER_PRIORITY_QUEUE = 'scrapy.pqueues.DownloaderAwarePriorityQueue'
REACTOR_THREADPOOL_MAXSIZE = 20
LOG_LEVEL = 'DEBUG'
DEPTH_LIMIT = 0
DOWNLOAD_TIMEOUT = 15
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'
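As an aside, DOWNLOAD_DELAY = 43 together with CONCURRENT_REQUESTS_PER_DOMAIN = 1 means each domain is fetched at most roughly once every 43 seconds, so even a healthy crawl will look very slow. A quick way to rule out stale or overridden settings is to log what is actually in effect at runtime; a minimal sketch, meant to live inside TesterSpider (the method name log_effective_settings is made up, but self.settings and its typed getters are Scrapy's standard Settings API):

    def log_effective_settings(self):
        # log the values Scrapy is really using, after all overrides
        self.logger.info("DOWNLOAD_DELAY=%s", self.settings.getfloat('DOWNLOAD_DELAY'))
        self.logger.info("CONCURRENT_REQUESTS_PER_DOMAIN=%s",
                         self.settings.getint('CONCURRENT_REQUESTS_PER_DOMAIN'))
        self.logger.info("ROBOTSTXT_OBEY=%s", self.settings.getbool('ROBOTSTXT_OBEY'))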
Any idea what I'm doing wrong?
EDIT:
I found the answer:
request_list.append(Request(url=link, callback=self.parse_item))
# to be replaced by:
request_list.append(Request(url=link, callback=self.parse))
But I don't really understand why.
https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.parse
So I can return an empty dict from parse_item, but I shouldn't, because it breaks the flow of things?
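For what it's worth, this behaviour matches the CrawlSpider documentation linked above, which warns against using parse as a Rule callback precisely because CrawlSpider uses the parse method itself to implement its crawling logic. Roughly, this is what happens inside CrawlSpider (a paraphrased sketch of the 1.x-era source, not verbatim; the internals vary between versions):

# paraphrased sketch of Scrapy's CrawlSpider dispatch, not the exact source
class CrawlSpider(Spider):

    def parse(self, response):
        # every response routed to parse() goes through the rule machinery:
        # the Rules' LinkExtractors pull new links out of the page and
        # schedule them, and matching responses are handed to the Rule
        # callbacks (in this question, 'parse_item')
        return self._parse_response(response, self.parse_start_url,
                                    cb_kwargs={}, follow=True)

So with callback=self.parse_item, the start responses bypass the rule machinery entirely: parse_item archives the HTML and returns an item, but no links are ever extracted, which is exactly the "one page per URL, then 'finished'" behaviour. Returning an empty dict from parse_item is harmless; the flow broke because the rules were never applied to the start responses.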
【Discussion】:
- You can debug this with ipdb: drop ipdb.set_trace() wherever needed and run the project.
- Thanks. Fortunately I found the cause, though I don't fully understand why. I've added it to the question.
Tags: scrapy web-crawler