[Posted]: 2017-12-28 10:22:39
[Question]:
I want to crawl a website whose URLs follow this format:
- www.test.com/category1/123456.html (article page)
- www.test.com/category1/123457.html ..
- www.test.com/category2
- www.test.com/category3 ...
Here is the code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class ExampleSpider(CrawlSpider):
    name = "test"  # Spider name
    allowed_domains = ["test.com"]  # Which (sub-)domains shall be scraped?
    start_urls = ["https://test.com/"]  # Start with this one
    # A single string, not a list -- UserAgentMiddleware reads this attribute.
    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
    # Follow any link scrapy finds (that is allowed).
    rules = [Rule(LinkExtractor(allow=(r'/[a-z-]+/[0-9]+\.html$',)),
                  callback='parse_item', follow=True)]

    def parse_item(self, response):
        print('Got a response from %s.' % response.url)
        title = response.xpath('//title/text()').extract_first()
        post = ''.join(response.xpath('//div[@id="article_body"]/p/text()').extract())
        print('TITLE: %s \n' % title)
        print('CONTENT: %s \n' % post)
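As a quick sanity check, the `allow` regex from the rule can be tested against the URL shapes above (the `news` category URL below is made up for illustration; only `category1`/`category2` appear in the question):

```python
import re

# Same pattern as in the LinkExtractor rule.
pattern = re.compile(r'/[a-z-]+/[0-9]+\.html$')

# A letter-only category segment matches:
print(bool(pattern.search('https://www.test.com/news/123456.html')))       # True
# The character class [a-z-] contains no digits, so 'category1' does not:
print(bool(pattern.search('https://www.test.com/category1/123456.html')))  # False
# Category index pages do not match either:
print(bool(pattern.search('https://www.test.com/category2')))              # False
```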
Results:
2017-11-22 12:19:19 [scrapy.core.engine] INFO: Closing spider (finished)
2017-11-22 12:19:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 132266,
'downloader/request_count': 315,
'downloader/request_method_count/GET': 315,
'downloader/response_bytes': 9204814,
'downloader/response_count': 315,
'downloader/response_status_count/200': 313,
'downloader/response_status_count/301': 2,
'dupefilter/filtered': 21126,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 11, 22, 12, 19, 19, 295516),
'log_count/DEBUG': 318,
'log_count/INFO': 11,
'offsite/domains': 1,
'offsite/filtered': 312,
'request_depth_max': 4,
'response_received_count': 313,
'scheduler/dequeued': 315,
'scheduler/dequeued/memory': 315,
'scheduler/enqueued': 315,
'scheduler/enqueued/memory': 315,
'start_time': datetime.datetime(2017, 11, 22, 12, 14, 41, 591030)}
2017-11-22 12:19:19 [scrapy.core.engine] INFO: Spider closed (finished)
The spider stops after about a minute, and it only returns the most recent content! Is there any way to fix this?
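For context on the `offsite/*` counters in the stats above: Scrapy's offsite middleware drops requests whose host is not covered by `allowed_domains`. A simplified sketch of that host check (`is_onsite` is a made-up helper for illustration, not Scrapy's actual API):

```python
from urllib.parse import urlparse

def is_onsite(url, allowed_domains):
    """Simplified offsite check: the request host must equal an
    allowed domain or be a subdomain of one."""
    host = urlparse(url).hostname or ''
    return any(host == d or host.endswith('.' + d) for d in allowed_domains)

# www.test.com is a subdomain of test.com, so it passes:
print(is_onsite('https://www.test.com/category1/123456.html', ['test.com']))  # True
# Hosts on other domains are filtered out:
print(is_onsite('https://other.example.org/page.html', ['test.com']))         # False
```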
[Comments]:
Tags: python scrapy web-crawler scrapy-spider