[Posted]: 2018-12-19 00:20:48
[Problem description]:
I need all internal links from every page of a website for analysis. I searched many similar questions and found this code via Mithu, which gives a possible answer. However, it does not return all the links reachable beyond the second level of page depth. The generated csv file contains only 676 records, while the website has 1000.
import csv
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from eylinks.items import LinkscrawlItem

# newline='' avoids blank line gaps in the generated csv file
outfile = open("data.csv", "w", newline='')
writer = csv.writer(outfile)

class ToscrapeSpider(scrapy.Spider):
    name = "toscrapesp"
    start_urls = ["http://books.toscrape.com/"]

    rules = ([Rule(LinkExtractor(allow=r".*"), callback='parse', follow=True)])

    def parse(self, response):
        # extract every link on the page that stays within toscrape.com
        extractor = LinkExtractor(allow_domains='toscrape.com')
        links = extractor.extract_links(response)
        for link in links:
            yield scrapy.Request(link.url, callback=self.collect_data)

    def collect_data(self, response):
        global writer
        for item in response.css('.product_pod'):
            product = item.css('h3 a::text').extract_first()
            value = item.css('.price_color::text').extract_first()
            lnk = response.url
            stats = response.status
            print(lnk)
            yield {'Name': product, 'Price': value, "URL": lnk, "Status": stats}
            writer.writerow([product, value, lnk, stats])
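A likely cause, offered here as an assumption rather than something confirmed in the post: `rules` are only honored by a `CrawlSpider`, not by a plain `scrapy.Spider`, and `collect_data` never yields follow-up requests, so the crawl stops at the links found on the start pages. A minimal sketch of a `CrawlSpider` variant that recurses through the whole site (the class and spider names are hypothetical, not from the original post):

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BooksCrawlSpider(CrawlSpider):  # hypothetical name, not from the post
    name = "toscrape_full"
    allowed_domains = ["toscrape.com"]
    start_urls = ["http://books.toscrape.com/"]

    # CrawlSpider applies these rules to every crawled page, so each
    # extracted link is both scraped (callback) and followed (follow=True),
    # letting the crawl recurse past the second level of depth.
    rules = (
        Rule(LinkExtractor(allow_domains="toscrape.com"),
             callback="collect_data", follow=True),
    )

    # Note: a CrawlSpider must not override parse(); the rules rely on it.
    def collect_data(self, response):
        for item in response.css(".product_pod"):
            yield {
                "Name": item.css("h3 a::text").extract_first(),
                "Price": item.css(".price_color::text").extract_first(),
                "URL": response.url,
                "Status": response.status,
            }

Running this with Scrapy's built-in feed export, e.g. scrapy crawl toscrape_full -o data.csv, would also replace the module-level csv.writer: the feed exporter writes one row per yielded item and manages the file itself, which sidesteps the blank-line problem the original comment mentions.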
[Discussion]:
Tags: python-3.x scrapy scrapy-spider