[Posted]: 2017-08-22 08:55:14
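[Question]:
I have a web page to scrape. The page contains a list of links inside a <table>. I am trying to use the rules section to make Scrapy follow those links and collect the data from the pages they point to. Here is my code: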
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

# BTCItem is the project's item class; its definition is not shown in the question.


class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'coinmarketcap'
    start_urls = [
        'https://coinmarketcap.com/currencies/views/all/'
    ]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=('//tr/td[2]/a/@href',)), callback="parse", follow=True),
    )

    def parse(self, response):
        print("TEST TEST TEST")
        BTC = BTCItem()
        BTC['source'] = str(response.request.url).split("/")[2]
        BTC['asset'] = str(response.request.url).split("/")[4],
        BTC['asset_price'] = response.xpath('//*[@id="quote_price"]/text()').extract(),
        BTC['asset_price_change'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/span[2]/text()').extract(),
        BTC['BTC_price'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/small[1]/text()').extract(),
        BTC['Prct_change'] = response.xpath('/html/body/div[2]/div/div[1]/div[3]/div[2]/small[2]/text()').extract()
        yield (BTC)
My problem is that Scrapy is not following the links. It only requests the one link and tries to extract the data from that page. What am I missing?
Update #1: Why do some log lines say "Crawled" while others say "Scraped"?
2017-03-28 23:10:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://coinmarketcap.com/currencies/pivx/> (referer: None)
2017-03-28 23:10:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://coinmarketcap.com/currencies/zcash/> (referer: None)
2017-03-28 23:10:33 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://coinmarketcap.com/currencies/bitcoin/> (referer: None)
2017-03-28 23:10:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://coinmarketcap.com/currencies/nem/>
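Separately from the link-following issue, several assignments in the callback above (asset, asset_price, asset_price_change and BTC_price) end with a trailing comma. In Python that wraps the value in a one-element tuple, so the scraped items would probably not hold the plain values that were intended. A small standalone illustration, using one of the URLs from the log as sample input:

```python
url = 'https://coinmarketcap.com/currencies/bitcoin/'

# A trailing comma turns the right-hand side into a one-element tuple.
with_comma = url.split('/')[4],
# Without the comma the value stays a plain string.
without_comma = url.split('/')[4]

print(type(with_comma).__name__, with_comma)        # tuple ('bitcoin',)
print(type(without_comma).__name__, without_comma)  # str bitcoin
```

Dropping the trailing commas would keep each field as the string or list that the expression itself returns.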
[Discussion]:
Tags: python web-scraping scrapy scrapy-spider