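【Posted】: 2017-02-08 06:06:58
【Question】:
I'm trying to scrape the Billboard Hot 100 for every year. I have a file that works for one year at a time, but I'd like it to crawl through all the years and collect that data. Here is my current code: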
from scrapy import Spider
from scrapy.selector import Selector
from Billboard.items import BillboardItem
from scrapy.exceptions import CloseSpider
from scrapy.http import Request

URL = "http://www.billboard.com/archive/charts/%/hot-100"

class BillboardSpider(Spider):
    name = 'Billboard_spider'
    allowed_urls = ['http://www.billboard.com/']
    start_urls = [URL % 1958]

    def _init_(self):
        self.page_number=1958

    def parse(self, response):
        print self.page_number
        print "----------"
        rows = response.xpath('//*[@id="block-system-main"]/div/div/div[2]/table/tbody/tr').extract()
        for row in rows:
            IssueDate = Selector(text=row).xpath('//td[1]/a/span/text()').extract()
            Song = Selector(text=row).xpath('//td[2]/text()').extract()
            Artist = Selector(text=row).xpath('//td[3]/a/text()').extract()
            item = BillboardItem()
            item['IssueDate'] = IssueDate
            item['Song'] = Song
            item['Artist'] = Artist
            yield item
        self.page_number += 1
        yield Request(URL % self.page_number)
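But I get this error: "start_urls = [URL % 1958] ValueError: unsupported format character '/' (0x2f) at index 41"
Any ideas? I'd like the code to automatically change the year in the original "URL" link to 1959 and continue year by year until it stops finding a table, then close.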
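For reference, the error can be reproduced outside Scrapy. The `%` operator expects a conversion character such as `d` or `s` after each `%`, and in the template above the `%` is followed directly by `/`. The sketch below assumes the year is meant to be substituted as an integer with `%d`; the names `BROKEN_URL`, `FIXED_URL`, and `year_url` are hypothetical and introduced only for illustration.

```python
# Template as posted: "%" is followed by "/", which is not a valid conversion.
BROKEN_URL = "http://www.billboard.com/archive/charts/%/hot-100"

# Assumed fix: "%d" marks where the integer year is substituted.
FIXED_URL = "http://www.billboard.com/archive/charts/%d/hot-100"


def year_url(year):
    """Return the archive URL for one chart year (hypothetical helper)."""
    return FIXED_URL % year


try:
    BROKEN_URL % 1958
    error_text = None
except ValueError as exc:
    # Same ValueError the spider raises while the class body is evaluated.
    error_text = str(exc)

print(error_text)
print(year_url(1958))
```

Using `str.format` with a `{}` placeholder would work equally well; `%d` is shown only because it is the smallest change to the posted code.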
【Discussion】:
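The "continue year by year until no table is found" behavior can be sketched in plain Python, independent of Scrapy. This is a minimal illustration, not the spider itself: `crawl_years` and `fake_fetch` are hypothetical names, `fetch_rows` stands in for whatever downloads a page and extracts its table rows, and the URL template assumes a `%d` placeholder for the year.

```python
# Assumed template with a valid "%d" placeholder for the year.
URL = "http://www.billboard.com/archive/charts/%d/hot-100"


def crawl_years(fetch_rows, first_year=1958):
    """Yield (year, rows) for consecutive years, stopping at the first empty year.

    fetch_rows is a hypothetical callable that takes a URL and returns the
    list of table rows on that page (an empty list when there is no table).
    """
    year = first_year
    while True:
        rows = fetch_rows(URL % year)
        if not rows:
            # No table for this year: stop instead of requesting the next one.
            return
        yield year, rows
        year += 1


def fake_fetch(url):
    """Stand-in for a real download: pretend only 1958-1960 have a table."""
    year = int(url.split("/")[-2])
    return ["row"] if year <= 1960 else []


years = [year for year, _ in crawl_years(fake_fetch)]
print(years)
```

In the spider, the same check would go in `parse`: return when `rows` is empty, otherwise yield the items and then a `Request` for the following year. Once no further requests are scheduled, Scrapy closes the spider on its own. Two other details in the posted code are worth checking: the constructor has to be spelled `__init__` with double underscores to be called at all, and the Scrapy attribute for restricting domains is `allowed_domains`, not `allowed_urls`.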
Tags: python web-scraping scrapy scrapy-spider