【Posted】: 2015-07-22 12:00:27
【Problem description】:
Basically the problem is with following links.
I start from page 1, 2, 3, 4, 5, ... up to 90.
Each page has around 100 links, and every page follows this format:
http://www.consumercomplaints.in/lastcompanieslist/page/1
http://www.consumercomplaints.in/lastcompanieslist/page/2
http://www.consumercomplaints.in/lastcompanieslist/page/3
http://www.consumercomplaints.in/lastcompanieslist/page/4
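As a side note, a minimal sketch of how those listing-page URLs could be enumerated in Python (the count of 90 pages is taken from the description above, not from the site itself):

    page_urls = [
        "http://www.consumercomplaints.in/lastcompanieslist/page/%d" % n
        for n in range(1, 91)
    ]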
This is the regular-expression matching rule:
Rule(LinkExtractor(allow='(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)'),follow=True,callback="parse_data")
I go to each page and then create a Request object to scrape all the links inside each page.
Scrapy only crawls a total of 179 links each time and then reports a finished status.
What am I doing wrong?
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import urlparse

class consumercomplaints_spider(CrawlSpider):
    name = "test_complaints"
    allowed_domains = ["www.consumercomplaints.in"]
    protocol = 'http://'

    start_urls = [
        "http://www.consumercomplaints.in/lastcompanieslist/"
    ]

    # These are the rules for matching the domain links using a regular expression; only matched links are crawled
    rules = [
        Rule(LinkExtractor(allow='(http:\/\/www\.consumercomplaints\.in\/lastcompanieslist\/page\/\d+)'), follow=True, callback="parse_data")
    ]

    def parse_data(self, response):
        # Get all the links in the page using an XPath selector
        all_page_links = response.xpath('//td[@class="compl-text"]/a/@href').extract()

        # Convert each relative page link to an absolute link (/abc.html -> www.domain.com/abc.html) and then send a Request object
        for relative_link in all_page_links:
            print "relative link processed:" + relative_link
            absolute_link = urlparse.urljoin(self.protocol + self.allowed_domains[0], relative_link.strip())
            request = scrapy.Request(absolute_link,
                                     callback=self.parse_complaint_page)
            return request

        return {}

    def parse_complaint_page(self, response):
        print "SCRAPED" + response.url
        return {}
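For comparison, here is a minimal sketch of parse_data rewritten with yield, as the last comment below suggests; the selector, attribute names, and callback are taken from the code above, and this is an assumed fix rather than a confirmed answer:

    def parse_data(self, response):
        # Same XPath selector as above to collect the per-complaint links
        all_page_links = response.xpath('//td[@class="compl-text"]/a/@href').extract()
        for relative_link in all_page_links:
            absolute_link = urlparse.urljoin(self.protocol + self.allowed_domains[0], relative_link.strip())
            # yield lets the loop keep running, so a Request is scheduled for every link;
            # return exits the method after the first link
            yield scrapy.Request(absolute_link, callback=self.parse_complaint_page)

With yield, the callback becomes a generator and Scrapy consumes every Request it produces instead of stopping at the first return.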
【Discussion】:
-
Sorry, I didn't quite get it. You need to crawl 90 links? And what are the 179 pages?
-
@Nabin Edited the question, sorry. I need to follow 90 pages, and each page has 100 links to scrape. Scrapy only scrapes 179 in total.
-
Are you sure all those 100 links inside each page are also in the same domain, i.e. the allowed_domain?
-
Yes, I'm sure. You can check the page template by appending the page number to the end of the URL, e.g. consumercomplaints.in/lastcompanieslist/page/2, and you can see the big list of links I'm trying to scrape. I get the links using an XPath selector. The code pasted here works; try running it directly to check if needed.
-
I'd be glad to see you start by using yield instead of return.
Tags: python web-crawler scrapy