【Posted】: 2019-02-06 21:17:39
【Question】:
I created a spider by extending CrawlSpider.
When the spider runs and lands on an article page, I want to grab the link to the author's profile, issue a request for that profile page, and parse it with parse_author, but for some reason the parse_author callback is never executed.
My code:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http.request import Request


class CityamSpider4(CrawlSpider):
    name = 'city_am_v4'
    custom_settings = {
        'CONCURRENT_REQUESTS': '1',
    }
    allowed_domains = ['cityam.com']
    start_urls = [
        'http://www.cityam.com',
    ]
    rules = (
        Rule(LinkExtractor(deny=('dev2.cityam.com', 'sponsored-content', )), callback='parse_item'),
    )

    def parse_item(self, response):
        # Parse an article page.
        article_title = response.css('.article-headline h1::text').extract_first(default='null').strip()
        if article_title != 'null':  # compare strings with !=, not 'is not'
            print 'Article url: ' + response.url
            author_url = response.css('.author-container .author-text a.author-name::attr(href)').extract_first(default='null').strip()
            print 'Author link: ' + author_url
            author_url = response.urljoin(author_url)
            print 'Author link: ' + author_url
            yield Request(author_url, callback=self.parse_author)

    def parse_author(self, response):
        # Parse an author profile page.
        author_name = response.css('.cam-profile-header-title::text').extract_first(default='null').strip()
        print 'Author name: ' + author_name
        yield {
            'name': author_name,
        }
【Comments】:
-
Seems to me the Python version shouldn't roll backwards over time; who still uses version 2?
-
Only Galecio. Maybe he's right. Go back to Python 2 and you'll be happy.
Tags: scrapy web-crawler scrapy-spider