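【Posted】:2019-04-16 09:14:10
【Question】:
I am writing a web crawler with Python Scrapy that crawls multiple pages of a tag directory and collects all tags and their articles.
So I have this parse method, which the spider runs on every page: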
import scrapy
from urllib.parse import urljoin

def parse_word(self, response):
    # look for all tags on this page
    tagscount = response.xpath('someXpath').extract()
    # check whether there is a next page
    nextPage = response.css('somecssSelector').extract()
    lastPage = response.css('somecssSelector').extract()
    # once all tags have been gathered, open every tag page and crawl it
    if not nextPage or lastPage:
        for tag in tagscount:
            # call the parse method for article crawling
            data = scrapy.Request(url=tag, callback=self.parse_subpage)
            yield data
    # if there is a next page with tags, request it with this method recursively
    else:
        # a little bit of formatting to turn the extracted list into a link string
        nextPageStr = str(nextPage)
        cutNextPageStr = nextPageStr.replace("[", "")
        cutNextPageStr = cutNextPageStr.replace("]", "")
        cutNextPageStr = cutNextPageStr.replace("'", "")
        link = urljoin(response.url, cutNextPageStr)
        # call this method again --> here I want to pass tagscount (or something like it) as a parameter
        data = scrapy.Request(url=link, callback=self.parse_word)
        yield data
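In the else branch I want to pass the tags collected so far to the parse_word method, but as written the method only uses the tags from the last page.
Can anyone help me?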
【Discussion】:
- I solved the problem. I made tagscount a class variable holding a list and refer to that variable in every method.
Tags: python parsing methods parameters scrapy