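【Posted】:2020-06-15 10:54:39
【Question】:
I'm building a spider with Scrapy to scrape details from rottentomatoes.com. Because the search page is rendered dynamically, I use the Rotten Tomatoes API, e.g. https://www.rottentomatoes.com/api/private/v2.0/search?q=inception, to get the search results and URLs. By following those URLs with Scrapy I can extract the Tomatometer score, audience score, director, cast, and so on. However, I also want to extract all audience reviews. The problem is that the audience reviews page (https://www.rottentomatoes.com/m/inception/reviews?type=user) is paginated, and I can't extract data from the following pages; I also can't find a way to get these details through the API. Can anyone help me with this?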
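For illustration, the search URL mentioned above could be built for an arbitrary query roughly as follows. Only the endpoint and the `q` parameter are taken from the example URL in the question; the helper name `build_search_url` is hypothetical, and because this is a private API its behaviour may change without notice.

```python
from urllib.parse import urlencode

# Endpoint copied from the example URL in the question.
SEARCH_ENDPOINT = "https://www.rottentomatoes.com/api/private/v2.0/search"


def build_search_url(query: str) -> str:
    """Return the search URL for `query`, URL-encoding the q parameter."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': query})}"


url = build_search_url("inception")
```

Using `urlencode` rather than string concatenation keeps titles containing spaces or punctuation valid in the query string.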
import sys  # required by sys.exc_info() in the except block


# Method of MoviecrawlSpider (class body omitted).
def parseRottenDetail(self, response):
    print("Reached Tomato Parser")
    try:
        if MoviecrawlSpider.current_parse <= MoviecrawlSpider.total_results:
            items = TomatoCrawlerItem()
            # Entry for the title currently being parsed.
            entry = MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]
            # default='' avoids calling .strip() on None when a selector matches nothing.
            entry['tomatometerScore'] = response.css(
                '.mop-ratings-wrap__row .mop-ratings-wrap__half .mop-ratings-wrap__percentage::text'
            ).get(default='').strip()
            entry['tomatoAudienceScore'] = response.css(
                '.mop-ratings-wrap__row .mop-ratings-wrap__half.audience-score .mop-ratings-wrap__percentage::text'
            ).get(default='').strip()
            entry['tomatoCriticConsensus'] = response.css('p.mop-ratings-wrap__text--concensus::text').get()
            if entry["type"] == "Movie":
                entry['Director'] = response.xpath(
                    "//ul[@class='content-meta info']/li[@class='meta-row clearfix']/div[contains(text(),'Directed By')]/../div[@class='meta-value']/a/text()").get()
            else:
                entry['Director'] = response.xpath(
                    "//div[@class='tv-series__series-info-castCrew']/div/span[contains(text(),'Creator')]/../a/text()").get()
            reviews_page = response.css('div.mop-audience-reviews__view-all a[href*="reviews"]::attr(href)').get()
            # .get() returns None when there is no link, so test truthiness instead of len().
            if reviews_page:
                yield response.follow(reviews_page, callback=self.parseRottenReviews)
            else:
                # No reviews page: emit the item now, skipping the bookkeeping keys.
                for key in entry.keys():
                    if "pageURL" not in key and "type" not in key:
                        items[key] = entry[key]
                yield items
                if MoviecrawlSpider.current_parse <= MoviecrawlSpider.total_results:
                    MoviecrawlSpider.current_parse += 1
                    print("Parse Values are Current Parse " + str(
                        MoviecrawlSpider.current_parse) + " and Total Results " + str(MoviecrawlSpider.total_results))
                    # Move on to the next title in the list.
                    yield response.follow(MoviecrawlSpider.parse_rotten_list[MoviecrawlSpider.current_parse]["pageURL"],
                                          callback=self.parseRottenDetail)
    except Exception as e:
        exc_type, exc_obj, exc_tb = sys.exc_info()
        print(e)
        print(exc_tb.tb_lineno)
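After this code runs, I reach the reviews page, e.g. https://www.rottentomatoes.com/m/inception/reviews?type=user. From there, a Next button loads the following page via pagination. What approach should I use to extract all the reviews?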
def parseRottenReviews(self, response):
    print("Reached Rotten Review Parser")
    items = TomatoCrawlerItem()
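One common way to handle a Next button that loads data with JavaScript is to find the request the button triggers (for example in the browser's network tab) and keep requesting pages until the response no longer points to a following page. The sketch below shows only that loop, in plain Python so it runs without a network connection. Everything in it is an assumption made for illustration: the page shape (`reviews`, `next_cursor`), the `fetch_page` callable, and the fake data are hypothetical and do not describe the actual Rotten Tomatoes responses.

```python
from typing import Callable, Dict, Iterator, Optional

# Hypothetical page shape: {"reviews": [...], "next_cursor": "<token>" or None}
Page = Dict[str, object]


def iter_all_reviews(fetch_page: Callable[[Optional[str]], Page],
                     max_pages: int = 1000) -> Iterator[dict]:
    """Yield every review, following the cursor until no next page is reported."""
    cursor = None  # None asks for the first page
    for _ in range(max_pages):  # hard cap guards against an endless cursor loop
        page = fetch_page(cursor)
        yield from page.get("reviews", [])
        cursor = page.get("next_cursor")
        if not cursor:
            break


# Stand-in for the real network call (made-up data).
FAKE_PAGES = {
    None: {"reviews": [{"id": 1}, {"id": 2}], "next_cursor": "p2"},
    "p2": {"reviews": [{"id": 3}], "next_cursor": "p3"},
    "p3": {"reviews": [{"id": 4}], "next_cursor": None},
}


def fake_fetch(cursor: Optional[str]) -> Page:
    return FAKE_PAGES[cursor]


all_reviews = list(iter_all_reviews(fake_fetch))
```

In a Scrapy spider the same idea would usually live in the callback itself: parse the reviews from the current response, and yield a new request back to the same callback for as long as a next page exists.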
【Discussion】:
-
This was really helpful, thank you very much :D
Tags: python python-3.x web-scraping scrapy