【Posted at】: 2020-07-30 06:04:27
【Problem description】:
I am trying to get a Scrapy spider to crawl a website's pages in the same order in which my spider code issues the Request() calls.
This is similar to this question: Scrapy Crawl URLs in Order
I went through the answers to that question and tried them, but none of them work the way I need.
My problem is that I need to scrape tables on a page. Each table has a set of <tr> rows, and one of the values in each row is an href to another page. The first callback method scrapes the table and then uses the hrefs to issue follow-up Request() calls to the other pages, so the first page leads to requests for many other pages. I use the meta keyword to pass the data from the first callback method to the second callback method in a dict.
The second callback method scrapes the content of its page and adds the parsed data to the dict passed to it. But the data from the first callback is not always for the same game as the data from the second callback.
The first page's HTML document looks like this:
# Game Schedule page
<html>
<body>
<div>
<table type="games">
<tbody>
<tr row="1">
<th data="week_number">1</th>
<td data="date">"9/13/2020"</td>
<td data="game_id">1</td>
<td data="game_summary"><a href="/game/20200913_01.html">game stats</a></td>
</tr>
<tr row="2">
<th data="week_number">1</th>
<td data="date">"9/13/2020"</td>
<td data="game_id">2</td>
<td data="game_summary"><a href="/game/20200913_02.html">game stats</a></td>
</tr>
<tr row="3">
<th data="week_number">1</th>
<td data="date">"9/13/2020"</td>
<td data="game_id">3</td>
<td data="game_summary"><a href="/game/20200913_03.html">game stats</a></td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
Of course, there are more than 3 <tr> rows in the real table. Each game stats page looks like this:
# A sample Game Stat summary page
<html>
<body>
<h1>Team A @ Team B</h1>
<div class="game stat">
<table type="game stat">
<tbody>
<tr row="1">
<td data="date">"9/13/2020"</td>
<td data="game_id">1</td>
<td data="game_time">2:00PM EST</td>
<td data="visit_team">Team A</td>
<td data="visit_team_score">43</td>
<td data="home_team">Team B</td>
<td data="home_team_score">53</td>
</tr>
</tbody>
</table>
</div>
</body>
</html>
My spider scrapes the Game Schedule page, using XPath to parse each <tr>:
import os
import sys
import scrapy                      # needed for scrapy.Spider below
from lxml import etree, html
from scrapy.http import Request
from scrapy.loader import ItemLoader


class TestSpider(scrapy.Spider):
    name = "test_spider"
    season_flag = False
    season_val = ""

    """
    I need to override the __init__() method of scrapy.Spider
    because I need to define some attributes/variables
    """
    def __init__(self, *a, **kw):
        super(TestSpider, self).__init__(*a, **kw)
        self.season_flag = False
        self.debug_flag = False
        self.season_val = ""
        self.game_list = list()
        self.game_dict = dict()
        if hasattr(self, "season"):
            self.season_val = str(self.season)
            self.season_flag = True
        else:
            self.log("No season argument. Exiting")
            sys.exit(1)
        if hasattr(self, "debug"):
            if self.debug:
                self.debug_flag = True

    """
    Start the request by starting the scraping at the
    page that has the game schedule in a table
    """
    def start_requests(self):
        url_list = [
            "https://somewebsite.com/2019.GameSchedule.htm"
        ]
        for url in url_list:
            yield Request(url=url,
                          callback=self.parse_schedule_summary_page)
The callback method parse_schedule_summary_page() parses each <tr> in the Game Schedule page.
It then makes a "yield Request()" call, and one of the arguments to Request() is game_dict, passed via the "meta" keyword.
def parse_schedule_summary_page(self, response):
    """
    Convert the response object to an lxml tree object.
    """
    decoded = response.body.decode('utf-8')
    html_tree = html.fromstring(decoded)

    # This extracts all the <tr>s from the 'games' table and stores them in a list
    l_game_elem_list = html_tree.xpath("//table[@type = 'games']/tbody/tr")

    # Iterate through each of the <tr> elements
    for l_game_elem in l_game_elem_list:
        game_dict = dict()
        """
        Parse the week number, date, game id, and URL to
        the game stat page
        """
        # Note: the attribute in the markup is 'week_number', not 'week_num'
        p_weeknum = l_game_elem.xpath(".//th[@data = 'week_number']/text()")
        p_date = l_game_elem.xpath(".//td[@data = 'date']/text()")
        p_game_id = l_game_elem.xpath(".//td[@data = 'game_id']/text()")
        summary_url = l_game_elem.xpath("string(.//a[text() = 'game stats']/@href)")
        game_dict['week_num'] = p_weeknum
        game_dict['date'] = p_date
        game_dict['game_id'] = p_game_id

        # This is where the code gets wonky
        yield Request(summary_url, priority=5, meta={'dict': game_dict},
                      callback=self.parse_game_page)
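As a side note on the extraction itself: `xpath(".../text()")` returns a list, so `game_dict['game_id']` holds something like `['1']` rather than `'1'`. Below is a standalone sketch of the row extraction against the sample schedule page from the question (plain lxml, no Scrapy), using `string(...)` so each field comes back as a scalar; the markup is assumed to be exactly as shown above:

```python
from lxml import html

# Trimmed copy of the sample Game Schedule page from the question
SCHEDULE_HTML = """
<html><body><div>
<table type="games"><tbody>
<tr row="1">
  <th data="week_number">1</th>
  <td data="date">"9/13/2020"</td>
  <td data="game_id">1</td>
  <td data="game_summary"><a href="/game/20200913_01.html">game stats</a></td>
</tr>
<tr row="2">
  <th data="week_number">1</th>
  <td data="date">"9/13/2020"</td>
  <td data="game_id">2</td>
  <td data="game_summary"><a href="/game/20200913_02.html">game stats</a></td>
</tr>
</tbody></table>
</div></body></html>
"""

def parse_schedule(page):
    """Return one dict per <tr> of the 'games' table."""
    tree = html.fromstring(page)
    games = []
    for row in tree.xpath("//table[@type = 'games']/tbody/tr"):
        games.append({
            # string(...) returns a plain str, not a one-element list
            "week_num": row.xpath("string(.//th[@data = 'week_number'])"),
            "date": row.xpath("string(.//td[@data = 'date'])").strip('"'),
            "game_id": row.xpath("string(.//td[@data = 'game_id'])"),
            "href": row.xpath("string(.//a[text() = 'game stats']/@href)"),
        })
    return games
```

With per-row dicts built this way, each follow-up Request carries scalar values, which makes a later game_id comparison in the second callback straightforward.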
parse_schedule_summary_page() then makes the Request call to the Game Stat summary page, with parse_game_page as its callback:
def parse_game_page(self, response):
    game_dict = response.meta.get('dict')
    """
    Convert the response object to an lxml tree object.
    """
    decoded = response.body.decode('utf-8')
    html_tree = html.fromstring(decoded)

    game_date = html_tree.xpath("//td[@data = 'date']/text()")
    game_id = html_tree.xpath("//td[@data = 'game_id']/text()")
    game_time = html_tree.xpath("//td[@data = 'game_time']/text()")
    v_team = html_tree.xpath("//td[@data = 'visit_team']/text()")
    v_team_score = html_tree.xpath("//td[@data = 'visit_team_score']/text()")
    h_team = html_tree.xpath("//td[@data = 'home_team']/text()")
    h_team_score = html_tree.xpath("//td[@data = 'home_team_score']/text()")

    game_dict['game_time'] = game_time
    game_dict['x_game_id'] = game_id
    # I copy the rest of the values I parsed via XPath into
    # the game_dict dictionary. I won't repeat the code here
    # for brevity's sake.

    # Here's where I print it out, for debugging purposes
    stmt = "==**==**==**==\n"
    stmt += str(game_dict)
    stmt += "\n**==**==**==**"
    self.log(stmt)
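Independent of request ordering, a cheap safety net is to compare the game_id carried in meta against the game_id parsed from the stats page before merging, so a mismatch fails loudly instead of producing a silently wrong record. A minimal sketch with plain dicts (the function name is mine, not from the question):

```python
def merge_game(schedule_entry, stats_entry):
    """Merge a schedule row and a stats row, refusing mismatched games."""
    if schedule_entry["game_id"] != stats_entry["game_id"]:
        raise ValueError(
            "game_id mismatch: schedule=%r, stats=%r"
            % (schedule_entry["game_id"], stats_entry["game_id"])
        )
    merged = dict(schedule_entry)   # copy so the input dict stays untouched
    merged.update(stats_entry)
    return merged
```

In the spider this check would live at the top of parse_game_page(), right after reading response.meta.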
From the output of the self.log(stmt) statements, I noticed that the game_id entry and the x_game_id entry are different when they should be the same:
==**==**==**==
{'v_team': 'Team A', 'game_time': '6:30PM','game_id': '1', 'h_team_score': '53',
'h_team': 'Team B', 'week_num': '1', 'date': '9/13/2020', 'x_game_id': '7', 'v_team_score': '43'}
**==**==**==**
The game_id from parse_schedule_summary_page() does not match the x_game_id from parse_game_page(). This happens for most of the games, but not all of them.
According to the earlier question, this is because Scrapy cannot guarantee the order in which the URLs are visited.
Following its suggestions, I first changed this setting in my settings.py file:
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1
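For what it's worth, concurrency alone is not the whole story here: per the Scrapy FAQ, crawl order is governed by the scheduler queues, which pop pending requests in depth-first order by default. The documented breadth-first (FIFO) configuration looks like this in settings.py:

```python
# settings.py -- crawl in breadth-first (FIFO) order, one request at a time
CONCURRENT_REQUESTS = 1

DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoDiskQueue"
SCHEDULER_MEMORY_QUEUE = "scrapy.squeues.FifoMemoryQueue"
```

Even with these settings, retries and varying response times can still reorder callbacks, which is why carrying the key in meta (as the question already does) is the more robust pairing mechanism.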
That didn't help; when I ran the spider with this setting, the game data was still out of sync.
I tried setting the priority of the Request() in parse_schedule_summary_page(), but that didn't fix the problem either.
So I tried another suggestion and changed this code:
yield Request(summary_url, priority=5, meta={'dict': game_dict},
              callback=self.parse_game_page)
to this:
return [Request(summary_url, priority=5, meta={'dict': game_dict},
                callback=self.parse_game_page)]
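The effect of that last change follows from plain Python: a `return` inside the loop exits the method on the first iteration, while `yield` produces one item per row and then resumes the loop. A stripped-down illustration (no Scrapy; placeholder strings stand in for Request objects):

```python
def schedule_with_return(rows):
    for row in rows:
        # return exits the function on the FIRST iteration
        return ["Request(%s)" % row]

def schedule_with_yield(rows):
    for row in rows:
        # yield hands one item back and resumes the loop afterwards
        yield "Request(%s)" % row
```

So with `return`, Scrapy only ever receives one follow-up request per schedule page.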
By not using the yield keyword, only the first <tr> from the Game Schedule page gets scraped, because the method returns after the first iteration of the loop.
How can I make the Request() calls in parse_schedule_summary_page() invoke parse_game_page() so that each game's schedule data ends up paired with the stats from its own game page?
【Comments】:
-
I've tried different ways to solve this problem, but none of them work. The only way seems to be to drop the separate parse_game_page callback and use the scrapy_inline_request module. Does anyone have other options or ideas?
-
Just a guess on my part, but another approach would be to use BeautifulSoup inside my Spider. That seems too inelegant, though.