Python Scrapy and yield: answers

【Question title】: Python Scrapy and yielding
【Posted】: 2017-06-07 23:16:30
【Question description】:

I am currently developing a scraper with Scrapy for the first time, and I am also using yield for the first time. I am quite confused about how yield works.

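The scraper:

Scrapes one page to get a list of date ranges
Uses these date ranges to format URLs, then scrapes another page containing listings paginated in groups of 10
I want to scrape all of the URLs that link to these groups of 10 listings

Then, on those pages, I want to scrape every listing and extract data from it. Each individual listing also has 4 "tabs" that need to be scraped.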

class MyScraper(scrapy.Spider):
    name = "myscraper"

    start_urls = [
    ]

    def parse(self, response):
        rows = response.css('table.apas_tbl tr').extract()
        for row in rows[1:]:
            soup = BeautifulSoup(row, 'lxml')
            url = soup.find_all("a")[1]['href']
            yield scrapy.Request(url, callback=self.parse_page_contents)

    def parse_page_contents(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        pages = soup.find(id='apas_form_text')
        for link in pages.find_all('a'):
            url = link['href']
            yield scrapy.Request(url, callback=self.parse_page_listings)

    def parse_page_listings(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        resultTable = soup.find("table", {"class": "apas_tbl"})

        for row in resultTable.find_all('a'):
            url = row['href']
            yield scrapy.Request(url, callback=self.parse_individual_listings)

    def parse_individual_listings(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div', {'id': 'fieldset_data'})
        data = {}
        for field in fields:
            data[field.label.text.strip()] = field.p.text.strip()

        tabs = response.xpath('//div[@id="tabheader"]').extract_first()
        soup = BeautifulSoup(tabs, 'lxml')
        links = soup.find_all("a")
        for link in links:
            yield scrapy.Request(
                urlparse.urljoin(response.url, link['href']),
                callback=self.parse_individual_tabs,
                meta={'data': data}
            )
        print data

    def parse_individual_tabs(self, response):
        data = {}
        rows = response.xpath('//div[@id="tabContent"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div', {'id': 'fieldset_data'})
        for field in fields:
            data[field.label.text.strip()] = field.p.text.strip()

        yield json.dumps(data)

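However, the scraper currently seems to have a few problems. The main concern right now is:

ERROR: Spider must return Request, BaseItem, dict or None, got 'str' in

Some URLs are also being scraped more than once. I would like to know (a) what is causing the error above, and (b) whether my yields are set up correctly.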

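【Question discussion】:

You don't have to use yield: you can simply build a list, store all the elements you would have yielded in it, and return that list.
@WillemVanOnsem While you are right, yielding values is more convenient and usually a better idea than storing them in a temporary list and then returning the whole list. Not only does yield look better, but using yield instead of return turns the function into a generator, which performs better.
@Granitosaurus: Yes, I use yield all the time (although I am the only one at my job who likes it). It is also better for memory (suppose you emit a million objects that each take a few megabytes; a list could not hold them all). But in this case the learning curve may be less steep with a list.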
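
The trade-off discussed in these comments can be illustrated with a small sketch in plain Python, outside Scrapy. The function names `squares_list` and `squares_gen` are made up for this illustration. Both produce the same values; the difference is that the list version holds every value in memory at once, while the generator version hands them out one at a time.

```python
def squares_list(n):
    """Build the whole result in memory, then return it."""
    result = []
    for i in range(n):
        result.append(i * i)
    return result


def squares_gen(n):
    """Yield one value at a time; no list of results is kept."""
    for i in range(n):
        yield i * i


# A caller that simply iterates sees the same values from both versions.
assert list(squares_gen(5)) == squares_list(5)

# The generator can feed an aggregate without building a list first.
total = sum(squares_gen(1000))
```

A Scrapy callback may be written in either style, since the framework iterates over whatever the callback returns.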

Tags: python scrapy

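【Solution 1】:

You are yielding requests in parse_individual_listings, so there is no need to yield the data in parse_individual_tabs. Because generators are lazy, you need to use return for it to run.

Corrected code: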

import json
from urllib.parse import urljoin

import scrapy
from bs4 import BeautifulSoup


class MyScraper(scrapy.Spider):
    name = "myscraper"

    start_urls = [
    ]

    def parse(self, response):
        rows = response.css('table.apas_tbl tr').extract()
        for row in rows[1:]:
            soup = BeautifulSoup(row, 'lxml')
            url = soup.find_all("a")[1]['href']
            yield scrapy.Request(url, callback=self.parse_page_contents)

    def parse_page_contents(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        pages = soup.find(id='apas_form_text')
        for link in pages.find_all('a'):
            url = link['href']
            yield scrapy.Request(url, callback=self.parse_page_listings)

    def parse_page_listings(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        resultTable = soup.find("table", {"class": "apas_tbl"})

        for row in resultTable.find_all('a'):
            url = row['href']
            yield scrapy.Request(url, callback=self.parse_individual_listings)

    def parse_individual_listings(self, response):
        rows = response.xpath('//div[@id="apas_form"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div', {'id': 'fieldset_data'})
        data = {}
        for field in fields:
            data[field.label.text.strip()] = field.p.text.strip()

        tabs = response.xpath('//div[@id="tabheader"]').extract_first()
        soup = BeautifulSoup(tabs, 'lxml')
        links = soup.find_all("a")
        for link in links:
            yield scrapy.Request(
                urljoin(response.url, link['href']),
                callback=self.parse_individual_tabs,
                meta={'data': data}
            )
        print(data)

    def parse_individual_tabs(self, response):
        data = {}
        rows = response.xpath('//div[@id="tabContent"]').extract_first()
        soup = BeautifulSoup(rows, 'lxml')
        fields = soup.find_all('div', {'id': 'fieldset_data'})
        for field in fields:
            data[field.label.text.strip()] = field.p.text.strip()

        return json.dumps(data)
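
The remark that generators are lazy can be seen in isolation with a minimal sketch in plain Python. The function `callback` and the list `calls` are hypothetical names used only here. Calling a function that contains yield creates a generator object, and none of its body runs until something iterates over it.

```python
calls = []


def callback():
    calls.append("started")    # runs only once iteration begins
    yield {"field": "value"}
    calls.append("finished")   # runs when the consumer asks for the next item


gen = callback()               # creates the generator; the body has not run yet
calls_before = list(calls)     # snapshot taken before any iteration

items = list(gen)              # iterating drives the body to completion
```

In a running spider it is Scrapy that iterates over the callback's result, so a callback written with yield is normally driven without any extra work by the author.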

【Discussion】:

【Solution 2】:

You are returning a string from one of your parse methods:

def parse_individual_tabs(self, response): 
    data = {}
    rows = response.xpath('//div[@id="tabContent"]').extract_first() 
    soup = BeautifulSoup(rows, 'lxml')
    fields = soup.find_all('div',{'id':'fieldset_data'})
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()

    yield json.dumps(data)  #<---- here

As the error message says:

ERROR: Spider must return Request, BaseItem, dict or None, got 'str' in

So just yield data, since data is a dict.
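
The type difference behind the error can be checked without Scrapy. The sketch below is only an illustration: the field values and the function `parse_individual_tabs_sketch` are invented, and its argument is a plain list of pairs rather than a real response.

```python
import json

# Hypothetical field values standing in for one scraped tab.
data = {"Status": "Decided", "Ward": "Central"}

encoded = json.dumps(data)

# json.dumps serializes the dict to text, so the result is a str ...
assert isinstance(encoded, str)
# ... while the original object is still a dict, one of the accepted types.
assert isinstance(data, dict)


def parse_individual_tabs_sketch(fields):
    """Stand-in for the callback body: collect the fields, yield the dict itself."""
    item = {}
    for label, value in fields:
        item[label.strip()] = value.strip()
    yield item


items = list(parse_individual_tabs_sketch([(" Status ", " Decided ")]))
```

If JSON output is the goal, Scrapy's feed exports (for example the -o option of scrapy crawl) can serialize the yielded dicts, so the callback itself should not need json.dumps.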

Edit:
Regarding your second question: there is a problem in your last two parse methods:

def parse_individual_listings(self, response): 
    # <..>
    data = {}
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()

    # <..>
    for link in links:
        yield scrapy.Request(
            urlparse.urljoin(response.url, link['href']), 
            callback=self.parse_individual_tabs,
            meta={'data': data}  # <-- you carry data here to below
        )

def parse_individual_tabs(self, response): 
    data = {}  # <--- here's the issue
    # instead you should retrieve data carried from above:
    data = response.meta['data']
    # <..>
    for field in fields:
        data[field.label.text.strip()] = field.p.text.strip()
    return data  # also, since there's only 1 element you can just return it instead of yielding; it makes no difference
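
The hand-off through meta can be sketched without a network. Here `types.SimpleNamespace` stands in for a Scrapy response, and the field names and the `tab_fields` parameter are assumptions made for the illustration. One choice in this sketch goes beyond the answer: it copies the carried dict before updating it. Every tab request of a listing receives a reference to the same `data` object, so updating it in place would let the tabs write into one shared dict.

```python
from types import SimpleNamespace


def parse_individual_tabs(response, tab_fields):
    """Merge fields from one tab into the data carried over from the listing."""
    # Copy first, so the dict shared by all tab requests is not mutated in place.
    data = dict(response.meta["data"])
    data.update(tab_fields)
    return data


# Hypothetical listing data collected by the previous callback.
listing_data = {"Address": "1 Example Street"}

# Stand-in for the response Scrapy would pass to the callback.
response = SimpleNamespace(meta={"data": listing_data})

merged = parse_individual_tabs(response, {"Status": "Decided"})
```

Whether to copy depends on the desired output: copying gives one item per tab, each combining the listing fields with that tab's fields, while updating the shared dict in place accumulates all tabs into a single object.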

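【Discussion】:

Thank you. That fixed the error. However, the information from each "tab" now comes out separately. It does not seem to be joined to the data dict from parse_individual_listings. Is there a mistake in how my yields are structured?
@19421608 Right, I see. Take a look at my edit about this.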