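json file is not created with Python Scrapy Spider

【Question title】: json file is not created with Python Scrapy Spider
【Posted】: 2019-01-05 03:56:21
【Question description】:

What I want to do

I want to create a json file using a Scrapy spider in Python. I am currently working through "Data Visualization with Python and JavaScript". When I run the crawl, the json file is not created, and I do not know why.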

Directory structure

/root
nobel_winners   scrapy.cfg

/nobel_winners:
__init__.py     items.py    pipelines.py    spiders
__pycache__     middlewares.py    settings.py

/nobel_winners/spiders:
__init__.py     __pycache__     nwinners_list_spider.py

Workflow / code

I put the following code in nwinners_list_spider.py under /nobel_winners/spiders.

#encoding:utf-8

import scrapy

class NWinnerItem(scrapy.Item):
    country = scrapy.Field()

class NWinnerSpider(scrapy.Spider):
    name = 'nwinners_list'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

    def parse(self, response):

        h2s = response.xpath('//h2')

        for h2 in h2s:
            country = h2.xpath('span[@class="mw-headline"]/text()').extract()

I ran the following command in the root directory.

scrapy crawl nwinners_list -o nobel_winners.json

Error

The following output appears, and no data is written to the json file.

2018-07-25 10:01:53 [scrapy.core.engine] INFO: Spider opened
2018-07-25 10:01:53 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

我尝试了什么

1.在正文中，它的来源有点长，但我只检查了“国家”变量。

2.我进入scrapy shell并使用基于IPython的shell检查每个人的动作。并确认该值牢牢在“国家”。

h2s = response.xpath('//h2')

for h2 in h2s:
    country = h2.xpath('span[@class="mw-headline"]/text()').extract()
    print(country)

【Comments】:

Tags: python json scrapy

【Solution 1】:

Try using this code:

import scrapy

class NWinnerItem(scrapy.Item):
    country = scrapy.Field()

class NWinnerSpider(scrapy.Spider):
    name = 'nwinners_list'
    allowed_domains = ['en.wikipedia.org']
    start_urls = ["https://en.wikipedia.org/wiki/List_of_Nobel_laureates_by_country"]

    def parse(self, response):

        h2s = response.xpath('//h2')

        for h2 in h2s:
            # Yield an item for each heading so Scrapy can collect and export it.
            yield NWinnerItem(
                country=h2.xpath('span[@class="mw-headline"]/text()').extract_first()
            )

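Then run scrapy crawl nwinners_list -o nobel_winners.json -t json (the -t json option sets the output format explicitly; Scrapy can also infer it from the .json extension).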

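In the callback function, you parse the response (web page) and return dicts with extracted data, Item objects, Request objects, or an iterable of these objects. See the Scrapy documentation.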

That is why 0 items were scraped: the callback has to return (or yield) the items!
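To make the difference concrete, here is a minimal sketch in plain Python that does not need Scrapy. The two functions and the sample list are hypothetical stand-ins for the two versions of parse(); they only illustrate why a callback that assigns a value without yielding it hands nothing back to its caller.

```python
# Hypothetical stand-ins for the two parse() variants; no Scrapy required.
headlines = ["Argentina", "Australia", "Austria"]  # made-up sample data


def parse_without_yield(headlines):
    # Mirrors the question: the value is assigned and then discarded.
    for text in headlines:
        country = text


def parse_with_yield(headlines):
    # Mirrors the answer: each value is handed back to the caller.
    for text in headlines:
        yield {"country": text}


print(parse_without_yield(headlines))     # None
print(list(parse_with_yield(headlines)))  # three dicts, one per headline
```

Scrapy consumes whatever parse() returns in much the same way, so a callback that returns None leaves the feed exporter with nothing to write.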

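Also note that .extract() returns a list of every match for your xpath query, while .extract_first() returns only the first element of that list (or None if nothing matched).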

【Discussion】:

Thanks for the great answer. I tried your code and it works! I will add more code and keep going.