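Reading JSON from the web with Scrapy in Python 2

[Question title]: Read JSON from the web with Scrapy in Python 2
[Posted]: 2018-05-05 16:53:46
[Question description]:

I want to extract JSON data from a web page, so I inspected the page source. The data I need is stored in the following format: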

<script type="application/ld+json">
    {
     'data I want to extract'
    }
</script>

I tried the following:

import scrapy
import json

class OpenriceSpider(scrapy.Spider):
    name = 'openrice'
    allowed_domains = ['www.openrice.com']

    def start_requests(self):
        headers = {
            'accept-encoding': 'gzip, deflate, sdch, br',
            'accept-language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
            'upgrade-insecure-requests': '1',
            'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'cache-control': 'max-age=0',
        }
        url = 'https://www.openrice.com/en/hongkong/r-kitchen-one-cafe-sha-tin-western-r483821'
        yield scrapy.Request(url=url, headers=headers, callback=self.parse)

    def parse(self, response):  # response = request url ?
        items = []
        jsonresponse = json.loads(response)

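But it does not work. How should I change it?

[Question comments]:

Tags: python json python-2.7 web-scraping scrapy

[Solution 1]:

You need to locate the script element in the HTML source, extract its text, and only then load it with json.loads(). Passing the response object itself fails because json.loads() expects a string, not a Response: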

script = response.xpath("//script[@type='application/ld+json']/text()").extract_first()
json_data = json.loads(script)
print(json_data)

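Here I use the relatively uncommon application/ld+json type to locate the script, but there are many other options. For example, you can locate the script by some text that you know appears inside it: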

//script[contains(., 'Restaurant')]

[Comments]: