Scrapy: getting data through JSON tags (answers)

【Question title】: Scrapy get data through json tags
【Posted】: 2021-07-05 13:52:58
【Question description】:

# -*- coding: utf-8 -*-
import scrapy
from ..items import HomedepotItem
import re
import pandas as pd
import requests
import json
from bs4 import BeautifulSoup



class HomedepotSpider(scrapy.Spider):
    name = 'homeDepot'

    start_urls = ['https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560']

    def parse(self, response):
        for item in self.parseHomeDepot(response):
            yield item

    def parseHomeDepot(self, response):
        item = HomedepotItem()  # items from items.py

        jsonresponse = json.loads(response.text)
        productPrice = jsonresponse(["offers"][0]["price"])

        #item['productPrice'] = productPrice #display price and assign to variable

        yield item

I am trying to parse data from the JSON on this web page. I had an earlier question about JSON answered, and ["offers"]["price"] was the way to go, because the page's JSON is

"offers":{"@type":"Offer","url":"https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560","priceCurrency":"USD","price":1449.95,"priceValidUntil":"4/7/2021","availability":"https://schema.org/InStock"}

So now I get the error: raise JSONDecodeError("Expecting value", s, err.value) from None

Any help would be greatly appreciated!

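【Question discussion】:

You get the error because you are running json.loads on the entire web page, not just on the JSON component.
@tomjn So I would load the offers object from my JSON response and then loop through it to try to get the price?
@TowsifAhamedLabib I don't think I can use response.css, because the content is generated dynamically.
Since you mentioned a previous question, I looked at it, and it is the answer to this question. What am I missing here? You can use response.css to load the JSON, similar to the answer to your previous question...
@tomjn I did try that, but I may have loaded it incorrectly. Thanks!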
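
The point made in the first comment can be checked offline. The sketch below is not from the thread: it uses only the Python standard library instead of Scrapy, and SAMPLE_HTML, LdJsonCollector, whole_page_is_json and first_ld_json are hypothetical names invented for illustration. SAMPLE_HTML is a heavily shortened stand-in for the real page, assumed to have the same shape (one application/ld+json script inside ordinary HTML).

```python
import json
from html.parser import HTMLParser

# Hypothetical, shortened stand-in for the product page (an assumption).
SAMPLE_HTML = """<html><head>
<script type="application/ld+json">{"@type": "Product", "offers": {"@type": "Offer", "priceCurrency": "USD", "price": 1449.95}}</script>
</head><body><h1>Range Hood</h1></body></html>"""


class LdJsonCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> tag."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        # Script text may arrive in several chunks, so accumulate it.
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def whole_page_is_json(html):
    """Mirror the question's json.loads(response.text) call on raw HTML."""
    try:
        json.loads(html)
        return True
    except json.JSONDecodeError:
        return False


def first_ld_json(html):
    """Parse only the first JSON-LD block, or return None if there is none."""
    collector = LdJsonCollector()
    collector.feed(html)
    return json.loads(collector.blocks[0]) if collector.blocks else None
```

Under these assumptions, whole_page_is_json(SAMPLE_HTML) reports a decode failure because the text starts with "<html>", while first_ld_json(SAMPLE_HTML)["offers"]["price"] reads the price from the isolated fragment.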

Tags: python web-scraping beautifulsoup scrapy

【Solution 1】:

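You get this error because plain response.text is the whole HTML page, so json.loads cannot parse it; the JSON you need sits inside a <script> tag.

The JSON you want is in the first script tag of type application/ld+json.

You have to target that specific tag and then parse its contents with json.loads.

For example: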

# -*- coding: utf-8 -*-
import json

import scrapy


class HomedepotSpider(scrapy.Spider):
    name = 'homeDepot'
    start_urls = ['https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560']

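    def parse(self, response):
        # Select the text of the first JSON-LD <script> tag instead of the whole page.
        script_tag = response.xpath('//script[@type="application/ld+json"][1]/text()').get()
        # Parse only that JSON fragment; the resulting dict is yielded as the item.
        yield json.loads(script_tag)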
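
The spider above yields the whole JSON-LD object. If only the price is wanted, as in the question, something like the following sketch might help. It is not part of the original answer: extract_price is a hypothetical helper name, and it assumes that "offers" can be either a single object (as on this page) or a list of offers (which the question's ["offers"][0] indexing seems to expect), and that the script tag may be missing altogether.

```python
import json


def extract_price(script_text):
    """Return the first offer price found in a JSON-LD string, or None."""
    if not script_text:               # .get() returns None when no tag matched
        return None
    try:
        data = json.loads(script_text)
    except json.JSONDecodeError:      # the text was not valid JSON
        return None
    if not isinstance(data, dict):
        return None
    offers = data.get("offers")
    if isinstance(offers, dict):      # single offer, as in the question's JSON
        offers = [offers]
    if not isinstance(offers, list):  # no usable "offers" key
        return None
    for offer in offers:
        if isinstance(offer, dict) and "price" in offer:
            return offer["price"]
    return None
```

Inside parse, this could be used as yield {'productPrice': extract_price(script_tag)}, so a page without the expected tag produces None rather than an exception.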

Here is an example from the scrapy shell:

scrapy shell 'https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560'
...

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f2d56604160>
[s]   item       {}
[s]   request    <GET https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560>
[s]   response   <200 https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560>
[s]   settings   <scrapy.settings.Settings object at 0x7f2d56680ac0>
[s]   spider     <DefaultSpider 'default' at 0x7f2d56105850>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> script_tag = response.xpath('//script[@type="application/ld+json"][1]/text()').get()
>>> import json
>>> json.loads(script_tag)["offers"]
{'@type': 'Offer', 'url': 'https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560', 'priceCurrency': 'USD', 'price': 1449.95, 'priceValidUntil': '4/12/2021', 'availability': 'https://schema.org/InStock'}
>>> json.loads(script_tag)["offers"]["price"]
1449.95

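【Discussion】:

Thanks! Quick question: is //script[@type="application/ld+json"] the standard way to grab JSON elements from a web page, or does it vary by site?
@chrisHG The answer is: it depends. However, application/ld+json is a relatively common type of <script> that carries some payload typically consumed by JavaScript.
Got it. I Ctrl+F'd the page source and now know what to look for. Thanks for all your help, I learned a lot from your answer!
Glad I could help! Happy coding and scraping! :)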
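
As a follow-up to the "it depends" comment: some pages may carry several application/ld+json blocks, so relying on the first one can be fragile. The sketch below is an illustration only and not from the thread. find_by_type is a hypothetical helper, and it assumes a block holds either one object or a list of objects, each with an "@type" key.

```python
import json


def find_by_type(script_texts, wanted_type="Product"):
    """Return the first JSON-LD object whose @type matches, or None."""
    for text in script_texts:
        try:
            data = json.loads(text)
        except (TypeError, json.JSONDecodeError):
            continue  # skip None or malformed blocks
        # A block may hold one object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for obj in candidates:
            if not isinstance(obj, dict):
                continue
            types = obj.get("@type")
            if types == wanted_type:
                return obj
            if isinstance(types, list) and wanted_type in types:
                return obj
    return None
```

In a spider, the input could come from response.xpath('//script[@type="application/ld+json"]/text()').getall(), which returns the text of every matching tag rather than just the first.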