【Question Title】: Scrapy get data through json tags
【Posted】: 2021-07-05 13:52:58
【Question】:
# -*- coding: utf-8 -*-
import scrapy
from ..items import HomedepotItem
import re
import pandas as pd
import requests
import json
from bs4 import BeautifulSoup



class HomedepotSpider(scrapy.Spider):
    name = 'homeDepot'


    start_urls = ['https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560']
     


    def parse(self, response):


        for item in self.parseHomeDepot(response):
            yield item
        pass

    def parseHomeDepot(self, response):
        item = HomedepotItem() #items from items.py


        jsonresponse = json.loads(response.text)
        productPrice = jsonresponse(["offers"][0]["price"])
        

     
        #item['productPrice'] = productPrice #display price and assign to variable
   

        yield item

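I am trying to parse data from the JSON embedded in this web page. I had a previous question about this JSON answered, and ["offers"]["price"] is the way to go, because the page's JSON is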

"offers":{"@type":"Offer","url":"https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560","priceCurrency":"USD","price":1449.95,"priceValidUntil":"4/7/2021","availability":"https://schema.org/InStock"}

So now I get the error: raise JSONDecodeError("Expecting value", s, err.value) from None

Any help would be greatly appreciated!

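【Comments】:

  • You get the error because you are running json.loads on the entire web page, not just the JSON component.
  • @tomjn So I would load the offers object from my JSON response and then loop over it to try to get the price?
  • @TowsifAhamedLabib I don't think I can use response.css, because the content is dynamically generated.
  • Since you mentioned a previous question, I looked at it, and it already answers this one. What am I missing here? You can load the JSON using response.css, similar to the answer to your previous question...
  • @tomjn I did try that, but I probably loaded it incorrectly. Thanks!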
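
To illustrate the point made in the first comment, here is a minimal sketch using only the standard library. The `html_page` string is a hypothetical stand-in for `response.text`, not the real Home Depot page, and `try_parse` is a helper name invented for this example.

```python
import json

# Hypothetical stand-in for response.text: an HTML page with embedded JSON-LD.
html_page = (
    '<html><head>'
    '<script type="application/ld+json">{"offers": {"price": 1449.95}}</script>'
    '</head><body></body></html>'
)


def try_parse(text):
    """Return the parsed JSON, or None if the text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# The whole page starts with "<", so the JSON decoder rejects it.
whole_page_result = try_parse(html_page)

# Only the text between the <script> tags is valid JSON.
start = html_page.index(">", html_page.index("<script")) + 1
end = html_page.index("</script>")
script_result = try_parse(html_page[start:end])
```

The string slicing here is only for demonstration; in a spider, a selector is the more reliable way to isolate the script contents.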

Tags: python web-scraping beautifulsoup scrapy


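【Solution 1】:

You get this error because plain response.text is the whole HTML page, not the JSON embedded in a <script> tag.

The JSON you want is in the first script tag whose type is application/ld+json.

You have to target that specific tag and then parse its contents with json.loads.

For example: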

# -*- coding: utf-8 -*-
import json

import scrapy


class HomedepotSpider(scrapy.Spider):
    name = 'homeDepot'
    start_urls = ['https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560']

    def parse(self, response):
        script_tag = response.xpath('//script[@type="application/ld+json"][1]/text()').get()
        yield json.loads(script_tag)

Here is an example from the scrapy shell:

scrapy shell 'https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560'
...

[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7f2d56604160>
[s]   item       {}
[s]   request    <GET https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560>
[s]   response   <200 https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560>
[s]   settings   <scrapy.settings.Settings object at 0x7f2d56680ac0>
[s]   spider     <DefaultSpider 'default' at 0x7f2d56105850>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> script_tag = response.xpath('//script[@type="application/ld+json"][1]/text()').get()
>>> import json
>>> json.loads(script_tag)["offers"]
{'@type': 'Offer', 'url': 'https://www.homedepot.com/p/ZLINE-Kitchen-and-Bath-36-DuraSnow-Stainless-Steel-Range-Hood-with-Hand-Hammered-Copper-Shell-8654HH-36-8654HH-36/311287560', 'priceCurrency': 'USD', 'price': 1449.95, 'priceValidUntil': '4/12/2021', 'availability': 'https://schema.org/InStock'}
>>> json.loads(script_tag)["offers"]["price"]
1449.95
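
The shell session shows that "offers" is a plain object here, so the price is reached with ["offers"]["price"] rather than by indexing into a list. A defensive version of that lookup might look like the sketch below. `extract_price` is a hypothetical helper name, and the list handling is an assumption about other pages (JSON-LD permits a property to hold either one object or a list), not something observed on this one.

```python
import json


def extract_price(script_text):
    """Return the offer price from a JSON-LD string, or None if unavailable."""
    if not script_text:  # .get() returns None when no tag matched
        return None
    try:
        data = json.loads(script_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    offers = data.get("offers")
    # Assumption: some pages may publish "offers" as a list of objects.
    if isinstance(offers, list):
        offers = offers[0] if offers else None
    if isinstance(offers, dict):
        return offers.get("price")
    return None


# Trimmed-down sample shaped like the "offers" object shown above.
sample = '{"offers": {"@type": "Offer", "priceCurrency": "USD", "price": 1449.95}}'
price = extract_price(sample)
```

In the spider, the argument would be the string returned by `response.xpath('//script[@type="application/ld+json"][1]/text()').get()`, and the result could be assigned to `item['productPrice']`.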

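【Comments】:

  • Thanks! Quick question: is //script[@type="application/ld+json"] the standard way to grab JSON elements from a web page, or does it vary by website?
  • @chrisHG The answer is: it depends. However, application/ld+json is a relatively common type of <script> that carries a payload usually consumed by JavaScript.
  • Got it. I ctrl+f'd the page source and now know what to look for. Thanks for all your help, I learned a lot from your answer!
  • Glad to help! Happy coding and scraping! :)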
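
Because the layout varies by site, a page may contain several application/ld+json blocks, and the one you want is not always the first. The following is a rough sketch of one way to pick a block by its "@type" value. `find_json_ld` is a hypothetical helper, and the "Product" type and the sample strings are assumptions for illustration, not taken from the Home Depot page.

```python
import json


def find_json_ld(script_texts, wanted_type="Product"):
    """Return the first JSON-LD object whose @type matches, else None."""
    for text in script_texts:
        try:
            data = json.loads(text)
        except (json.JSONDecodeError, TypeError):
            continue  # skip blocks that are empty or not valid JSON
        # A block may hold one object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for candidate in candidates:
            if isinstance(candidate, dict) and candidate.get("@type") == wanted_type:
                return candidate
    return None


# Hypothetical script contents, as a selector's .getall() might return them.
blocks = [
    '{"@type": "BreadcrumbList"}',
    'not json at all',
    '{"@type": "Product", "offers": {"price": 1449.95}}',
]
product = find_json_ld(blocks)
```

In a spider, `script_texts` could come from `response.xpath('//script[@type="application/ld+json"]/text()').getall()`. This sketch does not handle an "@type" that is itself a list.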