【Question title】: How to extract data from JSON format in HTML using BeautifulSoup
【Posted】: 2021-04-24 04:20:43
【Question】:

I am trying to use BeautifulSoup to extract data from JSON embedded in HTML, as shown below.

<script type="application/ld+json">{
  "@context": "http://schema.org",
  "@type": "Movie",
  "url": "/title/tt1825683/",
  "name": "Black Panther",
  "image": "https://m.media-amazon.com/images/M/MV5BMTg1MTY2MjYzNV5BMl5BanBnXkFtZTgwMTc4NTMwNDI@._V1_.jpg",
  "genre": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "contentRating": "PG-13",
  "actor": [
    {
      "@type": "Person",
      "url": "/name/nm1569276/",
      "name": "Chadwick Boseman"
    },
    {
      "@type": "Person",
      "url": "/name/nm0430107/",
      "name": "Michael B. Jordan"
    },
    {
      "@type": "Person",
      "url": "/name/nm2143282/",
      "name": "Lupita Nyong\u0027o"
    },
    {
      "@type": "Person",
      "url": "/name/nm1775091/",
      "name": "Danai Gurira"
    }
  ],
  "director": {
    "@type": "Person",
    "url": "/name/nm3363032/",
    "name": "Ryan Coogler"
  }
  
}</script>

I have got as far as extracting the whole JSON block, but how can I get specific attributes of the data?

soup_url = BeautifulSoup(url, 'html.parser')
url_info = soup_url.find_all("script",type="application/ld+json")

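【Comments】:

  • Loop over the elements returned by find_all, get each element's text, and call json.loads() on it.
  • That returns a dictionary, which you can then access like any other Python dictionary.
  • What problem are you running into?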
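
The steps in the comments above can be sketched roughly as follows. This is a minimal illustration, not the only way to do it: `sample_html` is a shortened, hypothetical stand-in for the page in the question, and the variable names are made up for the example. It reads each script tag's text via `.string`, parses it with `json.loads()`, and then pulls out top-level and nested attributes with ordinary dictionary and list indexing.

```python
import json
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-in for the page in the question.
sample_html = """<script type="application/ld+json">{
  "@type": "Movie",
  "name": "Black Panther",
  "genre": ["Action", "Adventure", "Sci-Fi"],
  "actor": [
    {"@type": "Person", "name": "Chadwick Boseman"},
    {"@type": "Person", "name": "Michael B. Jordan"}
  ],
  "director": {"@type": "Person", "name": "Ryan Coogler"}
}</script>"""

soup = BeautifulSoup(sample_html, "html.parser")

movies = []
for tag in soup.find_all("script", type="application/ld+json"):
    # .string is the tag's single text child, i.e. the raw JSON text.
    movies.append(json.loads(tag.string))

movie = movies[0]
title = movie["name"]                                        # top-level string
genres = movie["genre"]                                      # list of strings
actor_names = [person["name"] for person in movie["actor"]]  # list of dicts
director_name = movie["director"]["name"]                    # nested dict
print(title, genres, actor_names, director_name)
```

Lists such as "actor" need a loop or comprehension, while single nested objects such as "director" can be indexed directly.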

Tags: python json web-scraping beautifulsoup


【Solution 1】:

As Barma mentioned, use json.loads(). Instead of .text, pass it .contents[0], which is the script tag's raw JSON string, to get the dictionary.

for script in soup_url.find_all("script", type="application/ld+json"):
    res_dict = json.loads(script.contents[0])
    print(res_dict['name'])

Example

import json
from bs4 import BeautifulSoup

html='''<script type="application/ld+json">{
  "@context": "http://schema.org",
  "@type": "Movie",
  "url": "/title/tt1825683/",
  "name": "Black Panther",
  "image": "https://m.media-amazon.com/images/M/MV5BMTg1MTY2MjYzNV5BMl5BanBnXkFtZTgwMTc4NTMwNDI@._V1_.jpg",
  "genre": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "contentRating": "PG-13",
  "actor": [
    {
      "@type": "Person",
      "url": "/name/nm1569276/",
      "name": "Chadwick Boseman"
    },
    {
      "@type": "Person",
      "url": "/name/nm0430107/",
      "name": "Michael B. Jordan"
    },
    {
      "@type": "Person",
      "url": "/name/nm2143282/",
      "name": "Lupita Nyong\u0027o"
    },
    {
      "@type": "Person",
      "url": "/name/nm1775091/",
      "name": "Danai Gurira"
    }
  ],
  "director": {
    "@type": "Person",
    "url": "/name/nm3363032/",
    "name": "Ryan Coogler"
  }
  
}</script>'''

soup_url = BeautifulSoup(html, 'html.parser')

for script in soup_url.find_all("script",type="application/ld+json"):
    res_dict = json.loads(script.contents[0])
    print(res_dict['name'])
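
On real pages the script blocks are not always as clean as in the example above. The following is a hedged sketch of a more defensive variant; the helper name `load_json_ld` and the `page` sample are hypothetical and not part of the original answer. It skips empty tags and blocks that are not valid JSON (for example one with a trailing comma), and uses `dict.get()` so that a missing attribute yields None instead of raising KeyError.

```python
import json
from bs4 import BeautifulSoup


def load_json_ld(html):
    """Return every JSON-LD object in html that parses, skipping the rest."""
    soup = BeautifulSoup(html, "html.parser")
    found = []
    for tag in soup.find_all("script", type="application/ld+json"):
        raw = tag.string  # None when the tag is empty
        if not raw:
            continue
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block, e.g. a trailing comma
        # A block may hold a single object or a list of objects.
        found.extend(data if isinstance(data, list) else [data])
    return found


# Hypothetical page: one malformed block, one empty block, one valid block.
page = """
<script type="application/ld+json">{"@type": "Movie", "name": "Black Panther",}</script>
<script type="application/ld+json"></script>
<script type="application/ld+json">{"@type": "Movie", "name": "Black Panther",
  "director": {"@type": "Person", "name": "Ryan Coogler"}}</script>
"""

items = load_json_ld(page)
movie = items[0]
director = movie.get("director", {}).get("name")
rating = movie.get("contentRating")  # absent in this sample
print(len(items), director, rating)
```

Whether to skip malformed blocks silently or to log them is a design choice; skipping keeps the sketch short.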
