How to extract data from JSON embedded in HTML using BeautifulSoup (answers)

【Question title】: How to extract data from JSON format in HTML using BeautifulSoup
【Posted】: 2021-04-24 04:20:43
【Question description】:

I am trying to use BeautifulSoup to extract data from JSON embedded in HTML, as shown below.

<script type="application/ld+json">{
  "@context": "http://schema.org",
  "@type": "Movie",
  "url": "/title/tt1825683/",
  "name": "Black Panther",
  "image": "https://m.media-amazon.com/images/M/MV5BMTg1MTY2MjYzNV5BMl5BanBnXkFtZTgwMTc4NTMwNDI@._V1_.jpg",
  "genre": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "contentRating": "PG-13",
  "actor": [
    {
      "@type": "Person",
      "url": "/name/nm1569276/",
      "name": "Chadwick Boseman"
    },
    {
      "@type": "Person",
      "url": "/name/nm0430107/",
      "name": "Michael B. Jordan"
    },
    {
      "@type": "Person",
      "url": "/name/nm2143282/",
      "name": "Lupita Nyong\u0027o"
    },
    {
      "@type": "Person",
      "url": "/name/nm1775091/",
      "name": "Danai Gurira"
    }
  ],
  "director": {
    "@type": "Person",
    "url": "/name/nm3363032/",
    "name": "Ryan Coogler"
  },
  
}</script>

I have got as far as extracting the whole JSON block, but how can I get specific attributes of the data?

soup_url = BeautifulSoup(url, 'html.parser')
url_info = soup_url.find_all("script",type="application/ld+json")

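【Comments】:

Loop over the elements returned by find_all, get the text of each element, and call json.loads() on it.
That returns a dictionary, which you can then access like any other Python dictionary.
What problems are you running into?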
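The approach described in the comments can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the HTML is a shortened copy of the block from the question, the variable names are made up for the example, and `script.string` is used here as one way to read the tag's text.

```python
import json
from bs4 import BeautifulSoup

# Shortened copy of the JSON-LD block from the question (two actors only).
html = '''<script type="application/ld+json">{
  "@type": "Movie",
  "name": "Black Panther",
  "genre": ["Action", "Adventure", "Sci-Fi"],
  "actor": [
    {"@type": "Person", "name": "Chadwick Boseman"},
    {"@type": "Person", "name": "Michael B. Jordan"}
  ],
  "director": {"@type": "Person", "name": "Ryan Coogler"}
}</script>'''

soup = BeautifulSoup(html, "html.parser")
script = soup.find("script", type="application/ld+json")

# json.loads turns the script text into ordinary Python dicts and lists.
data = json.loads(script.string)

title = data["name"]                                # top-level string
genres = data["genre"]                              # list of strings
director = data["director"]["name"]                 # nested object
actor_names = [a["name"] for a in data["actor"]]    # list of objects

print(title, genres, director, actor_names)
```

Once the JSON is parsed, nested objects such as `director` are plain dictionaries and arrays such as `actor` are plain lists, so normal indexing and list comprehensions apply.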

Tags: python json web-scraping beautifulsoup

【Solution 1】:

As Barma mentioned, use json.loads(). To get the text that holds the dictionary, use .contents[0] rather than .text.

# Parse each JSON-LD script block into a dict, then read a key from it
for script in soup_url.find_all("script", type="application/ld+json"):
    res_dict = json.loads(script.contents[0])
    print(res_dict['name'])

Example

import json
from bs4 import BeautifulSoup

html='''<script type="application/ld+json">{
  "@context": "http://schema.org",
  "@type": "Movie",
  "url": "/title/tt1825683/",
  "name": "Black Panther",
  "image": "https://m.media-amazon.com/images/M/MV5BMTg1MTY2MjYzNV5BMl5BanBnXkFtZTgwMTc4NTMwNDI@._V1_.jpg",
  "genre": [
    "Action",
    "Adventure",
    "Sci-Fi"
  ],
  "contentRating": "PG-13",
  "actor": [
    {
      "@type": "Person",
      "url": "/name/nm1569276/",
      "name": "Chadwick Boseman"
    },
    {
      "@type": "Person",
      "url": "/name/nm0430107/",
      "name": "Michael B. Jordan"
    },
    {
      "@type": "Person",
      "url": "/name/nm2143282/",
      "name": "Lupita Nyong\u0027o"
    },
    {
      "@type": "Person",
      "url": "/name/nm1775091/",
      "name": "Danai Gurira"
    }
  ],
  "director": {
    "@type": "Person",
    "url": "/name/nm3363032/",
    "name": "Ryan Coogler"
  }
  
}</script>'''

soup_url = BeautifulSoup(html, 'html.parser')

for script in soup_url.find_all("script",type="application/ld+json"):
    res_dict = json.loads(script.contents[0])
    print(res_dict['name'])
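On real pages a script block may be empty or contain JSON that does not parse. For instance, the snippet in the question ends with a trailing comma after "director", which json.loads() would reject. A more defensive variant might look like the sketch below. The helper name `extract_ld_json` and the sample HTML are hypothetical, introduced only for illustration, and silently skipping bad blocks is just one possible policy.

```python
import json
from bs4 import BeautifulSoup


def extract_ld_json(html):
    """Return a list of parsed JSON-LD objects found in html (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for script in soup.find_all("script", type="application/ld+json"):
        raw = script.string
        # An empty <script> tag has no string content at all.
        if not raw or not raw.strip():
            continue
        try:
            results.append(json.loads(raw))
        except json.JSONDecodeError:
            # Skip blocks that are not valid JSON (e.g. a trailing comma).
            continue
    return results


# Made-up sample: one valid block, one with a trailing comma, one empty.
sample = '''
<script type="application/ld+json">{"@type": "Movie", "name": "Black Panther"}</script>
<script type="application/ld+json">{"name": "broken",}</script>
<script type="application/ld+json"></script>
'''

movies = extract_ld_json(sample)
print([m.get("name") for m in movies])
```

Using `dict.get` when reading keys avoids a KeyError if a page omits a field; whether to skip or report malformed blocks depends on the use case.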
