Parsing json text inside a script tag using BeautifulSoup: answers

【Question Title】: Parsing json text inside a script tag using BeautifulSoup
【Posted】: 2019-03-06 17:52:01
【Question Description】:

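I'm trying to use BeautifulSoup in Python 3 to extract the text of an element inside the @context of ('script', type='application/ld+json').

The page contains multiple such scripts, and I want to get one specific feature listed in their JSON.

I tried this code: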

data = soup.find_all('script', type='application/ld+json')
print(data)

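This gives me the full contents of every script, but I want one specific feature from the context of each script.

Example of a feature:

{"name":"test","telephone":"600.212.0000","url":"https://test.com/test"}

For this example, I want to get the "url" part.

Does anyone know how to do this in Python?

Thanks a lot for your help.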

【Question Discussion】:

Tags: python json web-scraping beautifulsoup findall

【Solution 1】:

You can use a list comprehension with get():

data = soup.find_all('script', type='application/ld+json')

urls = [i.get('url') for i in data]
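One thing worth checking with this approach: on a BeautifulSoup tag, get() looks up HTML attributes of the element, not keys inside the JSON text it contains. The minimal sketch below illustrates that, assuming a made-up one-script page (the `html` string is hypothetical and not from the question's site).

```python
from bs4 import BeautifulSoup

# Hypothetical sample page, made up for illustration only.
html = '''
<script type="application/ld+json">
{"name":"test","telephone":"600.212.0000","url":"https://test.com/test"}
</script>
'''

soup = BeautifulSoup(html, 'html.parser')
tag = soup.find('script', type='application/ld+json')

# Tag.get() reads attributes of the <script> element itself.
attr_type = tag.get('type')  # the tag has a type attribute
attr_url = tag.get('url')    # the tag has no url attribute, so this is None

print(attr_type)
print(attr_url)
```

So to reach "url", the text inside the tag most likely has to be parsed as JSON first.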

【Discussion】:

【Solution 2】:

Since your feature is a dict, you can try the following:

feature = {"name":"test","telephone":"600.212.0000","url":"https://test.com/test"}
print(feature["url"])

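【Discussion】:

【Solution 3】:

What's missing from the other answers is converting the content extracted from the script tag to JSON (we can use the json library), and then picking the field we're interested in from the resulting dictionary.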

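import json

import requests
from bs4 import BeautifulSoup

# Download the page; replace "YOUR_URL" with the page you want to scrape.
src = requests.get("YOUR_URL").content
soup = BeautifulSoup(src, 'html.parser')

# find() returns only the first matching <script> tag.
res = soup.find('script', type='application/ld+json')

# The tag's content is a JSON string; parse it into a dict before indexing.
json_object = json.loads(res.contents[0])
print(json_object['url'])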
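Since the question mentions several such scripts on one page, the same idea can be extended with find_all(). This is only a sketch under some assumptions: `extract_field` is a hypothetical helper name, `sample_html` is made-up markup, and scripts that are empty, invalid JSON, or missing the key are simply skipped.

```python
import json
from bs4 import BeautifulSoup


def extract_field(html, key):
    """Collect `key` from every application/ld+json script in `html`."""
    soup = BeautifulSoup(html, 'html.parser')
    values = []
    for tag in soup.find_all('script', type='application/ld+json'):
        text = tag.string
        if not text:
            continue  # empty script tag
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue  # skip scripts whose content is not valid JSON
        # A JSON-LD block may hold a single object or a list of objects.
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict) and key in item:
                values.append(item[key])
    return values


# Hypothetical sample markup, made up for illustration only.
sample_html = '''
<script type="application/ld+json">{"name":"test","telephone":"600.212.0000","url":"https://test.com/test"}</script>
<script type="application/ld+json">{"name":"no url here"}</script>
<script type="text/javascript">var x = 1;</script>
'''

urls = extract_field(sample_html, 'url')
print(urls)
```

Only top-level keys are checked here; nested objects would need their own lookup.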

【Discussion】: