How to parse tag attributes with BeautifulSoup4? Answers

【Question title】: How to parse tag attributes with BeautifulSoup4?
【Posted】: 2021-11-18 13:40:03
【Question】:

I'd like to know how to use BeautifulSoup to parse a tag attribute written in this (JavaScript?) style in the following HTML:

<div class="class1" data-prop="{personName: 'Claudia', personCode:'123456'}">
...
</div>

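At the moment I just follow the standard process until I reach the attribute's content, which I currently parse with a regular expression, but I'd like to know whether there is a better/faster/more elegant option: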

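from bs4 import BeautifulSoup

# data holds the HTML document as a string
soup = BeautifulSoup(data, 'html.parser')
class_element = soup.find("div", class_="class1")
# A hyphen is not valid in a Python identifier, so the variable is data_props
data_props = class_element['data-prop']
# Parsing using regexp goes here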

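【Comments】:

import json; data = json.loads(data_props)?
@buran Oh, sorry, now I see why
Yes, that doesn't work. I've already tried it.
OP, can you specify what you parse with the regular expression, or how? Might you be interested in another way of parsing that uses a regular expression?
Possible duplicate: stackoverflow.com/questions/69284422/…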

Tags: javascript python html beautifulsoup

【Solution 1】:

I wouldn't say this is faster than a regular expression, but it may take fewer lines of code:
Turn the string into a Python dict

data_props = "{personName: 'Claudia', personCode:'123456'}"

data_as_dict_str = "dict(" + data_props[1:-1].replace(":", "=") + ")"

print(eval(data_as_dict_str))
# {'personName': 'Claudia', 'personCode': '123456'}

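Warning: if the attribute contains malicious Python code, eval will execute it!
Nor can we use the safe ast.literal_eval here, because it does not allow calling the name dict.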
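
To make the point above concrete, here is a small sketch of how the safe parsers react to these strings. The helper try_parse and the variable names are made up for illustration; only ast.literal_eval and json.loads come from the standard library.

```python
import ast
import json

# The attribute value from the question, and the dict(...) form built above
raw = "{personName: 'Claudia', personCode:'123456'}"
call_form = "dict(personName='Claudia', personCode='123456')"


def try_parse(parser, text):
    """Return the parsed value, or the name of the exception that was raised."""
    try:
        return parser(text)
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        return type(exc).__name__


results = {
    # Function calls are not literals, so literal_eval refuses them
    "literal_eval(call_form)": try_parse(ast.literal_eval, call_form),
    # Bare names such as personName are not literals either
    "literal_eval(raw)": try_parse(ast.literal_eval, raw),
    # JSON requires double-quoted property names
    "json.loads(raw)": try_parse(json.loads, raw),
}

for label, outcome in results.items():
    print(label, "->", outcome)
```

All three attempts are rejected, which is why the string has to be rewritten (or parsed some other way) before a safe parser can be used.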

If we want to use ast.literal_eval or json, we first need to rewrite the string so that all names are quoted, and at that point it is easier to just use a regular expression:

import re

pattern = re.compile(r"(\b\w+\b)\s*:\s*'([^']+)'")

data_props = "{personName: 'Claudia', personCode:'123456'}"

print(dict(pattern.findall(data_props)))
# {'personName': 'Claudia', 'personCode': '123456'}
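
For comparison, the "quote the names first" route could look roughly like the sketch below. The function parse_props and the KEY pattern are hypothetical names, and the approach assumes keys are plain identifiers and that no value contains text resembling `, name:`; it is not a general JavaScript object parser.

```python
import ast
import re

# A bare key is a word that directly follows "{" or "," and precedes ":"
KEY = re.compile(r"([{,]\s*)(\w+)\s*:")


def parse_props(text):
    """Quote the bare keys, then let ast.literal_eval do the safe parsing."""
    quoted = KEY.sub(r"\1'\2':", text)
    return ast.literal_eval(quoted)


data_props = "{personName: 'Claudia', personCode:'123456'}"
parsed = parse_props(data_props)
print(parsed)
# {'personName': 'Claudia', 'personCode': '123456'}
```

Unlike the findall approach, this also keeps non-string literal values such as numbers, at the cost of an extra rewriting step.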

【Comments】:

@fazineroso If this answer helped you or you liked it, please don't forget to vote up and mark the answer as a solution, I would appreciate it.

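【Solution 2】:

If the problem is only the unquoted keys, you can use the hjson library. I don't know how efficient it is under the hood, or whether its parser relies on regular expressions, but at the top level it is nice and simple to use: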

import hjson

# soup is the BeautifulSoup object built in the question
data = hjson.loads(soup.select_one('.class1')['data-prop'])
