[Posted]: 2018-07-08 14:26:52
[Question]:
The problem
I have the following Page01.htm:
<!DOCTYPE html><html lang="it-IT"><head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=Edge"> <head><title>Title here</title></head>
<body>
<script id="TargetID" type="application/json"><![CDATA[
{ "name":"Kate", "age":22, "city":"Boston"}
]]>
</script><script id="AnotherID" type="application/json"><![CDATA[{ "name":"John", "age":31, "city":"New York"}]]>
</script>
</body></html>
I want to extract the information in the JSON between the script tags with ID=TargetID.
What I did
I wrote the following Python 3.6 code:
from bs4 import BeautifulSoup
import codecs

page_path = "/Users/me/Page01.htm"
page = codecs.open(page_path, "r", "utf-8")
soup = BeautifulSoup(page.read(), "lxml")
# find_all returns a list-like result set, not a single tag
vegas = soup.find_all(id="TargetID")
invalid_tags = ['script']
# str(vegas) is the text form of that list, including the surrounding [ ]
soup = BeautifulSoup(str(vegas), "lxml")
for tag in invalid_tags:
    for match in soup.findAll(tag):
        # unwrap the tag but keep its children
        match.replaceWithChildren()
JsonZ = str(soup)
Now, if I look inside the vegas variable, I can see:
[<script id="TargetID" type="application/json"><![CDATA[ { "name":"Kate", "age":22, "city":"Boston"} ]]> </script>]
But if I try to remove the script tags (using the script from this answer), I get the following JsonZ variable:
'<html><body><p>[<![CDATA[\n{ "name":"Kate", "age":22, "city":"Boston"}\n]]>\n]</p></body></html>'
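The script tags are gone, but there are three other tags (<html><body><p>) that are completely useless, and the list brackets and the CDATA wrapper are still there.
My goal is to get the string { "name":"Kate", "age":22, "city":"Boston"} so that I can load it with the Python json module.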
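One possible way to reach that goal, shown here only as a minimal sketch, is to skip the second parse entirely, take the raw text of the script tag, strip the CDATA wrapper, and pass the remainder to json.loads. The helper name script_text_to_json and the CDATA_RE pattern are hypothetical, not part of BeautifulSoup. The raw string below is hard-coded to match the content shown in JsonZ; in the real script it is assumed to come from something like soup.find(id="TargetID").string, which should return the script body as text, CDATA markers included.

```python
import json
import re

# Hypothetical pattern: an optional CDATA wrapper around the whole payload.
CDATA_RE = re.compile(r"^\s*<!\[CDATA\[(.*?)\]\]>\s*$", re.DOTALL)


def script_text_to_json(raw):
    """Strip a surrounding <![CDATA[ ... ]]> wrapper, if any, and parse the rest as JSON."""
    match = CDATA_RE.match(raw)
    payload = match.group(1) if match else raw
    return json.loads(payload)


# Assumption: this is the text content of the <script id="TargetID"> tag.
raw = '<![CDATA[\n{ "name":"Kate", "age":22, "city":"Boston"}\n]]>\n'
data = script_text_to_json(raw)
print(data["name"], data["age"], data["city"])
```

Working on the tag's own text avoids the extra <html><body><p> wrapper, which appears to come from re-parsing the string form of the result list as a new document.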
[Discussion]:
Tags: python json python-3.x beautifulsoup cdata