【Question title】:Why am I unable to parse XML using BeautifulSoup?
【Posted】:2020-04-18 21:54:41
【Question】:

I am using BeautifulSoup to parse my XML document. However, the standard calls that work for HTML (for example the soup.find_all() method) do not work for XML. Why is that?

from bs4 import BeautifulSoup

# Read the whole file; the context manager closes it afterwards
with open("locations.xml", "r") as file:
    file_contents = file.read()

soup = BeautifulSoup(file_contents, 'lxml')
elements = soup.find_all('image')        # gives an empty list
print(soup.tag)     # prints my XML document

<image>
  <imageName>ryoungt_05.08.2002/aPICT0007.JPG</imageName>
  <resolution x="1280" y="960" />
  <taggedRectangles>
    <taggedRectangle x="322.0" y="806.0" width="228.0" height="122.0" offset="0.0" rotation="0.0" userName="admin" />
    <taggedRectangle x="427.0" y="452.0" width="259.0" height="55.0" offset="0.0" rotation="0.0" userName="admin" />
    <taggedRectangle x="722.0" y="721.0" width="67.0" height="77.0" offset="0.0" rotation="0.0" userName="admin" />
    <taggedRectangle x="355.0" y="549.0" width="383.0" height="88.0" offset="0.0" rotation="0.0" userName="admin" />
    <taggedRectangle x="317.0" y="706.0" width="380.0" height="118.0" offset="0.0" rotation="0.0" userName="admin" />
  </taggedRectangles>
</image>
<image>
  <imageName>ryoungt_05.08.2002/aPICT0010.JPG</imageName>
  <resolution x="1280" y="960" />
  <taggedRectangles>
    <taggedRectangle x="594.0" y="663.0" width="351.0" height="84.0" offset="0.0" rotation="0.0" userName="admin" />
    <taggedRectangle x="346.0" y="792.0" width="206.0" height="72.0" offset="0.0" rotation="0.0" userName="admin" />
    <taggedRectangle x="310.0" y="659.0" width="243.0" height="87.0" offset="0.0" rotation="0.0" userName="admin" />
    <taggedRectangle x="599.0" y="797.0" width="308.0" height="88.0" offset="0.0" rotation="0.0" userName="admin" />
  </taggedRectangles>
</image>

According to the BeautifulSoup documentation, everything should work once I have installed the lxml parser. But why is that so?

【Comments】:

  • This is not a well-formed XML document: you need a single root element. Also, why use an HTML soup parser for XML at all? Use lxml directly.
  • Your code works fine.
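
To illustrate the first comment, here is a minimal sketch of the single-root requirement. It uses the standard library's xml.etree.ElementTree rather than lxml (a swapped-in strict XML parser, chosen only so the snippet needs no third-party install), and the wrapper name `images` is a hypothetical choice, not something from the original file.

```python
import xml.etree.ElementTree as ET

# Two sibling <image> elements with no enclosing root, as in the question
# (trimmed to the imageName child for brevity).
fragments = """<image>
  <imageName>ryoungt_05.08.2002/aPICT0007.JPG</imageName>
</image>
<image>
  <imageName>ryoungt_05.08.2002/aPICT0010.JPG</imageName>
</image>"""

# A strict XML parser rejects input that has more than one top-level element.
try:
    ET.fromstring(fragments)
    strict_parse_ok = True
except ET.ParseError:
    strict_parse_ok = False

# Wrapping the fragments in a single (hypothetical) <images> root makes the
# input well-formed, so the same parser accepts it.
root = ET.fromstring(f"<images>{fragments}</images>")
image_names = [img.findtext("imageName") for img in root.findall("image")]

print(strict_parse_ok)
print(image_names)
```

Lenient HTML parsers tolerate the missing root, which is one reason the same text can behave differently depending on the parser it is handed to.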

Tags: python xml beautifulsoup


【Solution 1】:

I am not sure what your XML file actually contains, but for the content you posted, your code works:

from bs4 import BeautifulSoup
file_contents = '''<image>
  <imageName>ryoungt_05.08.2002/aPICT0007.JPG</imageName>
  <resolution x="1280" y="960" />
  <taggedRectangles>
    <taggedRectangle x="322.0" y="806.0" width="228.0" height="122.0" offset="0.0" rotation="0.0" userName="admin" />
    <taggedRectangle x="427.0" y="452.0" width="259.0" height="55.0" offset="0.0" rotation="0.0" userName="admin" />
    <taggedRectangle x="722.0" y="721.0" width="67.0" height="77.0" offset="0.0" rotation="0.0" userName="admin" />
    <taggedRectangle x="355.0" y="549.0" width="383.0" height="88.0" offset="0.0" rotation="0.0" userName="admin" />
    <taggedRectangle x="317.0" y="706.0" width="380.0" height="118.0" offset="0.0" rotation="0.0" userName="admin" />
  </taggedRectangles>
</image>
<image>
  <imageName>ryoungt_05.08.2002/aPICT0010.JPG</imageName>
  <resolution x="1280" y="960" />
  <taggedRectangles>
    <taggedRectangle x="594.0" y="663.0" width="351.0" height="84.0" offset="0.0" rotation="0.0" userName="admin" />
    <taggedRectangle x="346.0" y="792.0" width="206.0" height="72.0" offset="0.0" rotation="0.0" userName="admin" />
    <taggedRectangle x="310.0" y="659.0" width="243.0" height="87.0" offset="0.0" rotation="0.0" userName="admin" />
    <taggedRectangle x="599.0" y="797.0" width="308.0" height="88.0" offset="0.0" rotation="0.0" userName="admin" />
  </taggedRectangles>
</image>'''
soup = BeautifulSoup(file_contents, 'lxml')
elements = soup.find_all('image')        # returns both <image> elements here
print(elements)

Output:

[<image>
<imagename>ryoungt_05.08.2002/aPICT0007.JPG</imagename>
<resolution x="1280" y="960"></resolution>
<taggedrectangles>
<taggedrectangle height="122.0" offset="0.0" rotation="0.0" username="admin" width="228.0" x="322.0" y="806.0"></taggedrectangle>
<taggedrectangle height="55.0" offset="0.0" rotation="0.0" username="admin" width="259.0" x="427.0" y="452.0"></taggedrectangle>
<taggedrectangle height="77.0" offset="0.0" rotation="0.0" username="admin" width="67.0" x="722.0" y="721.0"></taggedrectangle>
<taggedrectangle height="88.0" offset="0.0" rotation="0.0" username="admin" width="383.0" x="355.0" y="549.0"></taggedrectangle>
<taggedrectangle height="118.0" offset="0.0" rotation="0.0" username="admin" width="380.0" x="317.0" y="706.0"></taggedrectangle>
</taggedrectangles>
</image>, <image>
<imagename>ryoungt_05.08.2002/aPICT0010.JPG</imagename>
<resolution x="1280" y="960"></resolution>
<taggedrectangles>
<taggedrectangle height="84.0" offset="0.0" rotation="0.0" username="admin" width="351.0" x="594.0" y="663.0"></taggedrectangle>
<taggedrectangle height="72.0" offset="0.0" rotation="0.0" username="admin" width="206.0" x="346.0" y="792.0"></taggedrectangle>
<taggedrectangle height="87.0" offset="0.0" rotation="0.0" username="admin" width="243.0" x="310.0" y="659.0"></taggedrectangle>
<taggedrectangle height="88.0" offset="0.0" rotation="0.0" username="admin" width="308.0" x="599.0" y="797.0"></taggedrectangle>
</taggedrectangles>
</image>]
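
Note that the tag and attribute names in the output above are lowercased (imagename, taggedrectangle, username), because 'lxml' selects the HTML parser. The following is a small sketch of what that means for mixed-case lookups, assuming bs4 and lxml are installed; the sample is a trimmed copy of the posted data wrapped in a hypothetical `images` root so that it is well-formed for the XML parser.

```python
from bs4 import BeautifulSoup

# Trimmed sample wrapped in a hypothetical <images> root element.
xml_text = """<images>
  <image>
    <imageName>ryoungt_05.08.2002/aPICT0007.JPG</imageName>
    <resolution x="1280" y="960" />
    <taggedRectangles>
      <taggedRectangle x="322.0" y="806.0" width="228.0" height="122.0" userName="admin" />
    </taggedRectangles>
  </image>
</images>"""

html_soup = BeautifulSoup(xml_text, "lxml")  # lxml's HTML parser: names are lowercased
xml_soup = BeautifulSoup(xml_text, "xml")    # lxml's XML parser: case is preserved

# find_all() matches names case-sensitively, so the original mixed-case
# name finds nothing in the HTML-parsed tree.
html_mixed = html_soup.find_all("taggedRectangle")
html_lower = html_soup.find_all("taggedrectangle")
xml_mixed = xml_soup.find_all("taggedRectangle")

print(len(html_mixed), len(html_lower), len(xml_mixed))
print(xml_mixed[0]["userName"])
```

If a search on a mixed-case name such as taggedRectangle comes back empty, passing 'xml' instead of 'lxml' is worth trying before anything else.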

【Comments】:
