Unicode object error when parsing XML using BeautifulSoup

【Question title】: Unicode object error in parsing XML using BeautifulSoup
【Posted】: 2014-04-24 09:06:38
【Question】:

Parsing the content of the "name" tag in the XML output with BeautifulSoup raises the following error:

AttributeError: 'unicode' object has no attribute 'get_text'

XML output:

<show>
  <stud>
    <__readonly__>
      <TABLE_stud>
        <ROW_stud>
          <name>rice</name>
          <dept>chem</dept>
          .
          .
          .
        </ROW_stud>
      </TABLE_stud>
    </__readonly__>
  </stud>
</show>

However, if I access the content of other tags such as "dept", it seems to work fine.

stud_info = output_xml.find_all('row_stud')
for eachStud in range(len(stud_info)):

    print stud_info[eachStud].dept.get_text()   #Gives 'chem'
    print stud_info[eachStud].name.get_text()   #---Unicode Error---

Can any Python/BeautifulSoup expert help me resolve this? (I know BeautifulSoup is not the best fit for parsing XML, but let's just say I am compelled to use it.)

【Comments】:

for eachStud in range(len(stud_info)) is an anti-pattern; iterate directly over the elements of stud_info instead.

Tags: python xml unicode beautifulsoup

【Solution 1】:

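Tag.name is an attribute that holds the tag's own name; here its value is the string row_stud. A string has no get_text() method, which is why the AttributeError is raised.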

Attribute access on a tag is a shortcut for .find(attributename), but it only works when the API has no attribute of the same name. Use .find() instead:

print stud_info[eachStud].find('name').get_text()

You can loop directly over the stud_info result list; there is no need for range() here:

stud_info = output_xml.find_all('row_stud')
for eachStud in stud_info:
    print eachStud.dept.get_text()
    print eachStud.find('name').get_text()
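The difference between the two lookups can be checked in isolation. The following is a minimal sketch, assuming Python 3 with the bs4 package installed; the one-line sample document is a simplified stand-in for the XML in the question, and it is parsed with the standard-library-backed "html.parser", which lowercases tag names.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the XML in the question (an assumption for illustration).
source = "<TABLE_stud><ROW_stud><name>rice</name><dept>chem</dept></ROW_stud></TABLE_stud>"

# "html.parser" treats the input as HTML, so tag names are lowercased.
soup = BeautifulSoup(source, "html.parser")
row = soup.find("row_stud")

# .name is the Tag's own name (a plain string), not the <name> child element.
tag_name = row.name

# .find('name') looks up the child element, which does have get_text().
name_text = row.find("name").get_text()

# .dept does not clash with any Tag attribute, so the shortcut works.
dept_text = row.dept.get_text()

print(tag_name, name_text, dept_text)
```

Calling row.name.get_text() here would fail in the same way as in the question, because row.name is a string rather than a Tag.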

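I notice you are searching for the lowercase row_stud. If you are parsing XML with BeautifulSoup, make sure you have lxml installed and tell BeautifulSoup that you are handling XML, so that it does not HTML-ify (lowercase) your tags: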

soup = BeautifulSoup(source, 'xml')
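As a rough illustration of the effect, the sketch below assumes Python 3 with both bs4 and lxml installed; the sample string is a shortened, hypothetical version of the document in the question. With the 'xml' parser the original case of the tag names is kept, so the search has to use ROW_stud rather than row_stud.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the XML from the question.
source = "<show><ROW_stud><name>rice</name><dept>chem</dept></ROW_stud></show>"

# The 'xml' feature selects lxml's XML parser, which preserves tag-name case.
soup = BeautifulSoup(source, "xml")

rows = soup.find_all("ROW_stud")        # matches: case is preserved
lower_rows = soup.find_all("row_stud")  # no match: XML names are case-sensitive

names = [row.find("name").get_text() for row in rows]
print(names, len(lower_rows))
```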
