Python Beautifulsoup - get text from a tag and from the tag immediately below it

[Question title]: Python Beautifulsoup - get text from a tag and from the tag immediately below it
[Posted]: 2017-04-25 14:13:03
[Question description]:

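I have a very long file that reuses the same tags over and over. I need the text from an arbitrary number of tags of two types (though I don't need the text from every tag of those types).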

Here is a snippet of the XML file:

<key>category</key>
<string>Utilities</string>
<key>description</key>
<string></string>
<key>developer</key>
<string></string>
<key>display_name</key>
<string>PaperCut Client</string>
<key>icon_hash</key>
<string>0db77f1181a63838123e5b25607be0b9b7e32432d11ec3f370ddde1a7807f3fc</string>
<key>installer_item_hash</key>
<string>ebe1f3093bf20f0c6524e79005b37f932dcfe0166a0d740d985450e7a55f9ca0</string>
<key>installer_item_location</key>
<string>PCClient-13.5.dmg</string>
<key>installer_item_size</key>
<integer>45941</integer>
<key>installer_type</key>
<string>copy_from_dmg</string>
<key>installs</key>

What I need to extract is the text of a key tag, followed by the text of the string tag immediately after it:

<key>'identifier'</key>
<string>'desired text'</string>

I can return all of the display_name tags:

soup.findAll('key', string="display_name")

But this returns the tag together with the string "display_name". I only need "display_name" and the text from the following tag (the text of the "string" tag, e.g. "PaperCut Client"). How can I do this?

[Comments]:

Tags: python xml xml-parsing beautifulsoup

[Solution 1]:

from bs4 import BeautifulSoup

xml = '''
<key>category</key>
<string>Utilities</string>
<key>description</key>
<string></string>
<key>developer</key>
<string></string>
<key>display_name</key>
<string>PaperCut Client</string>
<key>icon_hash</key>
<string>0db77f1181a63838123e5b25607be0b9b7e32432d11ec3f370ddde1a7807f3fc</string>
<key>installer_item_hash</key>
<string>ebe1f3093bf20f0c6524e79005b37f932dcfe0166a0d740d985450e7a55f9ca0</string>
<key>installer_item_location</key>
<string>PCClient-13.5.dmg</string>
<key>installer_item_size</key>
<integer>45941</integer>
<key>installer_type</key>
<string>copy_from_dmg</string>
<key>installs</key>'''
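# The 'lxml' parser requires the lxml package to be installed.
soup = BeautifulSoup(xml, 'lxml')
# Match only the <key> tags whose text is exactly 'display_name'.
keys = soup.find_all('key', string='display_name')
for key in keys:
    # The first next_sibling is the newline text node between the tags;
    # the second is the <string> tag that holds the value.
    string = key.next_sibling.next_sibling
    print(key.text)
    print(string.text)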

Output:

display_name
PaperCut Client
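
A slightly more defensive variant of the loop above, as a sketch only: `find_next_sibling(True)` asks for the next sibling that is a tag, so it does not depend on exactly one newline sitting between `<key>` and its value. The shortened sample, the helper name `value_after`, and the choice of the built-in `html.parser` (so that lxml need not be installed) are assumptions made for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative sample in the same shape as the question's file.
sample = """
<key>developer</key>
<string></string>
<key>display_name</key>
<string>PaperCut Client</string>
<key>installer_item_size</key>
<integer>45941</integer>
"""

soup = BeautifulSoup(sample, "html.parser")


def value_after(key_tag):
    """Return the text of the first tag following key_tag, or None if there is none."""
    # True matches any tag, so whitespace text nodes are skipped.
    sibling = key_tag.find_next_sibling(True)
    return sibling.get_text() if sibling is not None else None


results = [
    (key.get_text(), value_after(key))
    for key in soup.find_all("key", string="display_name")
]

for name, value in results:
    print(name, value)
```

Because the helper takes whatever tag comes next, it would also pick up an `<integer>` value; pass `"string"` instead of `True` if only `<string>` values are wanted.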

[Comments]:

[Solution 2]:

If key and string always come in pairs and stay in the same order (which I assume they do, otherwise the whole XML file would end up a mess), you can do this:

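# Assumes every <key> is followed by exactly one <string>; a value of another
# type (such as the <integer> in the sample) shifts the pairing out of step.
for key_tag, string_tag in zip(soup.find_all('key'), soup.find_all('string')):
    print(key_tag.text, string_tag.text)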

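[Comments]:

Thanks. This code is useful; however, because there are so many string tags, it over-selects the text of string tags instead of targeting only the 'display_name' string.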
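
One possible way to address the over-selection raised in this comment, sketched under some assumptions: walk the `<key>` tags, keep only the names of interest, and pair each one with its own next sibling tag rather than zipping two independent lists. The `wanted` set and the shortened sample are hypothetical, and `html.parser` is used only so the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Shortened, illustrative sample; note the <integer> value in the middle.
sample = """
<key>display_name</key>
<string>PaperCut Client</string>
<key>installer_item_size</key>
<integer>45941</integer>
<key>installer_type</key>
<string>copy_from_dmg</string>
"""

wanted = {"display_name", "installer_type"}  # hypothetical selection of keys

soup = BeautifulSoup(sample, "html.parser")

pairs = []
for key in soup.find_all("key"):
    if key.get_text() not in wanted:
        continue
    # Take the tag that directly follows this key, whatever its type.
    value = key.find_next_sibling(True)
    # Skip keys with no value, or whose next tag is already another <key>.
    if value is None or value.name == "key":
        continue
    pairs.append((key.get_text(), value.get_text()))

print(pairs)
```

Since each value is looked up relative to its own key, the `<integer>` entry cannot shift later pairs out of alignment the way it can with `zip`.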