Get text between two closed tags XML - Python

【Question title】: Get text between two closed tags XML - Python
【Posted】: 2016-12-28 11:13:06
【Question】:

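I downloaded my Foursquare data, which comes in KML format. I am parsing it as an XML file with Python, but I can't figure out how to get the text between the closing </a> tag and the closing </description> tag. (This is the text I typed when checking in; in the example below it is "FINALLY HERE!! With Sonya and co", although there is also a leading hyphen.)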

Here is an example of what the data looks like.

<Placemark>
  <name>hummus grill</name>
  <description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
  <updated>Tue, 24 Jan 12 17:14:00 +0000</updated>
  <published>Tue, 24 Jan 12 17:14:00 +0000</published>
  <visibility>1</visibility>
  <Point>
    <extrude>1</extrude>
    <altitudeMode>relativeToGround</altitudeMode>
    <coordinates>-75.20104383595685,39.9528387056977</coordinates>
  </Point>
</Placemark>

So far I have been able to get the latitude/longitude, the published date, the name, and the link with code similar to this:

latitudes = []
longitudes = []

for d in dom.getElementsByTagName('coordinates'):
    #Break them up into latitude and longitude
    coords = d.firstChild.data.split(',')
    longitudes.append(float(coords[0]))
    latitudes.append(float(coords[1]))

I tried the following (the data also starts with the header shown below the code, which I haven't figured out how to handle yet):

for d in dom.getElementsByTagName('description'):
    description.append(d.firstChild.data.encode('utf-8'))

<?xml version="1.0" encoding="UTF-8"?>
<kml><Folder><name>foursquare checkin history </name><description>foursquare checkin history </description>:

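I then tried accessing it with d.firstChild.nextSibling.firstChild.data.encode('utf-8'), but that only gives me "hummus grill", which I assume is the text between the a tags (rather than the text from the name tag).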

【Question comments】:

Tags: python xml

【Solution 1】:

The following worked for me:

In [44]: description = []

In [45]: for d in dom.getElementsByTagName('description'):
   ....:     description.append(d.firstChild.nextSibling.nextSibling.data.encode('utf-8'))
   ....:     

In [46]: description
Out[46]: ['- FINALLY HERE!! With Sonya and co']
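
The chain firstChild.nextSibling.nextSibling works because the description element has three children: a text node ("@"), the a element, and a second text node holding the check-in comment. It raises an AttributeError, though, on a description that has no a child, such as the Folder-level header in the question. A minimal sketch of a more defensive loop follows, written for Python 3; the SAMPLE string and the helper name text_after_link are made up for illustration and are not part of the original answer.

```python
from xml.dom.minidom import parseString

# Hypothetical sample: the Folder-level description has no <a> child,
# while the Placemark description does.
SAMPLE = """<kml><Folder><name>foursquare checkin history </name><description>foursquare checkin history </description>
<Placemark>
  <name>hummus grill</name>
  <description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
</Placemark>
</Folder></kml>"""


def text_after_link(desc):
    """Return the text node that directly follows the <a> child, or None."""
    for child in desc.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == 'a':
            nxt = child.nextSibling
            if nxt is not None and nxt.nodeType == nxt.TEXT_NODE:
                return nxt.data
    return None


dom = parseString(SAMPLE)
descriptions = dom.getElementsByTagName('description')

# Inspect the children of the Placemark description (second match)
child_names = [c.nodeName for c in descriptions[1].childNodes]
print(child_names)

comments = []
for d in descriptions:
    comment = text_after_link(d)
    if comment is not None:  # skip descriptions without a link
        comments.append(comment)

print(comments)
```

Descriptions without a link are simply skipped here; whether that is the right handling for the header depends on what you want to keep.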

Or, if you want the entire text inside the description tag:

from xml.dom.minidom import parse, parseString

def getText(node, recursive=False):
    """
    Get all the text associated with this node.
    With recursive == True, text from child elements is retrieved too.
    """
    parts = []
    for n in node.childNodes:
        if n.nodeType in (n.TEXT_NODE, n.CDATA_SECTION_NODE):
            parts.append(n.data)
        elif recursive:
            # Descend into child elements such as <a>
            parts.append(getText(n, recursive=True))
    return ''.join(parts)

dom = parseString("""<Placemark>
  <name>hummus grill</name>
  <description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
  <updated>Tue, 24 Jan 12 17:14:00 +0000</updated>
  <published>Tue, 24 Jan 12 17:14:00 +0000</published>
  <visibility>1</visibility>
  <Point>
    <extrude>1</extrude>
    <altitudeMode>relativeToGround</altitudeMode>
    <coordinates>-75.20104383595685,39.9528387056977</coordinates>
  </Point>
</Placemark>""")

description = []

for d in dom.getElementsByTagName('description'):
    description.append(getText(d, recursive = True))

print(description)

This prints: ['@hummus grill- FINALLY HERE!! With Sonya and co'] (under Python 2 the string is shown with a u prefix).
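
As an alternative to minidom (not something the original answer uses), the standard library's xml.etree.ElementTree exposes the text that follows an element's closing tag as that element's tail attribute. The sketch below assumes Python 3 and a document without an XML namespace, as in the sample above; a real KML file may declare one, in which case the tag names would need the namespace prefix.

```python
import xml.etree.ElementTree as ET

KML = """<Placemark>
  <name>hummus grill</name>
  <description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>
</Placemark>"""

root = ET.fromstring(KML)

tails = []       # text after </a> only
full_texts = []  # all text inside <description>

for desc in root.iter('description'):
    link = desc.find('a')
    # .tail is the text between </a> and the next tag (here </description>)
    if link is not None and link.tail:
        tails.append(link.tail)
    # itertext() walks the element and its children in document order
    full_texts.append(''.join(desc.itertext()))

print(tails)
print(full_texts)
```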

【Comments】:

【Solution 2】:

Have you tried using substrings?

For example, suppose all of your XML is in a variable "foo".

foo = '<description>@<a href="https://foursquare.com/v/hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>- FINALLY HERE!! With Sonya and co</description>'

You can extract this data by printing the following.

foo[foo.index('</a>')+4:foo.index('</description>')]

This should give you what you want.

- FINALLY HERE!! With Sonya and co

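Read up on substrings and you will be able to manipulate text more easily.

【Comments】:

So would I need to convert the DOM element to a substring? Or are you suggesting a completely different route?
Yes. Putting the whole DOM element into a variable lets you easily go back and pick out certain parts. Substrings tend to be an easy way to parse text.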
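
One way to combine the two approaches, sketched here under the assumption that the data was parsed with xml.dom.minidom on Python 3, is to serialize each description element with toxml() and slice the resulting string. The helper name after_link is hypothetical. Note that toxml() re-escapes special characters (for example & becomes &amp;), so the sliced text can differ from the node's data for comments containing them.

```python
from xml.dom.minidom import parseString

dom = parseString(
    '<Placemark><description>@<a href="https://foursquare.com/v/'
    'hummus-grill/4aab4f71f964a520625920e3">hummus grill</a>'
    '- FINALLY HERE!! With Sonya and co</description></Placemark>'
)


def after_link(xml_text):
    """Slice out the text between '</a>' and '</description>', or None."""
    start = xml_text.find('</a>')
    end = xml_text.rfind('</description>')
    if start == -1 or end == -1:
        return None  # no link in this description
    return xml_text[start + len('</a>'):end]


# Serialize each <description> element back to a string, then slice it
snippets = [d.toxml() for d in dom.getElementsByTagName('description')]
results = [after_link(s) for s in snippets]

print(results)
```

Using find() instead of index() avoids a ValueError on descriptions that contain no link.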