【Question title】: Parse XML in Python without newlines
【Posted】: 2016-05-24 04:29:48
【Question】:

Here is the XML file: http://www.diveintopython3.net/examples/feed.xml

My Python code:

from lxml import etree
def lxml():
    tree = etree.parse('feed.xml')
    NSMAP = {"nn":"http://www.w3.org/2005/Atom"}
    test = tree.xpath('//nn:category[@term="html"]/..',namespaces=NSMAP)
    for elem in tree.iter():
        print(elem.tag,'\t',elem.attrib)
    print('-------------------------------')
    test1 = tree.xpath('//nn:category',namespaces=NSMAP)
    print('++++++++++++++++++++++++++++++++')
    for node in test1:
        test2 = node.xpath('./../nn:summary/text()',namespaces=NSMAP) # returns a list of text nodes
        print(test2) # a list has no .xpath() method, so print the selected text directly
    print('*****************************************')
    test3 = tree.xpath('//text()[normalize-space(.)]') # [normalize-space()] only removes leading and trailing whitespace
    print(test3)

The output is: ..

++++++++++++++++++++++++++++++++
['Putting an entire chapter on one page sounds\n    bloated, but consider this — my longest chapter so far\n    would be 75 printed pages, and it loads in under 5 seconds…\n    On dialup.']
['Putting an entire chapter on one page sounds\n    bloated, but consider this — my longest chapter so far\n    would be 75 printed pages, and it loads in under 5 seconds…\n    On dialup.']
['Putting an entire chapter on one page sounds\n    bloated, but consider this — my longest chapter so far\n    would be 75 printed pages, and it loads in under 5 seconds…\n    On dialup.']
['The accessibility orthodoxy does not permit people to\n      question the value of features that are rarely useful and rarely used.']
['These notes will eventually become part of a\n      tech talk on video encoding.']
['These notes will eventually become part of a\n      tech talk on video encoding.']
['These notes will eventually become part of a\n      tech talk on video encoding.']
['These notes will eventually become part of a\n      tech talk on video encoding.']
['These notes will eventually become part of a\n      tech talk on video encoding.']
['These notes will eventually become part of a\n      tech talk on video encoding.']
['These notes will eventually become part of a\n      tech talk on video encoding.']
['These notes will eventually become part of a\n      tech talk on video encoding.']
*****************************************
['\n  ', 'dive into mark', '\n  ', 'currently between addictions', '\n  ', 'tag:diveintomark.org,2001-07-29:/', '\n  ', '2009-03-27T21:56:07Z', '\n  ', '\n  ', '\n  ', '\n    ', '\n      ', 'Mark', '\n      ', 'http://diveintomark.org/', '\n    ', '\n    ', 'Dive into history, 2009 edition', '\n    ', '\n    ', 'tag:diveintomark.org,2009-03-27:/archives/20090327172042', '\n    ', '2009-03-27T21:56:07Z', '\n    ', '2009-03-27T17:20:42Z', '\n    ', '\n    ', '\n    ', '\n  ', 'Putting an entire chapter on one page sounds\n    bloated, but consider this — my longest chapter so far\n    would be 75 printed pages, and it loads in under 5 seconds…\n    On dialup.', '\n  ', '\n  ', '\n    ', '\n      ', 'Mark', '\n      ', 'http://diveintomark.org/', '\n    ', '\n    ', 'Accessibility is a harsh mistress', '\n    ', '\n    ', 'tag:diveintomark.org,2009-03-21:/archives/20090321200928', '\n    ', '2009-03-22T01:05:37Z', '\n    ', '2009-03-21T20:09:28Z', '\n    ', '\n    ', 'The accessibility orthodoxy does not permit people to\n      question the value of features that are rarely useful and rarely used.', '\n  ', '\n  ', '\n    ', '\n      ', 'Mark', '\n    ', '\n    ', 'A gentle introduction to video encoding, part 1: container formats', '\n    ', '\n    ', 'tag:diveintomark.org,2008-12-18:/archives/20081218155422', '\n    ', '2009-01-11T19:39:22Z', '\n    ', '2008-12-18T15:54:22Z', '\n    ', '\n    ', '\n    ', '\n    ', '\n    ', '\n    ', '\n    ', '\n    ', '\n    ', 'These notes will eventually become part of a\n      tech talk on video encoding.', '\n  ', '\n']..

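My question is: why are there so many '\n' strings, and how do I remove them?

A second question: how can I directly get the tag that holds a given piece of text, for example the node that contains "Mark" (a descendant of an entry)?

Thanks a lot.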

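【Comments】:

  • Please do not post code as an image. Post it as text and format it properly (highlight/select the text -> click {}). Thanks.
  • Fixed. Sorry, I am a beginner, so my style is not good. Thanks.

Tags: python xml parsing xpath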


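【Solution 1】:

"My question is why there are so many '\n' strings, and how do I remove them?"

Every piece of whitespace in the XML is selected by your XPath. Pretty-printed XML usually contains many newlines and spaces used for indentation. For example, in the XML below, //text() also selects two whitespace-only text nodes: one between <root> and <foo>, and another between </foo> and </root>: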

<root>
    <foo>bar</foo>
</root>

You can use //text()[normalize-space()] to avoid selecting whitespace-only text nodes in the first place.
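As a minimal sketch of the difference, the snippet below parses the small XML document shown above (not the real feed) with lxml and runs both expressions. The variable names are only illustrative.

```python
from lxml import etree

# Tiny stand-in for the feed, pretty-printed like the example above.
root = etree.fromstring("<root>\n    <foo>bar</foo>\n</root>")

# Plain //text() also returns the whitespace-only nodes around <foo>.
all_text = root.xpath("//text()")

# The predicate keeps only text nodes whose normalized value is non-empty.
non_empty = root.xpath("//text()[normalize-space()]")

print(all_text)
print(non_empty)
```

Note that the predicate only filters out whitespace-only nodes. A text node that is kept is returned unchanged, so newlines inside it are still present.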

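"A second question: how can I directly get the tag that holds a given piece of text, for example the node that contains "Mark" (a descendant of an entry)?"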

your_text_node.getparent().tag

The line above gets the parent element of the text node referenced by the variable your_text_node, then returns that element's tag name.
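A short sketch of how that might look end to end. The XML here is a hypothetical, trimmed-down entry that only reuses the Atom namespace from feed.xml, and the variable names are assumptions made for illustration.

```python
from lxml import etree

# Hypothetical, trimmed-down entry in the Atom namespace used by feed.xml.
xml = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    "<entry><author><name>Mark</name></author></entry>"
    "</feed>"
)
root = etree.fromstring(xml)

# Select the text node whose value is exactly "Mark".
text_node = root.xpath('//text()[. = "Mark"]')[0]
parent = text_node.getparent()

# Alternative: select the containing element directly.
elements = root.xpath('//*[text() = "Mark"]')

print(parent.tag)                     # Clark notation: {namespace}localname
print(etree.QName(parent).localname)  # local name without the namespace
```

Because the feed declares a default namespace, the tag comes back in {namespace}name form. etree.QName is one way to extract just the local name.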

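【Comments】:

    【Solution 2】:

    \n is an escape sequence; it represents a newline character.

    If you view the source of the page, you will see that "bloated" is at the start of a new line.

    To remove them, you can use str.replace() or re.sub().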
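A rough sketch of that clean-up, using a shortened copy of the first summary from the output above as sample input. Removing only "\n" with str.replace() would leave the indentation spaces behind, so this sketch collapses every run of whitespace instead.

```python
import re

# Shortened sample taken from the first summary in the output above.
summary = "Putting an entire chapter on one page sounds\n    bloated, but consider this"

# Option 1: collapse every run of whitespace (spaces, tabs, newlines) to one space.
collapsed = re.sub(r"\s+", " ", summary).strip()

# Option 2: same result without re: split on whitespace, then rejoin.
rejoined = " ".join(summary.split())

print(collapsed)
```

Either option can be applied to each string returned by the XPath query, for example in a list comprehension.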

    【Comments】: