【Question title】: XML parsing for this specific XML
【Posted】: 2015-10-19 14:19:40
【Question】:
    <instance id="activate.v.bnc.00024693" docsrc="BNC">
<answer instance="activate.v.bnc.00024693" senseid="38201"/>
<context>
Do you know what it is ,  and where I can get one ?  We suspect you had seen the Terrex Autospade ,  which is made by Wolf Tools .  It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly ,  you should n't have to bend your back during general digging ,  although it wo n't lift out the soil and put in a barrow if you need to move it !  If gardening tends to give you backache ,  remember to take plenty of rest periods during the day ,  and never try to lift more than you can easily cope with .  
</context>
</instance>

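I want to extract all of the text inside it. This is what I have so far. stuff.text only prints the text before <head></head> (i.e. "Do you know ... step on to"), but I don't know how to extract the second half after </head> (i.e. "it . Used correctly ... easily cope with .").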

import os
import xml.etree.ElementTree as et

tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()

for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))
        print(stuff.text)  # prints only the text before <head>

【Comments】:

    Tags: python xml


    【Solution 1】:

    If using BeautifulSoup is an option, this is trivial:

    import bs4
    xtxt = '''        <instance id="activate.v.bnc.00024693" docsrc="BNC">
        <answer instance="activate.v.bnc.00024693" senseid="38201"/>
        <context>
        Do you know what it is ,  and where I can get one ?  We suspect you had seen the Terrex Autospade ,  which is made by Wolf Tools .  It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly ,  you should n't have to bend your back during general digging ,  although it wo n't lift out the soil and put in a barrow if you need to move it !  If gardening tends to give you backache ,  remember to take plenty of rest periods during the day ,  and never try to lift more than you can easily cope with .  
        </context>
        </instance>'''
    soup = bs4.BeautifulSoup(xtxt, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
    print(soup.find('context').text)
    

    which gives:

    Do you know what it is ,  and where I can get one ?  We suspect you had
    seen the Terrex Autospade ,  which is made by Wolf Tools .  It is quite 
    a hefty spade , with bicycle - type handlebars and a sprung lever at the 
    rear , which you step on to activate it . Used correctly ,  you should n't 
    have to bend your back during general digging ,  although it wo n't lift 
    out the soil and put in a barrow if you need to move it !  If gardening 
    tends to give you backache ,  remember to take plenty of rest periods 
    during the day ,  and never try to lift more than you can easily cope 
    with .  
    

    If you prefer to use ElementTree, use itertext to collect all of the text, including the text inside and after child elements:

    import os
    import xml.etree.ElementTree as et

    tree = et.parse(os.getcwd() + "/../data/train.xml")
    instance = tree.getroot()

    for stuff in instance:
        if stuff.tag == "answer":
            print("the correct answer is %s" % stuff.get('senseid'))
        if stuff.tag == "context":
            # itertext() yields this element's text and every descendant's text and tail, in document order
            print(''.join(stuff.itertext()))
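
The reason `stuff.text` stops at `<head>` is ElementTree's text/tail model: an element's `.text` holds only the characters before its first child, and whatever follows a child's closing tag is stored on that child as `.tail`. The following is a minimal self-contained sketch; `xml_src` is a shortened stand-in for the sample made up for illustration, not the real `train.xml`:

```python
import xml.etree.ElementTree as et

# Shortened, hypothetical stand-in for the <instance> element in the question.
xml_src = (
    '<instance id="activate.v.bnc.00024693" docsrc="BNC">'
    '<answer instance="activate.v.bnc.00024693" senseid="38201"/>'
    '<context>which you step on to <head>activate</head> it . Used correctly</context>'
    '</instance>'
)

instance = et.fromstring(xml_src)
context = instance.find('context')
head = context.find('head')

before = context.text   # text before the first child element
inside = head.text      # text inside <head>
after = head.tail       # text after </head>, stored on the child, not the parent
full = ''.join(context.itertext())  # all pieces joined in document order

print(repr(before))
print(repr(inside))
print(repr(after))
print(full)
```

So `itertext()` is simply a convenient way to walk the `.text` and `.tail` pieces without stitching them together by hand.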
    

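    If you are sure your XML file is well-formed, ElementTree is enough: it is part of the Python standard library, so you have no external dependencies. If the XML is not well-formed, however, BeautifulSoup is good at recovering from small errors.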
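
To make that trade-off concrete: ElementTree rejects malformed input outright by raising `ParseError`. In the sketch below, `broken_xml` and the helper `extract_text` are hypothetical names introduced for illustration; a caller could treat the `None` result as the signal to fall back to a more lenient parser such as BeautifulSoup.

```python
import xml.etree.ElementTree as et

# Hypothetical malformed fragment: <head> is never closed.
broken_xml = '<context>step on to <head>activate it .</context>'

def extract_text(xml_string):
    """Return all text under the root element, or None if the XML is malformed."""
    try:
        root = et.fromstring(xml_string)
    except et.ParseError:
        # ElementTree does not attempt any repair; report failure to the caller.
        return None
    return ''.join(root.itertext())

good = extract_text('<context>step on to <head>activate</head> it .</context>')
bad = extract_text(broken_xml)

print(good)
print(bad)
```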

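    【Comments】:

      【Solution 2】:

      You can serialize the element. There are two options:

      • keep the inner <head></head> tags
      • return only the text, without any tags.

      If you serialize with tags, you can remove the outer <context></context> tags manually: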

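      # convert the element to a string; encoding='unicode' returns str rather than bytes on Python 3
      xml_str = et.tostring(stuff, encoding='unicode').strip()
      # remove the outer <context></context> tags by slicing
      # (lstrip/rstrip remove a set of characters, not a prefix or suffix)
      print(xml_str[len('<context>'):-len('</context>')])
      # read only the text, without any tags
      print(et.tostring(stuff, method='text', encoding='unicode'))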
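
Both serialization options can be tried on a small in-memory element. This is only a sketch: the one-line `<context>` string is a shortened, made-up version of the sample, and it assumes Python 3, where `encoding='unicode'` makes `tostring` return `str`. The last lines show why slicing is a safer way to drop the wrapper than `lstrip`, which removes any leading characters from the given set rather than a literal prefix.

```python
import xml.etree.ElementTree as et

# Shortened, hypothetical stand-in for the <context> element.
context = et.fromstring(
    '<context>step on to <head>activate</head> it .</context>'
)

# Option 1: keep the inner <head> tags and drop only the outer wrapper.
with_tags = et.tostring(context, encoding='unicode')
inner = with_tags[len('<context>'):-len('</context>')]

# Option 2: plain text with every tag removed.
text_only = et.tostring(context, method='text', encoding='unicode')

# Pitfall: lstrip treats its argument as a character set, so it can
# remove far more than the tag when the content starts with those letters.
mangled = '<context>text'.lstrip('<context>')

print(inner)
print(text_only)
print(repr(mangled))
```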
      

      【Comments】:
