【Question title】: How to iterate over one more node in XML with Python?
【Posted】: 2020-08-10 18:04:24
【Question description】:

My XML structure is as follows:

"""<?xml version="1.0" encoding="utf-8"?>
<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text size="12.482">T</text>
                <text size="12.482">O</text>
                <text size="12.482">L</text>
                <text size="12.482">O</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text></text>
            </new_line>
        </textbox>
    </page>
</pages>
"""

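I am iterating over the text elements that are children of the new_line element in order to join tags that have the same size attribute. However, I want to specify that the new_line element must be inside a textbox element, so I also want to iterate over textbox. I tried adding a for loop to my code, but it does not work at all. Here is the code: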

import lxml.etree as etree

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse('output22.xml', parser)
root = tree.getroot()

# Iterate over every new_line block
for new_line_block in tree.xpath('//new_line'):
    # Find all "text" elements in the new_line block
    list_text_elts = new_line_block.findall('text')

    # Iterate over them as (previous, current) pairs
    for previous_text, current_text in zip(list_text_elts[:-1], list_text_elts[1:]):
        # Get the size attributes
        prev_size = previous_text.attrib.get('size')
        curr_size = current_text.attrib.get('size')
        # If they are equal and not both None
        if curr_size == prev_size and curr_size is not None:
            # Get current and previous text
            pt = previous_text.text if previous_text.text is not None else ""
            ct = current_text.text if current_text.text is not None else ""
            # Prepend the previous text to the current element
            current_text.text = pt + ct
            # Remove the previous element
            previous_text.getparent().remove(previous_text)



newtree = etree.tostring(root, encoding='utf-8', pretty_print=True)
#newtree = newtree.decode("utf-8")
print(newtree)
with open("output2.xml", "wb") as f:
    f.write(newtree)

My expected output:

<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">API</text>
                <text size="12.482">TOLO</text>
                <text/>
                <text size="12.482">III</text>
                <text/>
            </new_line>
        </textbox>
    </page>
</pages>

Right now my code does not work: it joins one tag and then skips the next one. I think the problem is that textbox is not specified.
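To illustrate the restriction described above (new_line must sit inside textbox), here is a minimal sketch of the difference between selecting every new_line and selecting only those that are direct children of a textbox. The sample document and the variable names are hypothetical. It uses the standard-library xml.etree.ElementTree so that it runs without extra installs; with lxml, as in the question, the corresponding XPath would be tree.xpath('//textbox/new_line').

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one <new_line> inside <textbox>, one directly under <page>.
SAMPLE = """<pages>
  <page>
    <textbox>
      <new_line><text size="12.482">C</text></new_line>
    </textbox>
    <new_line><text size="10.0">x</text></new_line>
  </page>
</pages>"""

root = ET.fromstring(SAMPLE)

# Every <new_line>, wherever it appears in the tree
all_lines = root.findall('.//new_line')
# Only <new_line> elements that are direct children of a <textbox>
boxed_lines = root.findall('.//textbox/new_line')

print(len(all_lines), len(boxed_lines))
```

Putting the parent into the path avoids a second nested for loop over textbox: the path expression itself filters out new_line elements that live elsewhere.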

【Question discussion】:

Tags: python python-3.x xml lxml elementtree


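【Solution 1】:

Although your question is similar to the previous one, this time the problem is simpler and clearer. You can extract the data first and then assemble it into the format you want. Here is an example.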

from simplified_scrapy import SimplifiedDoc, req, utils
xml = """

<pages>
    <page>
        <textbox>
            <new_line>
                <text size="12.482">C</text>
                <text size="12.333">A</text>
                <text size="12.333">P</text>
                <text size="12.333">I</text>
                <text size="12.482">T</text>
                <text size="12.482">O</text>
                <text size="12.482">L</text>
                <text size="12.482">O</text>
                <text></text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text size="12.482">I</text>
                <text></text>
            </new_line>
        </textbox>
    </page>
</pages>
"""
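doc = SimplifiedDoc(xml)
new_line = doc.new_line
lastSize = None  # size of the run currently being collected
lst = []         # finished (size, text) runs and "<text />" markers
texts = ""       # text accumulated for the current run
for t in new_line.texts:
    if not lastSize or t.size==lastSize:
        # Same size as the current run (or no run yet): keep accumulating
        texts += t.text
        lastSize = t.size
    else:
        # Size changed: store the finished run and start a new one
        lst.append((lastSize,texts))
        texts = t.text
        if t.size:
            lastSize = t.size
        else:
            # An element without a size is recorded as an empty <text />
            lst.append("<text />")
            lastSize=None
# Note: a run still being collected when the loop ends is not appended;
# the sample ends with an empty <text>, so nothing is lost here.
print(lst)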

Result:

[('12.482', 'C'), ('12.333', 'API'), ('12.482', 'TOLO'), '<text />', ('12.482', 'III'), '<text />']
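For comparison, the same grouping can be sketched without a third-party package by modifying the tree in place. This is only one possible approach, not the answer author's method: it groups consecutive text siblings with itertools.groupby, keeps the first element of each run, and removes the rest. The function name merge_same_size is hypothetical, and the sketch uses the standard-library xml.etree.ElementTree; the findall, get, and remove calls used here are also available on lxml.etree elements.

```python
import xml.etree.ElementTree as ET
from itertools import groupby

XML = """<pages><page><textbox><new_line>
<text size="12.482">C</text>
<text size="12.333">A</text><text size="12.333">P</text><text size="12.333">I</text>
<text size="12.482">T</text><text size="12.482">O</text>
<text size="12.482">L</text><text size="12.482">O</text>
<text></text>
<text size="12.482">I</text><text size="12.482">I</text><text size="12.482">I</text>
<text></text>
</new_line></textbox></page></pages>"""


def merge_same_size(root):
    """Merge runs of consecutive <text> siblings that share a size attribute."""
    # Only <new_line> elements that sit directly inside a <textbox>
    for new_line in root.findall('.//textbox/new_line'):
        # Snapshot the children so removals do not disturb the iteration
        children = list(new_line.findall('text'))
        for size, run in groupby(children, key=lambda el: el.get('size')):
            run = list(run)
            if size is None or len(run) == 1:
                continue  # unsized or single elements are left untouched
            # Fold the whole run into its first element
            run[0].text = ''.join(el.text or '' for el in run)
            for extra in run[1:]:
                new_line.remove(extra)


root = ET.fromstring(XML)
merge_same_size(root)
merged = [(el.get('size'), el.text) for el in root.iter('text')]
print(merged)
```

Because each run is collected before anything is removed, no element is skipped, and the empty text elements stay in place as separators between runs.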

【Discussion】:
