使用 python etree 打印 xml 的嵌套元素答案

【问题标题】：Print nested element of xml with python etree使用 python etree 打印 xml 的嵌套元素
【发布时间】：2019-06-18 06:04:27
【问题描述】：

我正在尝试构建一个脚本来读取 xml 文件。这是我第一次解析 xml，我正在使用带有 xml.etree.ElementTree 的 python 进行解析。我要处理的文件部分如下所示：

    <component>
        <section>
                <id root="42CB916B-BB58-44A0-B8D2-89B4B27F04DF" />
                <code code="34089-3" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="DESCRIPTION SECTION" />
                <title mediaType="text/x-hl7-title+xml">DESCRIPTION</title>
                <text>
                        <paragraph>Renese<sup>®</sup> is designated generically as polythiazide, and chemically as 2<content styleCode="italics">H</content>-1,2,4-Benzothiadiazine-7-sulfonamide, 6-chloro-3,4-dihydro-2-methyl-3-[[(2,2,2-trifluoroethyl)thio]methyl]-, 1,1-dioxide. It is a white crystalline substance, insoluble in water but readily soluble in alkaline solution.</paragraph>
                        <paragraph>Inert Ingredients: dibasic calcium phosphate; lactose; magnesium stearate; polyethylene glycol; sodium lauryl sulfate; starch; vanillin. The 2 mg tablets also contain: Yellow 6; Yellow 10.</paragraph>
                </text>
                <effectiveTime value="20051214" />
        </section>
</component>    
<component>
        <section>
               <id root="CF5D392D-F637-417C-810A-7F0B3773264F" />
               <code code="42229-5" codeSystem="2.16.840.1.113883.6.1" codeSystemName="LOINC" displayName="SPL UNCLASSIFIED SECTION" />
               <title mediaType="text/x-hl7-title+xml">ACTION</title>
               <text>
                        <paragraph>The mechanism of action results in an interference with the renal tubular mechanism of electrolyte reabsorption. At maximal therapeutic dosage all thiazides are approximately equal in their diuretic potency. The mechanism whereby thiazides function in the control of hypertension is unknown.</paragraph>
                </text>
                <effectiveTime value="20051214" />
                </section>
</component>

完整文件可从以下网址下载：

https://dailymed.nlm.nih.gov/dailymed/getFile.cfm?setid=abd6ecf0-dc8e-41de-89f2-1e36ed9d6535&type=zip&name=Renese

这是我的代码：

import xml.etree.ElementTree as ElementTree
import re

with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = f.read()

# Remove the default namespace definition (xmlns="http://some/namespace")
xmlstring = re.sub(r'\sxmlns="[^"]+"', '', xmlstring, count=1)

tree = ElementTree.fromstring(xmlstring)

for title in tree.iter('title'):
     print(title.text)

到目前为止，我可以打印标题，但我还想打印标签中捕获的相应文本。

我试过这个：

for title in tree.iter('title'):
     print(title.text)
     for paragraph in title.iter('paragraph'):
         print(paragraph.text)

但是我没有从paragraph.text 输出

在做

for title in tree.iter('title'):
         print(title.text)
         for paragraph in tree.iter('paragraph'):
             print(paragraph.text)

我打印段落的文本，但（显然）它是针对 xml 结构中找到的每个标题一起打印的。

我想找到一种方法来 1) 识别标题； 2) 打印相应的段落。我该怎么做？

【问题讨论】：

标签： python xml xml-parsing elementtree xml.etree

【解决方案1】：

如果你愿意使用lxml，那么下面是一个使用XPath的解决方案：

import re
from lxml.etree import fromstring


with open("ABD6ECF0-DC8E-41DE-89F2-1E36ED9D6535.xml") as f:
    xmlstring = f.read()

xmlstring = re.sub(r'\sxmlns="[^"]+"', '', xmlstring, count=1)

doc = fromstring(xmlstring.encode())  # lxml only accepts bytes input, hence we encode

for title in doc.xpath('//title'):  # for all title nodes
     title_text = title.xpath('./text()')  # get text value of the node

     # get all text values of the paragraph nodes that appear lower (//paragraph)
     # in the hierarchy than the parent (..) of <title>
     paragraphs_for_title = title.xpath('..//paragraph/text()')

     print(title_text[0] if title_text else '') 
     for paragraph in paragraphs_for_title: 
         print(paragraph)

【讨论】：