Extracting data from the XML of a .docx document: answers

【Question title】: Extracting data from the XML of a .docx document
【Posted】: 2020-04-20 06:13:49
【Question description】:

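I need to extract the data between tags, as described below. In addition, if several pieces of data correspond to the same id, I want to concatenate them.

For example, in the XML below, both <w:t> tags sit inside runs with the same ID "00F1234A", so "Hello World" should be extracted.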

xml_string="
<w:r w:rsid="00F1234A">     
    <w:rPr> 

    </w:rPr>
    <w:t>Hello</w:t>
</w:r>   


<w:r w:rsid="00F1234A">     
    <w:rPr> 

    </w:rPr>
    <w:t xml:space="preserve">World</w:t>
</w:r>"

Currently, I am using the following regular expression to extract the data between the tags:

re.findall("<w:t>(.+?)</w:t>",xml_string)

This gives me Hello, but not Hello World.

How can I concatenate the data that corresponds to the same id, in this case "00F1234A"?
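
A short sketch of why the pattern above misses the second run: the second opening tag is `<w:t xml:space="preserve">`, which the literal `<w:t>` cannot match. The sample string below mirrors the fragment in the question, and the more tolerant pattern is only an illustration, not a recommendation to parse XML with regular expressions.

```python
import re

# Sample fragment mirroring the one in the question.
xml_string = '''
<w:r w:rsid="00F1234A">
    <w:rPr>
    </w:rPr>
    <w:t>Hello</w:t>
</w:r>
<w:r w:rsid="00F1234A">
    <w:rPr>
    </w:rPr>
    <w:t xml:space="preserve">World</w:t>
</w:r>
'''

# The original pattern only matches an opening tag with no attributes.
strict = re.findall(r"<w:t>(.+?)</w:t>", xml_string)

# Allowing optional attributes after the tag name matches both runs.
tolerant = re.findall(r"<w:t(?:\s[^>]*)?>(.*?)</w:t>", xml_string, flags=re.DOTALL)

print(strict)    # ['Hello']
print(tolerant)  # ['Hello', 'World']
```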

【Question discussion】:

Tags: python regex xml pandas

【Solution 1】:

To parse it, you need the namespace declared in the XML (xmlns:x="urn:something").

Use ElementTree rather than a regular expression to extract the values, like this:

import xml.etree.ElementTree as ET

# declare namespace dictionary
nsmap = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

# parse the XML string (pass the variable, not the literal 'xml_string');
# it must be well-formed: a single root element that declares xmlns:w
root = ET.fromstring(xml_string)

tagvalues = []
# loop through all w:t tags and append their text to the list
for t in root.findall('.//w:r//w:t', nsmap):
    tagvalues.append(t.text or '')

# concatenate all values into one string
string = ' '.join(tagvalues)

Also see this post.
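
The answer's code collects every w:t value but does not group by id. A minimal sketch of grouping by the `w:rsid` attribute follows. The `<root>` wrapper is a hypothetical addition, needed only because the fragment in the question has no single root element and no `xmlns:w` declaration; a real `document.xml` already has both. Joining with a space is also an assumption, chosen to reproduce "Hello World".

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
nsmap = {"w": W_NS}

# Fragment mirroring the one in the question.
xml_fragment = '''
<w:r w:rsid="00F1234A">
    <w:rPr></w:rPr>
    <w:t>Hello</w:t>
</w:r>
<w:r w:rsid="00F1234A">
    <w:rPr></w:rPr>
    <w:t xml:space="preserve">World</w:t>
</w:r>
'''

# Hypothetical wrapper element that supplies a root and the w: namespace.
wrapped = f'<root xmlns:w="{W_NS}">{xml_fragment}</root>'
root = ET.fromstring(wrapped)

# Collect the text of each run under its rsid value.
texts_by_rsid = defaultdict(list)
for run in root.findall(".//w:r", nsmap):
    # Namespaced attributes are stored under "{uri}name" keys.
    rsid = run.get(f"{{{W_NS}}}rsid")
    for t in run.findall("w:t", nsmap):
        texts_by_rsid[rsid].append(t.text or "")

# Join the pieces per id (space separator is an assumption).
joined = {rsid: " ".join(parts) for rsid, parts in texts_by_rsid.items()}
print(joined)  # {'00F1234A': 'Hello World'}
```

In a real document the text inside w:t usually carries its own spacing, so joining with an empty string may be closer to what Word displays.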

【Discussion】:

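Thanks.. but what does this line do? ``` nsmap = {'w':'schemas.openxmlformats.org/wordprocessingml/2006/main'} ```
nsmap (namespace map) is the same as the namespace variable. I edited my answer to use it. Passing the namespace map to root.findall lets you write the prefix w in the XPath query './/w:t'; each w is then replaced by the URI schemas.openxmlformats..., so you do not have to type it out yourself. The URI keeps XML namespaces unambiguous within a document, so two namespaces cannot collide. See the documentation (docs.python.org/2/library/…) and Wikipedia (en.wikipedia.org/wiki/XML_namespace).
Does this answer your question?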
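
To illustrate the namespace-map comment above, here is a small sketch showing that a prefixed query with `nsmap` and a query that spells out the URI in `{uri}tag` form find the same elements. The one-paragraph document is made up for the example.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
nsmap = {"w": W_NS}

# Made-up minimal document that declares the w: prefix.
doc = f'<w:p xmlns:w="{W_NS}"><w:r><w:t>Hello</w:t></w:r></w:p>'
root = ET.fromstring(doc)

# ElementTree stores qualified names as "{uri}local", not "w:local".
print(root.tag)

# These two queries are equivalent: the map expands "w:" to the URI.
with_prefix = root.findall(".//w:t", nsmap)
with_uri = root.findall(f".//{{{W_NS}}}t")

print([t.text for t in with_prefix])  # ['Hello']
print(with_prefix == with_uri)        # True
```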