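According to [ReadTheDocs.Python-DocX]: Style-related objects - _NumberingStyle objects, this functionality is not implemented yet.
The alternative (at least one of them), [PyPI]: docx2python, handles these elements rather poorly (mainly because it returns everything converted to strings).
So, one solution is to parse the XML files manually. I worked out how empirically, using this very example. A good place for documentation is Office Open XML (I don't know whether it is the standard followed by all tools that deal with .docx files, MS Word in particular):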
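- Get each paragraph (w:p node) from word/document.xml
- Check whether it is a numbered item: it has a w:pPr -> w:numPr subnode
- Get the numbering style Id and the level: the w:val attribute of the w:numId and w:ilvl subnodes (of the node from the previous bullet)
- Match the 2 values against (in word/numbering.xml):
  - the w:abstractNumId attribute of a w:abstractNum node (strictly speaking, w:numId refers to a w:num node, whose w:abstractNumId subnode points to the w:abstractNum node)
  - the w:ilvl attribute of its w:lvl subnode
- Get the w:val attribute of the corresponding w:numFmt and w:lvlText subnodes (note that bullets are included as well; they can be told apart by the bullet value of the aforementioned w:numFmt attribute)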
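The steps above can be sketched with the standard library alone. This is a minimal illustration, not a complete implementation: the function names (`read_docx_parts`, `numbering_formats`, `numbered_paragraphs`) are made up for this example, the lookup goes through the w:num node, and it ignores w:lvlOverride as well as numbering that is inherited from a paragraph style.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def w_attr(node, name):
    """Return the w:<name> attribute of a node (None if it is missing)."""
    return node.get("{%s}%s" % (W_NS, name))


def read_docx_parts(path):
    """Return the raw bytes of word/document.xml and word/numbering.xml."""
    with zipfile.ZipFile(path) as zf:
        return zf.read("word/document.xml"), zf.read("word/numbering.xml")


def numbering_formats(numbering_xml):
    """Map (numId, ilvl) -> (numFmt, lvlText), all kept as strings."""
    root = ET.fromstring(numbering_xml)
    abstract = {}
    for abs_num in root.findall("w:abstractNum", NS):
        levels = {}
        for lvl in abs_num.findall("w:lvl", NS):
            fmt = lvl.find("w:numFmt", NS)
            text = lvl.find("w:lvlText", NS)
            levels[w_attr(lvl, "ilvl")] = (
                w_attr(fmt, "val") if fmt is not None else None,
                w_attr(text, "val") if text is not None else None,
            )
        abstract[w_attr(abs_num, "abstractNumId")] = levels
    formats = {}
    # w:num nodes link a numId (used in document.xml) to an abstractNum.
    for num in root.findall("w:num", NS):
        ref = num.find("w:abstractNumId", NS)
        if ref is None:
            continue
        for ilvl, data in abstract.get(w_attr(ref, "val"), {}).items():
            formats[(w_attr(num, "numId"), ilvl)] = data
    return formats


def numbered_paragraphs(document_xml, formats):
    """Yield (text, level, numFmt, lvlText); the last 3 are None for plain paragraphs."""
    root = ET.fromstring(document_xml)
    for par in root.iter("{%s}p" % W_NS):
        text = "".join(t.text or "" for t in par.iter("{%s}t" % W_NS))
        num_pr = par.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            yield text, None, None, None
            continue
        num_id = num_pr.find("w:numId", NS)
        ilvl = num_pr.find("w:ilvl", NS)
        num_id_val = w_attr(num_id, "val") if num_id is not None else None
        # Assumption: a missing w:ilvl is treated as level 0.
        ilvl_val = (w_attr(ilvl, "val") if ilvl is not None else None) or "0"
        fmt, lvl_text = formats.get((num_id_val, ilvl_val), (None, None))
        yield text, int(ilvl_val), fmt, lvl_text
```

For a real file, something like `doc_xml, num_xml = read_docx_parts("sample.docx")` followed by `numbered_paragraphs(doc_xml, numbering_formats(num_xml))` would list each paragraph with its level and format; an item is a bullet when its numFmt is `bullet`. Turning numFmt and lvlText into the actual marker ("1)", "a)", "I)") still requires counting items per list and level, which is not covered here.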
This seems awfully complex though, so I am proposing a workaround (gainarie) that relies on docx2python's partial support.
Test document (sample.docx - created with LibreOffice):
code00.py:
#!/usr/bin/env python

import sys
import docx
from docx2python import docx2python as dx2py


def ns_tag_name(node, name):
    # Build the namespace-qualified ("{uri}name") form of a tag / attribute name,
    # using the namespace of the given node.
    if node.nsmap and node.prefix:
        return "{{{:s}}}{:s}".format(node.nsmap[node.prefix], name)
    return name


def descendants(node, desc_strs):
    # Walk down the tree following desc_strs (one tuple of tag names per level).
    # Returns a (nested) dict keyed by tag name, or a list holding the node itself
    # once all levels have been consumed.
    if node is None:
        return []
    if not desc_strs:
        return [node]
    ret = {}
    for child_str in desc_strs[0]:
        for child in node.iterchildren(ns_tag_name(node, child_str)):
            descs = descendants(child, desc_strs[1:])
            if not descs:
                continue
            cd = ret.setdefault(child_str, [])
            if isinstance(descs, list):
                cd.extend(descs)
            else:
                cd.append(descs)
    return ret


def simplified_descendants(desc_dict):
    # Flatten the nested dict returned by descendants into a list of leaf nodes.
    ret = []
    for vs in desc_dict.values():
        for v in vs:
            if isinstance(v, dict):
                ret.extend(simplified_descendants(v))
            else:
                ret.append(v)
    return ret


def process_list_data(attrs, dx2py_elem):
    #print(simplified_descendants(attrs))
    desc = simplified_descendants(attrs)[0]  # The w:ilvl node
    level = int(desc.attrib[ns_tag_name(desc, "val")])
    # The list marker is the first non-empty tab-separated token from docx2python
    elem = [i for i in dx2py_elem[0].split("\t") if i][0]#.rstrip(")")
    return " " * level + elem + " "


def main(*argv):
    fname = r"./sample.docx"
    docd = docx.Document(fname)
    docdpy = dx2py(fname)
    dr = docdpy.docx_reader
    #print(dr.files)  # !!! Check word/numbering.xml !!!
    docdpy_runs = docdpy.document_runs[0][0][0]
    # Paragraphs from the 2 libraries are matched by position
    if len(docd.paragraphs) != len(docdpy_runs):
        print("Lengths don't match. Abort")
        return -1
    subnode_tags = (("pPr",), ("numPr",), ("ilvl",))  # (("pPr",), ("numPr",), ("ilvl", "numId"))  # numId is for matching elements from word/numbering.xml
    for idx, (par, l) in enumerate(zip(docd.paragraphs, docdpy_runs)):
        #print(par.text, l)
        numbered_attrs = descendants(par._element, subnode_tags)
        #print(numbered_attrs)
        if numbered_attrs:
            print(process_list_data(numbered_attrs, l) + par.text)
        else:
            print(par.text)


if __name__ == "__main__":
    print("Python {:s} {:03d}bit on {:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                   64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    rc = main(*sys.argv[1:])
    print("\nDone.")
    sys.exit(rc)
Output:
[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q066374154]> "e:\Work\Dev\VEnvs\py_pc064_03.09_test0\Scripts\python.exe" code00.py
Python 3.9.9 (tags/v3.9.9:ccb0e6a, Nov 15 2021, 18:08:50) [MSC v.1929 64 bit (AMD64)] 064bit on win32
Doc title
doc subtitle
heading1 text0
Paragr0 line0
Paragr0 line1
Paragr0 line2
space Paragr0 line3
a) aa (numbered)
heading1 text1
Paragrx line0
Paragrx line1
a) w tabs Paragrx line2 (NOT numbered – just to mimic 1ax below)
1) paragrx 1x (numbered)
a) paragrx 1ax (numbered)
I) paragrx 1aIx (numbered)
b) paragrx 1bx (numbered)
2) paragrx 2x (numbered)
3) paragrx 3x (numbered)
-- paragrx bullet 0
-- paragrx bullet 00
paragxx text
Done.
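Notes:
- Only nodes from word/document.xml are processed (via the paragraph's _element attribute, which is an LXML node)
- Some list attributes are not captured (due to docx2python's limitations)
- This is far from being robust
- descendants and simplified_descendants could be simplified a lot, but I wanted to keep the former as generic as possible (in case the functionality needs to be extended)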
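As an illustration of the last note, here is one possible simplification, assuming genericity is not needed and only the level and the numbering Id are of interest. The helper name `list_info` is hypothetical; it relies only on the `find(path, namespaces)` method that both LXML and xml.etree.ElementTree elements provide, so it should also accept a python-docx paragraph's `_element`.

```python
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def list_info(par_element):
    """Return (ilvl, numId) as ints for a numbered w:p element, None otherwise.

    Either value is None if the corresponding subnode (or its w:val) is missing.
    """
    num_pr = par_element.find("w:pPr/w:numPr", NS)
    if num_pr is None:
        return None  # Not a numbered / bulleted paragraph

    def val(tag):
        node = num_pr.find("w:" + tag, NS)
        raw = node.get("{%s}val" % W_NS) if node is not None else None
        return int(raw) if raw is not None else None

    return val("ilvl"), val("numId")
```

In code00.py this could replace the `descendants` / `simplified_descendants` pair, e.g. `info = list_info(par._element)` and then `info[0]` as the level. The trade-off is that the path is hard-coded, which is exactly the genericity the original helpers were meant to keep.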