How to extract section numbers in a docx document using python-docx?

【Question title】: How to extract section numbers in a docx document using python-docx?
【Posted】: 2016-08-04 20:23:28
【Question】:

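I have a docx document that is divided into sections and subsections, e.g.

Section A

texttexttext

texttexttext

1.1 texttexttext

texttexttext

(a) texttexttext

I want to use python-docx to extract the text. Getting the text of the paragraphs is easy, but I can't figure out how to get the text of the section headings (e.g. "1." and "(a)" etc.). Is there an easy way to do this?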

【Comments on the question】:

Tags: python-docx

【Solution 1】:

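This depends on how rigorous the author was when building the document.

In the best case, the author applied styles to all the section headings. You can then simply walk the paragraphs and pick out the ones that have the "Heading 1" style.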

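from docx import Document

document = Document('example.docx')  # placeholder path, use your own file
for paragraph in document.paragraphs:
    if paragraph.style.name == 'Heading 1':
        print(paragraph.text)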
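To cover every heading level rather than only "Heading 1", the same idea can be generalized. The following is a minimal sketch, assuming the document uses Word's built-in "Heading 1" to "Heading 9" styles; the names `iter_headings` and `load_headings` are made up for this illustration and are not part of python-docx.

```python
import re

# Matches the built-in heading style names "Heading 1" .. "Heading 9".
HEADING_STYLE = re.compile(r"^Heading (\d)$")


def iter_headings(paragraphs):
    """Yield (level, text) for each paragraph that uses a 'Heading N' style."""
    for paragraph in paragraphs:
        match = HEADING_STYLE.match(paragraph.style.name or "")
        if match:
            yield int(match.group(1)), paragraph.text


def load_headings(path):
    """Hypothetical convenience wrapper: open a .docx and list its headings."""
    # Imported here so iter_headings can be tried out without python-docx.
    from docx import Document

    return list(iter_headings(Document(path).paragraphs))
```

Note that `paragraph.text` only contains characters that are actually stored in the paragraph's runs. If "1." or "(a)" were typed by hand they will appear in the text; if Word generated them through automatic numbering, they will not.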

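If the author instead used character formatting such as bold and font size to mark headings, your job gets much harder, because that formatting is unlikely to identify headings uniquely.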
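For that harder case, a heuristic is about the best you can do. Here is a rough sketch, assuming headings are short paragraphs in which every run was explicitly set to bold; the function name `looks_like_bold_heading` and the `max_chars` cutoff are arbitrary choices for illustration, and body text that happens to be bold will produce false positives.

```python
def looks_like_bold_heading(paragraph, max_chars=80):
    """Heuristic: a short paragraph whose every non-blank run is explicitly bold."""
    text = paragraph.text.strip()
    if not text or len(text) > max_chars:
        return False
    # Ignore runs that contain only whitespace.
    runs = [run for run in paragraph.runs if run.text.strip()]
    # run.bold is True, False, or None (None means "inherit from the style"),
    # so only an explicit True counts here.
    return bool(runs) and all(run.bold is True for run in runs)
```

You would call it as `[p.text for p in document.paragraphs if looks_like_bold_heading(p)]` and then review the result by hand.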

【Comments】:

What if there is a table below one of the headings? How do you determine that the table belongs to the first heading?
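One possible approach to the table question, offered only as a sketch: walk the body in document order and attach each table to the most recent heading seen. The helper `group_tables_by_heading` is a made-up name. It assumes headings use a style whose name starts with "Heading", and it recognizes tables by the presence of a `rows` attribute rather than by type.

```python
def group_tables_by_heading(blocks):
    """Map each heading's text to the tables that follow it, in document order."""
    groups = {}
    current = None  # text of the most recent heading; None before the first one
    for block in blocks:
        if hasattr(block, "rows"):  # duck-typed check: tables have rows
            groups.setdefault(current, []).append(block)
        elif (block.style.name or "").startswith("Heading"):
            current = block.text
            groups.setdefault(current, [])
    return groups
```

Assuming python-docx 1.0 or later, where `Document.iter_inner_content()` yields paragraphs and tables in the order they appear, the call would be `group_tables_by_heading(document.iter_inner_content())`. Tables that appear before any heading end up under the key `None`, and headings with identical text share one entry.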

【Solution 2】:

I suggest you use sections, as in the following example:

    >>> from docx import Document
    >>> document = Document()
    >>> sections = document.sections
    >>> sections
    <docx.parts.document.Sections object at 0x1deadbeef>
    >>> len(sections)
    3
    >>> section = sections[0]
    >>> section
    <docx.section.Section object at 0x1deadbeef>
    >>> for section in sections:
    ...     print(section.start_type)
    ...
    NEW_PAGE (2)
    EVEN_PAGE (3)
    ODD_PAGE (4)

【Comments】:

I believe the author of the question is actually asking about "headings" rather than sections (as defined in MS Word documents).