How can I extract the TOC with PyPDF2? Answers

【Question title】: How can I extract the TOC with PyPDF2?
【Posted】: 2018-01-08 19:53:42
【Question】:

Take this pdf as an example. I can extract the table of contents (TOC) with dumppdf.py -T 1707.09725.pdf:

<outlines>
    <outline level="1" title="1 Introduction">
        <dest>
            <list size="5">
                <ref id="513"/>
                <literal>XYZ</literal>
                <number>99.213</number>
                <number>742.911</number>
                <null/>
            </list>
        </dest>
        <pageno>14</pageno>
    </outline>
    <outline level="1" title="2 Convolutional Neural Networks">
        <dest>
            <list size="5">
                <ref id="554"/>
                <literal>XYZ</literal>
                <number>99.213</number>
                <number>742.911</number>
                <null/>
            </list>
        </dest>
        <pageno>16</pageno>
    </outline>
...

Can I do something similar with PyPDF2?

【Comments on the question】:

Tags: pdf pypdf2

【Answer 1】:

Found it:

from PyPDF2 import PdfFileReader

reader = PdfFileReader(open("1707.09725.pdf", 'rb'))

print(reader.outlines)

This gives:

[{'/Title': '1 Introduction', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(513, 0)},
 {'/Title': '2 Convolutional Neural Networks', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(554, 0)}, [{'/Title': '2.1 Linear Image Filters', '/Left': 99.213, '/Type': '/XYZ', '/Top': 486.791, '/Zoom': ..., '/Page': IndirectObject(554, 0)},
 {'/Title': '2.2 CNN Layer Types', '/Left': 70.866, '/Type': '/XYZ', '/Top': 316.852, '/Zoom': ..., '/Page': IndirectObject(580, 0)},
[{'/Title': '2.2.1 Convolutional Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 562.722, '/Zoom': ..., '/Page': IndirectObject(608, 0)},
 {'/Title': '2.2.2 Pooling Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 299.817, '/Zoom': ..., '/Page': IndirectObject(654, 0)},
 {'/Title': '2.2.3 Dropout', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(689, 0)},
 {'/Title': '2.2.4 Normalization Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 193.779, '/Zoom': <PyPDF2.generic.NullObject object at 0x7fbe49d14350>, '/Page': IndirectObject(689, 0)}]
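The printed result is a nested list: each inner list holds the children of the entry just before it. A minimal sketch of walking that structure recursively is shown here. The helper name `flatten_outline`, its `page_of` callback, and the `sample` data are illustrative assumptions, not part of PyPDF2; the sample uses plain dicts shaped like the output above so that it runs without a PDF file.

```python
def flatten_outline(outline, level=1, page_of=None):
    """Yield (level, title, page) for every entry in a nested outline list.

    `page_of` is an optional callable that maps an entry to a page number;
    when it is omitted, the page is reported as None.
    """
    for item in outline:
        if isinstance(item, list):
            # A nested list holds the children of the entry just before it.
            yield from flatten_outline(item, level + 1, page_of)
        else:
            page = page_of(item) if page_of is not None else None
            yield level, item["/Title"], page


# Stand-in data shaped like the printed result (assumption: plain dicts).
sample = [
    {"/Title": "1 Introduction"},
    {"/Title": "2 Convolutional Neural Networks"},
    [
        {"/Title": "2.1 Linear Image Filters"},
        {"/Title": "2.2 CNN Layer Types"},
        [{"/Title": "2.2.1 Convolutional Layers"}],
    ],
]

for level, title, page in flatten_outline(sample):
    # Indent each title by its nesting depth.
    print("    " * (level - 1) + title)
```

With a real file, passing `reader.outlines` as the first argument and something like `page_of=lambda d: reader.getDestinationPageNumber(d) + 1` should give 1-based page numbers, assuming the same PyPDF2 version as in this answer.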

【Comments】:

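To elaborate further, you can use the following to get only the titles and page numbers. Per @shawmat:

import PyPDF2

def bookmark_dict(bookmark_list):
    result = {}
    for item in bookmark_list:
        if isinstance(item, list):
            # recursive call
            result.update(bookmark_dict(item))
        else:
            try:
                # key: 1-based page number, value: bookmark title
                result[reader.getDestinationPageNumber(item) + 1] = item.title
            except Exception:
                pass
    return result

reader = PyPDF2.PdfFileReader("[your filename]")
print(bookmark_dict(reader.getOutlines()))

Note that the result is keyed by page number, so when several bookmarks point to the same page only the last one is kept.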

【Answer 2】:

Alternatively, as suggested in this answer, you can use pikepdf:

from pikepdf import Pdf

path = "path/to/file.pdf"

with Pdf.open(path) as pdf:
    outline = pdf.open_outline()
    for title in outline.root:
        print(title)
        for subtitle in title.children:
            print('\t', subtitle)
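The loop above prints only two levels of the outline. For deeper tables of contents, a recursive walk over the `title` and `children` attributes is one option. The sketch below is an assumption-laden illustration: `walk_outline` and `demo_root` are hypothetical names, and the demo uses `SimpleNamespace` stand-ins rather than real pikepdf outline items so that it runs without a PDF.

```python
from types import SimpleNamespace


def walk_outline(items, level=0):
    """Yield (level, title) for each item and all of its descendants."""
    for item in items:
        yield level, item.title
        # Recurse into the children, one level deeper.
        yield from walk_outline(item.children, level + 1)


# Hypothetical stand-ins that expose the same two attributes used above.
demo_root = [
    SimpleNamespace(title="2 Convolutional Neural Networks", children=[
        SimpleNamespace(title="2.2 CNN Layer Types", children=[
            SimpleNamespace(title="2.2.1 Convolutional Layers", children=[]),
        ]),
    ]),
]

for level, title in walk_outline(demo_root):
    # One tab per nesting level, as in the loop above.
    print("\t" * level + title)
```

With pikepdf, the same function would presumably be called as `walk_outline(outline.root)` inside the `with Pdf.open(path) as pdf:` block.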

【Comments】: