【Question title】: How can I extract the TOC with PyPDF2?
【Posted】: 2018-01-08 19:53:42
【Question】:

Take this PDF (1707.09725.pdf) as an example. I can extract the table of contents (TOC) with dumppdf.py -T 1707.09725.pdf:

<outlines>
    <outline level="1" title="1 Introduction">
        <dest>
            <list size="5">
                <ref id="513"/>
                <literal>XYZ</literal>
                <number>99.213</number>
                <number>742.911</number>
                <null/>
            </list>
        </dest>
        <pageno>14</pageno>
    </outline>
    <outline level="1" title="2 Convolutional Neural Networks">
        <dest>
            <list size="5">
                <ref id="554"/>
                <literal>XYZ</literal>
                <number>99.213</number>
                <number>742.911</number>
                <null/>
            </list>
        </dest>
        <pageno>16</pageno>
    </outline>
...

Can I do something similar with PyPDF2?

【Comments】:

    Tags: pdf pypdf2


    【Solution 1】:

    Found it:

    from PyPDF2 import PdfFileReader
    
    reader = PdfFileReader(open("1707.09725.pdf", 'rb'))
    
    print(reader.outlines)
    

    which gives:

    [{'/Title': '1 Introduction', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(513, 0)},
     {'/Title': '2 Convolutional Neural Networks', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(554, 0)},
    [{'/Title': '2.1 Linear Image Filters', '/Left': 99.213, '/Type': '/XYZ', '/Top': 486.791, '/Zoom': ..., '/Page': IndirectObject(554, 0)},
     {'/Title': '2.2 CNN Layer Types', '/Left': 70.866, '/Type': '/XYZ', '/Top': 316.852, '/Zoom': ..., '/Page': IndirectObject(580, 0)},
    [{'/Title': '2.2.1 Convolutional Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 562.722, '/Zoom': ..., '/Page': IndirectObject(608, 0)},
     {'/Title': '2.2.2 Pooling Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 299.817, '/Zoom': ..., '/Page': IndirectObject(654, 0)},
     {'/Title': '2.2.3 Dropout', '/Left': 99.213, '/Type': '/XYZ', '/Top': 742.911, '/Zoom': ..., '/Page': IndirectObject(689, 0)},
     {'/Title': '2.2.4 Normalization Layers', '/Left': 99.213, '/Type': '/XYZ', '/Top': 193.779, '/Zoom': <PyPDF2.generic.NullObject object at 0x7fbe49d14350>, '/Page': IndirectObject(689, 0)}]
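
    As the output above shows, reader.outlines nests child entries in sub-lists. Below is a minimal sketch of flattening that shape into (depth, title) pairs, assuming each entry can be indexed like a dict with a '/Title' key. The name flatten_outline and the sample data are hypothetical, written for illustration, and are not part of PyPDF2.

```python
def flatten_outline(outline, depth=0):
    """Yield (depth, title) for every entry in a nested outline list."""
    for item in outline:
        if isinstance(item, list):
            # A sub-list holds the children of the entry that precedes it.
            yield from flatten_outline(item, depth + 1)
        else:
            yield depth, item["/Title"]


# Hand-written sample mirroring the shape of the output above
# (plain dicts, not real PyPDF2 objects).
sample = [
    {"/Title": "1 Introduction"},
    {"/Title": "2 Convolutional Neural Networks"},
    [
        {"/Title": "2.1 Linear Image Filters"},
        {"/Title": "2.2 CNN Layer Types"},
        [{"/Title": "2.2.1 Convolutional Layers"}],
    ],
]

for depth, title in flatten_outline(sample):
    print("    " * depth + title)
```

    With a real file, the same function could presumably be called as flatten_outline(reader.outlines).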
    

    【Comments】:

    • To elaborate further, you can use the following to get only the titles and page numbers. Per @shawmat:

        import PyPDF2

        def bookmark_dict(bookmark_list):
            result = {}
            for item in bookmark_list:
                if isinstance(item, list):
                    # recursive call
                    result.update(bookmark_dict(item))
                else:
                    try:
                        result[reader.getDestinationPageNumber(item) + 1] = item.title
                    except Exception:
                        # skip entries whose destination page cannot be resolved
                        pass
            return result

        reader = PyPDF2.PdfFileReader("[your filename]")
        print(bookmark_dict(reader.getOutlines()))
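
    One caveat with keying the result by page number: bookmarks that share a page overwrite each other (in the output above, sections 2 and 2.1 both point at page object 554). Here is a hedged sketch that keeps every entry as a list of (page, title) pairs instead. The names bookmark_pairs and page_number_of are hypothetical; with PyPDF2 one would presumably pass reader.getDestinationPageNumber as the second argument. The sample data is made up so the sketch runs without a PDF.

```python
def bookmark_pairs(bookmark_list, page_number_of):
    """Return a list of (1-based page number, title), keeping duplicates."""
    pairs = []
    for item in bookmark_list:
        if isinstance(item, list):
            # Recurse into the children and keep them in document order.
            pairs.extend(bookmark_pairs(item, page_number_of))
        else:
            pairs.append((page_number_of(item) + 1, item["/Title"]))
    return pairs


# Stand-in data: "page" holds a made-up 0-based page index
# in place of a real PDF page reference.
sample = [
    {"/Title": "2 Convolutional Neural Networks", "page": 15},
    [{"/Title": "2.1 Linear Image Filters", "page": 15}],
]

print(bookmark_pairs(sample, lambda item: item["page"]))
```

    Both entries survive here, whereas a dict keyed by page number would retain only the last title for that page.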
    【Solution 2】:

    Alternatively, as suggested in this answer, you can use pikepdf:

    from pikepdf import Pdf
    
    path = "path/to/file.pdf"
    
    with Pdf.open(path) as pdf:
        outline = pdf.open_outline()
        for title in outline.root:
            print(title)
            for subtitle in title.children:
                print('\t', subtitle)
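
    The loop above prints only two levels of the outline. A minimal recursive sketch for arbitrary depth follows, assuming each outline item exposes .title and .children as in the snippet above. The name walk_outline is hypothetical, and the demo uses stand-in objects so that it runs without pikepdf or a PDF file.

```python
from types import SimpleNamespace


def walk_outline(items, depth=0):
    """Yield (depth, title) for each outline item and all of its descendants."""
    for item in items:
        yield depth, item.title
        yield from walk_outline(item.children, depth + 1)


# Stand-in objects with the same .title / .children attributes.
demo = [
    SimpleNamespace(title="1 Introduction", children=[]),
    SimpleNamespace(
        title="2 Convolutional Neural Networks",
        children=[
            SimpleNamespace(title="2.1 Linear Image Filters", children=[]),
        ],
    ),
]

for depth, title in walk_outline(demo):
    print("\t" * depth + title)
```

    With pikepdf, this would presumably be called as walk_outline(outline.root) inside the with Pdf.open(path) as pdf: block.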
    

    【Comments】:
