【Question title】: Reading PDF files with Python 3.6
【Posted】: 2017-12-13 15:17:19
【Question】:

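Is there a way to open and read PDF files with Python 3.6? I have tried several libraries and tools, including PyPDF2 and pdfrw, but none of them could extract the text content of the PDF document. Any help would be greatly appreciated.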

【Comments】:

    Tags: python python-3.x pdf


    【Answer 1】:

    Try PyMuPDF.

    Python recipe: PDF text extraction using fitz / MuPDF (PyMuPDF):

    #!/usr/bin/env python
    """
    Created on Wed Jul 29 07:00:00 2015
    
    @author: Jorj McKie
    Copyright (c) 2015 Jorj X. McKie
    
    The license of this program is governed by the GNU GENERAL PUBLIC LICENSE
    Version 3, 29 June 2007. See the "COPYING" file of this repository.
    
    This is an example for using the Python binding PyMuPDF of MuPDF.
    
    This program extracts the text of an input PDF and writes it in a text file.
    The input file name is provided as a parameter to this script (sys.argv[1])
    The output file name is input-filename appended with ".txt".
    Encoding of the text in the PDF is assumed to be UTF-8.
    Change the ENCODING variable as required.
    -------------------------------------------------------------------------------
    """
    import fitz                 # this is PyMuPDF
    import sys, json
    
    ENCODING = "UTF-8"
    
    def SortBlocks(blocks):
        '''
        Sort the blocks of a TextPage in ascending vertical pixel order,
        then in ascending horizontal pixel order.
        This should sequence the text in a more readable form, at least by
        convention of the Western hemisphere: from top-left to bottom-right.
        If you need something else, change the sortkey variable accordingly ...
        '''
    
        sblocks = []
        for b in blocks:
            x0 = str(int(b["bbox"][0]+0.99999)).rjust(4,"0") # x coord in pixels
            y0 = str(int(b["bbox"][1]+0.99999)).rjust(4,"0") # y coord in pixels
            sortkey = y0 + x0                                # = "yx"
            sblocks.append([sortkey, b])
        sblocks.sort(key=lambda e: e[0])   # compare keys only; dicts cannot be ordered in Python 3
        return [b[1] for b in sblocks] # return sorted list of blocks
    
    def SortLines(lines):
        ''' Sort the lines of a block in ascending vertical direction. See comment
        in SortBlocks function.
        '''
        slines = []
        for l in lines:
            y0 = str(int(l["bbox"][1] + 0.99999)).rjust(4,"0")
            slines.append([y0, l])
        slines.sort(key=lambda e: e[0])    # compare keys only; dicts cannot be ordered in Python 3
        return [l[1] for l in slines]
    
    def SortSpans(spans):
        ''' Sort the spans of a line in ascending horizontal direction. See comment
        in SortBlocks function.
        '''
        sspans = []
        for s in spans:
            x0 = str(int(s["bbox"][0] + 0.99999)).rjust(4,"0")
            sspans.append([x0, s])
        sspans.sort(key=lambda e: e[0])    # compare keys only; dicts cannot be ordered in Python 3
        return [s[1] for s in sspans]
    
    #==============================================================================
    # Main Program
    #==============================================================================
    ifile = sys.argv[1]
    ofile = ifile + ".txt"
    
    doc = fitz.Document(ifile)
    pages = doc.pageCount
    fout = open(ofile,"wb")                          # binary mode: pg_text is encoded to bytes before writing
    
    for i in range(pages):
        pg_text = ""                                 # initialize page text buffer
        pg = doc.loadPage(i)                         # load page number i
        text = pg.getText(output = 'json')           # get its text in JSON format
        pgdict = json.loads(text)                    # create a dict out of it
        blocks = SortBlocks(pgdict["blocks"])        # now re-arrange ... blocks
        for b in blocks:
            lines = SortLines(b["lines"])            # ... lines
            for l in lines:
                spans = SortSpans(l["spans"])        # ... spans
                for s in spans:
                    # ensure that spans are separated by at least 1 blank
                    # (should make sense in most cases)
                    if pg_text.endswith(" ") or s["text"].startswith(" "):
                        pg_text += s["text"]
                    else:
                        pg_text += " " + s["text"]
                pg_text += "\n"                      # separate lines by newline
    
        pg_text = pg_text.encode(ENCODING, "ignore")
        fout.write(pg_text)
    
    fout.close()
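
The recipe above uses the camelCase method names that PyMuPDF offered in 2017. As a minimal sketch of the same top-left to bottom-right ordering idea, the version below sorts on numeric tuples instead of zero-padded strings and assumes a recent PyMuPDF release in which `fitz.open`, iteration over the document, and `page.get_text("dict")` are available. The function names `reading_order_key`, `page_text`, and `extract_pdf_text` are made up for this illustration and are not part of PyMuPDF.

```python
import math


def reading_order_key(item):
    """Sort key: top-to-bottom, then left-to-right, by whole-pixel position."""
    x0, y0 = item["bbox"][0], item["bbox"][1]
    return (math.ceil(y0), math.ceil(x0))


def page_text(page_dict):
    """Join the spans of one page dict into text, one output line per text line."""
    out = []
    for block in sorted(page_dict["blocks"], key=reading_order_key):
        # Blocks without a "lines" entry (for example image blocks) are skipped.
        lines = sorted(block.get("lines", []), key=lambda l: math.ceil(l["bbox"][1]))
        for line in lines:
            text = ""
            for span in sorted(line["spans"], key=lambda s: math.ceil(s["bbox"][0])):
                # Separate adjacent spans by at least one blank.
                if text and not (text.endswith(" ") or span["text"].startswith(" ")):
                    text += " "
                text += span["text"]
            out.append(text + "\n")
    return "".join(out)


def extract_pdf_text(path):
    """Return the text of every page of the PDF at `path` (needs PyMuPDF)."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    with fitz.open(path) as doc:
        return "".join(page_text(page.get_text("dict")) for page in doc)
```

To write the result to a file under Python 3, one option is to open the output in text mode with an explicit encoding, for example `open(ofile, "w", encoding="utf-8")`, and pass it the string returned by `extract_pdf_text(ifile)`. Unlike the recipe, this sketch does not put a blank in front of the first span of each line.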
    

    【Comments】:

      【Answer 2】:

      Try pdfrw 0.4.

      Here is the link: https://pypi.python.org/pypi/pdfrw

      【Comments】:
