Extracting PDF text with Python 3.4: answers

【Question title】: PDF text extract with Python3.4
【Posted】: 2015-09-10 11:49:12
【Question description】:

The text in the PDF file is real text, not a scanned image. PDFMiner does not support Python 3. Is there another solution?

【Comments】:

github.com/mstamy2/PyPDF2 ?
There is a Python 3 version of the PDFMiner library: pypi.python.org/pypi/pdfminer3k

【Solution 1】:

There is also the pdfminer2 fork, which supports Python 3.4 and can be installed with pip3. https://github.com/metachris/pdfminer

This thread helped me patch things together.

from urllib.request import urlopen
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO, BytesIO

def readPDF(pdfFile):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)

    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(pdfFile, pagenos, maxpages=maxpages, password=password,
                                  caching=caching, check_extractable=True):
        interpreter.process_page(page)

    device.close()
    textstr = retstr.getvalue()
    retstr.close()
    return textstr

if __name__ == "__main__":
    #scrape = open("../warandpeace/chapter1.pdf", 'rb') # for local files
    scrape = urlopen("http://pythonscraping.com/pages/warandpeace/chapter1.pdf") # for external files
    pdfFile = BytesIO(scrape.read())
    outputString = readPDF(pdfFile)
    print(outputString)
    pdfFile.close()

【Discussion】:

【Solution 2】:

For Python 3, you can install pdfminer as follows:

python -m pip install pdfminer.six
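
Once pdfminer.six is installed, extraction could look roughly like the sketch below. It assumes the `extract_text` helper from `pdfminer.high_level`, which ships with recent pdfminer.six releases. The function names `non_empty_lines` and `pdf_to_lines` are made up for illustration, and the PDF path is whatever you pass on the command line.

```python
import sys


def non_empty_lines(text):
    """Split extracted text into stripped, non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def pdf_to_lines(path):
    """Extract text from a PDF with pdfminer.six and return its non-empty lines."""
    # Imported lazily so non_empty_lines() can be used without pdfminer.six.
    # Assumption: pdfminer.six is installed and provides high_level.extract_text.
    from pdfminer.high_level import extract_text
    return non_empty_lines(extract_text(path))


if __name__ == "__main__":
    # Usage (hypothetical script name): python extract.py some_file.pdf
    if len(sys.argv) > 1:
        for line in pdf_to_lines(sys.argv[1]):
            print(line)
```

This only works for PDFs that contain real text, as in the question. A scanned PDF would need OCR instead.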

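【Discussion】:

【Solution 3】:

tika worked best for me. I don't think it would be wrong to say that it is better than PyPDF2 and pdfminer. It makes it very easy to extract each line of a PDF into a list. You can install it with pip install tika and then use the code below: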

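from tika import parser

path_to_pdf = "example.pdf"  # placeholder: replace with the path to your PDF
rawText = parser.from_file(path_to_pdf)  # dict holding the parsed 'content'
rawList = rawText['content'].splitlines()  # one list entry per line of text
print(rawList)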
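
One caveat, as far as I can tell: when tika finds no text (for example in an image-only PDF), the 'content' entry can be None, and calling splitlines() on it fails. A slightly more defensive sketch is below. The helper names `content_to_lines` and `pdf_lines_with_tika` are hypothetical, and it assumes tika is installed with a Java runtime available.

```python
def content_to_lines(parsed):
    """Return the text lines of a tika result dict, or [] if no text was extracted."""
    content = parsed.get("content") or ""  # treat a missing or None 'content' as empty
    return content.splitlines()


def pdf_lines_with_tika(path_to_pdf):
    """Parse a PDF with tika and return its text as a list of lines."""
    # Imported lazily: tika needs Java and talks to a Tika server at parse time.
    from tika import parser
    return content_to_lines(parser.from_file(path_to_pdf))
```

Calling `pdf_lines_with_tika("example.pdf")` (with your own path) then gives the same list as the code above, or an empty list instead of an exception.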

【Discussion】: