How can I read a PDF in Python? [duplicate]

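【Question title】: How can I read pdf in python? [duplicate]
【Posted】: 2018-01-29 09:49:22
【Question description】:

How can I read a PDF in Python? I know one way is to convert it to text first, but I want to read the content directly from the PDF.

Can anyone explain which Python module is best for PDF extraction?

【Question discussion】:

Tags: python python-2.7 pdf text-extraction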

【Solution 1】:

You can use the PyPDF2 package.

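# install PyPDF2 (run this in a shell, not in Python)
pip install PyPDF2

# import the required module
import PyPDF2

# open the PDF in binary mode; the with-block closes the file afterwards
with open('example.pdf', 'rb') as file:
    # create a PDF reader object
    # (older PyPDF2 releases name this class PdfFileReader)
    fileReader = PyPDF2.PdfReader(file)

    # print the number of pages in the PDF file
    # (older PyPDF2 releases expose this as fileReader.numPages)
    print(len(fileReader.pages))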

See the documentation: http://pythonhosted.org/PyPDF2/
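The listing above only counts pages. As a minimal sketch of getting the text itself, assuming a PyPDF2 release that provides `PdfReader` and `extract_text()` (older releases call the method `extractText()`), something like the following may work. The helper names `page_texts` and `read_pdf_text` are hypothetical and chosen only for this illustration.

```python
from typing import Iterable, List


def page_texts(pages: Iterable) -> List[str]:
    """Return the extracted text of each page, in order.

    Each page only needs an extract_text() method, which PyPDF2 page
    objects provide (extractText() in older releases).
    """
    # A page with no extractable text yields an empty string.
    return [page.extract_text() or "" for page in pages]


def read_pdf_text(path: str) -> List[str]:
    """Open the PDF at `path` and return one string per page."""
    import PyPDF2  # third-party: pip install PyPDF2

    with open(path, "rb") as handle:
        reader = PyPDF2.PdfReader(handle)
        # Extract while the file is still open, since pages are read lazily.
        return page_texts(reader.pages)
```

Calling `read_pdf_text('example.pdf')` would return a list with one string per page. If every string comes back empty, one possible cause is a PDF whose pages are scanned images with no text layer; text extraction alone cannot recover text from such files.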

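【Discussion】:

Is there a workaround for the "PyPDF2.utils.PdfReadError: EOF marker not found" error?
You have not really shown here how to get the actual text of the PDF. Your code only creates a reader object at 0x10d31f278.
PyPDF2, PyPDF3, and PyPDF4 are unmaintained. I recommend using pymupdf.
I tried using this package with an order from Amazon. It found 33 pages, but the extractText() API returned nothing for every page.
Yes, I have tested some PDFs, and the extractText() API skipped some of the text. It did not print all the text in the PDF.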
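For readers who want to try the pymupdf suggestion from the comments, here is a minimal sketch. It assumes PyMuPDF is installed (`pip install pymupdf`) and importable as `fitz`; the helper names `document_text` and `read_pdf_with_pymupdf` are hypothetical.

```python
from typing import Iterable


def document_text(document: Iterable) -> str:
    """Join the plain text of every page, separated by newlines.

    Each page only needs a get_text() method, which PyMuPDF page
    objects provide.
    """
    return "\n".join(page.get_text() for page in document)


def read_pdf_with_pymupdf(path: str) -> str:
    """Open the PDF at `path` with PyMuPDF and return all of its text."""
    import fitz  # PyMuPDF; third-party: pip install pymupdf

    # The document object is iterable over its pages and closes on exit.
    with fitz.open(path) as document:
        return document_text(document)
```

Calling `read_pdf_with_pymupdf('example.pdf')` would return the whole document as one string. Whether it recovers text that PyPDF2 skipped depends on the individual PDF.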

【Solution 2】:

Try PyPDF2.

There is a good tutorial here: https://automatetheboringstuff.com/chapter13/

【Discussion】:

【Solution 3】:

You can use the textract module in Python.

To install it:

pip install textract

To read a PDF:

import textract
# process() returns the extracted text as bytes; decode it to get a str
text = textract.process('path/to/pdf/file', method='pdfminer')

For details, see the Textract documentation.
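Because `textract.process` returns bytes, the result usually needs decoding before it can be handled as text. The sketch below assumes the output is UTF-8 encoded and that textract installs and runs in your environment; the helper names `decode_extracted` and `read_pdf_with_textract` are hypothetical.

```python
def decode_extracted(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode extractor output to str, replacing undecodable bytes."""
    return raw.decode(encoding, errors="replace")


def read_pdf_with_textract(path: str) -> str:
    """Extract text from the PDF at `path` using textract's pdfminer method."""
    import textract  # third-party: pip install textract

    raw = textract.process(path, method="pdfminer")
    # Assumption: the returned bytes are UTF-8 encoded.
    return decode_extracted(raw)
```

Using `errors="replace"` keeps the call from failing on a stray byte, at the cost of substituting a replacement character for it.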

【Discussion】:

As far as I know, textract is broken.
Textract also seems to be dead: github.com/deanmalmgren/textract/issues/350