Unable to convert PDF to text format

【Question title】: Unable To Convert PDF to Text format
【Posted】: 2019-04-13 19:20:00
【Question description】:

I get the error below when parsing a PDF file with PyPDF2. I have attached the PDF to be parsed, along with the error.

Can anyone help?

import PyPDF2


def convert(data):
    pdfName = data
    read_pdf = PyPDF2.PdfFileReader(pdfName)
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    print(page_content)
    return page_content

Error:

PyPDF2.utils.PdfReadError: Expected object ID (8 0) does not match actual (7 0); xref table not zero-indexed.

【Comments】:

Your file is a scanned document. You should use OCR to get the text out of it.
Can you send me a reference?

Tags: python python-3.x python-2.7 pdf-parsing

【Solution 1】:

There are several open-source tools for this, such as Tesseract for the OCR itself and OpenCV for preprocessing the images.

If you want to use Tesseract, for example, there is a Python wrapper library called pytesseract.

Most OCR tools work on images, so you first have to convert the PDF to an image file format such as PNG or JPG. After that, you can load the image and process it with pytesseract.
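The conversion step could look roughly like the sketch below. It assumes the third-party pdf2image package (which in turn needs Poppler installed); the answer itself does not name a conversion tool, so this is only one possible choice. The helper names `page_image_path` and `pdf_to_png` and the `_page1.png` naming scheme are made up for illustration.

```python
from pathlib import Path


def page_image_path(pdf_path, page_number, out_dir="."):
    """Build the PNG path for a 1-based page number (hypothetical naming scheme)."""
    return Path(out_dir) / f"{Path(pdf_path).stem}_page{page_number}.png"


def pdf_to_png(pdf_path, out_dir=".", dpi=300):
    """Render every page of a PDF to a PNG file and return the file paths."""
    # Assumption: pdf2image is installed and Poppler is available on the system.
    # Imported here so the module still loads when pdf2image is missing.
    from pdf2image import convert_from_path

    paths = []
    # convert_from_path returns one Pillow image per page.
    for number, image in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = page_image_path(pdf_path, number, out_dir)
        image.save(target, "PNG")
        paths.append(target)
    return paths
```

For a one-page file named pdfName.pdf this would write pdfName_page1.png into the output directory. A higher `dpi` usually helps OCR accuracy on scans, at the cost of larger images.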

Here is some sample code showing how to use pytesseract, assuming you have already converted the PDF to an image named pdfName.png:

from PIL import Image
import pytesseract


def ocr_core(filename):
    """
    This function will handle the core OCR processing of images.
    """
    # Open the image with Pillow's Image class, then let pytesseract
    # detect the text in it.
    text = pytesseract.image_to_string(Image.open(filename))
    return text

print(ocr_core('pdfName.png'))
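If the PDF has more than one page, each page image has to be run through OCR separately. A minimal sketch of that loop follows, assuming the page images already exist on disk. The function name `ocr_pages` and its `ocr` parameter are hypothetical and not part of pytesseract; the parameter only makes it possible to swap in a different OCR function.

```python
def ocr_pages(image_paths, ocr=None):
    """Run OCR on each page image and join the results in page order."""
    if ocr is None:
        # Default backend: needs Pillow, pytesseract, and the Tesseract binary.
        from PIL import Image
        import pytesseract

        def ocr(path):
            return pytesseract.image_to_string(Image.open(path))

    # Separate pages with a form feed so page boundaries stay recoverable.
    return "\f".join(ocr(path).strip() for path in image_paths)
```

Calling `ocr_pages(['page1.png', 'page2.png'])` with two existing image files would return the text of both pages as a single string.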

【Discussion】: