Extracting scanned pages from a PDF using Python: answers

【Question title】: Extracting scanned pages from PDF using python
【Posted】: 2018-11-05 18:08:52
【Question description】:

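I have many PDF files that are basically scanned documents, so every page is a scanned image. I want to run OCR and extract the text from these files. I tried pytesseract, but it does not perform OCR directly on PDF files, so I want to extract the images from the PDFs, save them in a directory, and then run OCR on those images with pytesseract. Is there a way in Python to extract the scanned images from a PDF file? Or is there a way to run OCR directly on a PDF file?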

【Question discussion】:

Tags: python pdf

【Solution 1】:

This question has already been answered in earlier Stack Overflow posts:

Converting PDF to images automatically
Converting a PDF to a series of images with Python

Here is a script that may help: https://nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html

Another approach: https://www.daniweb.com/programming/software-development/threads/427722/convert-pdf-to-image-with-pythonmagick

Please check earlier posts before asking.

Edit:

Including a working script for future reference. It runs on Python 3.6 on Windows:

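# coding=utf-8
# Extract jpg's from pdf's. Quick and dirty.
# Only JPEG-encoded image streams are found; other encodings are skipped.

# Read the whole PDF into memory as raw bytes.
with open("Link/To/PDF/File.pdf", "rb") as file:
    pdf = file.read()

startmark = b"\xff\xd8"  # JPEG start-of-image marker
startfix = 0
endmark = b"\xff\xd9"  # JPEG end-of-image marker
endfix = 2  # include the two marker bytes in the slice
i = 0

njpg = 0
while True:
    # Locate the next PDF stream object.
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    # A JPEG stream starts with the marker right after the "stream" keyword.
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        # Not a JPEG stream; keep scanning.
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    # The end-of-image marker sits just before "endstream".
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    # Write each image to the current directory as jpg0.jpg, jpg1.jpg, ...
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend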
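
Once the script has written jpg0.jpg, jpg1.jpg, ... the remaining step from the question is running OCR on them. The following is only a minimal sketch of that step, not part of the original script. It assumes pytesseract and Pillow are installed and that the Tesseract binary is on the PATH. The names ocr_extracted_pages and _tesseract_ocr are hypothetical helpers made up for this example; pytesseract.image_to_string and PIL.Image.open are the library calls it relies on.

```python
import re
from pathlib import Path

# Matches the file names written by the script above: jpg0.jpg, jpg1.jpg, ...
_JPG_NAME = re.compile(r"jpg(\d+)\.jpg")


def _tesseract_ocr(path):
    """Default OCR backend (assumption: pytesseract and Pillow are installed)."""
    # Imported lazily so the rest of the module works without these packages.
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img)


def ocr_extracted_pages(image_dir=".", ocr=_tesseract_ocr):
    """Return [(file name, text)] for every jpgN.jpg in image_dir.

    Files are processed in numeric order (jpg2 before jpg10), which is the
    order in which the extraction script found them in the PDF.
    """
    numbered = []
    for path in Path(image_dir).iterdir():
        match = _JPG_NAME.fullmatch(path.name)
        if match:
            numbered.append((int(match.group(1)), path))
    return [(path.name, ocr(path)) for _, path in sorted(numbered)]


# Example call (hypothetical, run from the directory holding the images):
#     for name, text in ocr_extracted_pages("."):
#         print(name, text[:80])
```

The ocr parameter is there so that a different OCR function can be swapped in without touching the file handling. Files that do not match the jpgN.jpg pattern are ignored.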

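【Discussion】:

I couldn't find any approach that works with Python 3.6. I'm using Anaconda on Windows.
I just ran the code from the comments section of the sample script I linked to (nedbatchelder.com/blog/200712/extracting_jpgs_from_pdfs.html). I was able to get it running on a Windows machine with Python 3.6. Let me know if you still have problems.
Thanks for your effort. Yes, this one works fine. Updated.
Sweet! Glad I could help!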