Pass a directory of PDF files for performing OCR and generate a .txt file for each converted file in Python

【Question title】: Pass a directory of pdf files for performing OCR and generate .txt files for each converted file in Python
【Posted】: 2019-10-16 01:51:18
【Question】:

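I have a directory containing PDF files. I have written code that performs OCR when you pass a file name to an Image object from wand.image. What I want to do now is loop over the directory of PDF files, generate an OCR'd .txt file for each PDF, and save it in some directory. The code I have written so far is as follows: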

import io
from PIL import Image
import pytesseract
from wand.image import Image as wi

# Rasterize the PDF at 300 DPI; each page becomes one frame in the sequence
pdf = wi(filename=r"D:\files\aba7d525-04b8-4474-a40d-e94f9656ed42.pdf", resolution=300)

pdfImg = pdf.convert('jpeg')

# Encode every page as an in-memory JPEG blob
imgBlobs = []

for img in pdfImg.sequence:
    page = wi(image=img)
    imgBlobs.append(page.make_blob('jpeg'))

# Run Tesseract on each page; extracted_text holds one string per page
extracted_text = []

for imgBlob in imgBlobs:
    im = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(im, lang='eng')
    extracted_text.append(text)

# Print the text of the first page only
print(extracted_text[0])

Any suggestions on how to generate .txt files from the OCR'd PDFs?

【Comments】:

Tags: python loops pdf file-handling python-tesseract

【Solution 1】:

Try this at the end of your code:

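with open('filename.txt', 'w', encoding='utf-8') as result:
    # extracted_text holds one string per page; write() accepts a single
    # string argument, so the newline is concatenated rather than passed separately
    for line in extracted_text:
        result.write(line + '\n')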

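【Discussion】:

The problem is that, as you can see in my code ("pdf = .."), I have hardcoded a single file name, but I need to pass a directory there so that every file in that directory gets OCR'd. I also need each output file to keep the name of its source file, with only .pdf replaced by .txt. How can I do that?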
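
One possible way to handle the directory case raised in the comment above is sketched below. It is not from the original thread. It wraps the question's wand + pytesseract steps in a function and loops over the directory with pathlib. The names `ocr_pdf`, `ocr_directory`, and the `extract` parameter are hypothetical, introduced only for illustration, and the sketch assumes that only files ending in lowercase `.pdf` directly inside the input directory should be processed.

```python
import io
from pathlib import Path


def ocr_pdf(pdf_path, resolution=300, lang="eng"):
    """Return a list with the OCR text of each page of one PDF.

    Same steps as the code in the question, wrapped in a function.
    """
    # Imported here so the directory loop below can be tried out
    # without Wand, Pillow, or pytesseract installed.
    from PIL import Image
    import pytesseract
    from wand.image import Image as wi

    pages = []
    with wi(filename=str(pdf_path), resolution=resolution) as pdf:
        with pdf.convert("jpeg") as pdf_img:
            for img in pdf_img.sequence:
                with wi(image=img) as page:
                    blob = page.make_blob("jpeg")
                im = Image.open(io.BytesIO(blob))
                pages.append(pytesseract.image_to_string(im, lang=lang))
    return pages


def ocr_directory(input_dir, output_dir, extract=ocr_pdf):
    """Write one .txt file per PDF in input_dir; return the paths written.

    `extract` is a hypothetical hook: any callable that takes a PDF path
    and returns a list of page strings. It defaults to ocr_pdf above.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    written = []
    # Assumption: non-recursive, and only a lowercase ".pdf" suffix matches
    # on case-sensitive file systems.
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        # Same base name as the source file, with .pdf replaced by .txt
        txt_path = output_dir / (pdf_path.stem + ".txt")
        txt_path.write_text("\n".join(extract(pdf_path)), encoding="utf-8")
        written.append(txt_path)
    return written
```

With the directory from the question, a call might look like `ocr_directory(r"D:\files", r"D:\files\txt")`, where the output folder is an assumed name. Keeping the OCR step behind the `extract` argument means the file-naming loop can be checked with a stand-in function before running Tesseract on real documents.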