Using Tesseract OCR to extract text from a folder of scanned PDFs: answers

【Question title】: Use Tesseract OCR to extract text from a folder of scanned PDFs
【Posted】: 2020-09-20 20:51:15
【Question description】:

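I have code that uses Tesseract OCR to extract text from a single PDF file, scanned or ordinary. I would like it to process a whole folder of PDFs instead of one file and store the extracted text files in a folder of my choice.

Please see my code below: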

from pdf2image import convert_from_path
from PIL import Image
import pytesseract

filePath = '/Users/CodingStark/scanned/scanned-file.pdf'
# Render every page of the PDF as a PIL image at 500 DPI
pages = convert_from_path(filePath, 500)

image_counter = 1

# Save each page as page_1.jpg, page_2.jpg, ...
for page in pages:
    filename = "page_" + str(image_counter) + ".jpg"
    page.save(filename, 'JPEG')
    image_counter = image_counter + 1

filelimit = image_counter - 1

# Text file that collects the OCR output (opened in append mode)
outfile = "scanned-file.txt"
f = open(outfile, "a")

# Iterate from 1 to the total number of pages
for i in range(1, filelimit + 1):
    filename = "page_" + str(i) + ".jpg"

    # Recognize the text in the image as a string using pytesseract
    text = str(pytesseract.image_to_string(Image.open(filename)))

    # Rejoin words that were hyphenated across a line break
    text = text.replace('-\n', '')

    f.write(text)

# Close the file after writing all the text
f.close()

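I want to automate my code so that it converts every PDF file in the scanned folder and puts the extracted text files in a folder of my choice. Also, is there a way to delete all the JPG files after the code runs? They take up a lot of memory space. Thank you very much!!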

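【Comments on the question】:

You need to write a shell script in bash or similar to do this, or write a program in Python or Go. I did this with Go and Tesseract OCR on one project. The JPGs don't take up "memory space"; they consume storage. You can delete them once the job is done.
@gorlok Thanks, I'll give it a try!
Hi. Did you end up writing it? If so, could you share it?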

Tags: python pdf text tesseract python-tesseract

【Solution 1】:

Here is a loop that reads the PDF files from a path:

import glob
import os

pdf_dir = "dir"
# glob patterns are case-sensitive on Linux and macOS; use "*.PDF" if your
# files carry an upper-case extension.
for pdf_file in glob.glob(os.path.join(pdf_dir, "*.pdf")):
    # put here what you want to do for each pdf file
    print(pdf_file)
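
To tie the loop to the OCR code from the question, here is a minimal sketch of one possible way to process a whole folder. It assumes the question's `convert_from_path` comes from `pdf2image`. The helper names (`find_pdfs`, `text_path_for`, `ocr_pdf`, `ocr_folder`) are hypothetical and not part of any library. Page images go into a temporary folder that is removed automatically, so no JPG files are left behind.

```python
import tempfile
from pathlib import Path


def find_pdfs(pdf_dir):
    """Return the PDFs directly inside pdf_dir, matching .pdf in any letter case."""
    return sorted(p for p in Path(pdf_dir).iterdir()
                  if p.is_file() and p.suffix.lower() == ".pdf")


def text_path_for(pdf_path, out_dir):
    """Map e.g. scanned-file.pdf to <out_dir>/scanned-file.txt."""
    return Path(out_dir) / (Path(pdf_path).stem + ".txt")


def ocr_pdf(pdf_path, dpi=500):
    """OCR one PDF and return its text, leaving no page images behind."""
    # Third-party imports are kept local so the path helpers above can be
    # used (and tested) without the OCR stack installed.
    from pdf2image import convert_from_path
    import pytesseract

    # pdf2image writes the page images into tmp; the folder and its JPGs
    # are deleted when the with block exits.
    with tempfile.TemporaryDirectory() as tmp:
        pages = convert_from_path(str(pdf_path), dpi,
                                  output_folder=tmp, fmt="jpeg")
        # pytesseract accepts PIL images directly, so there is no need to
        # save and reopen each page by hand.
        text = "".join(pytesseract.image_to_string(page) for page in pages)

    # Same clean-up as in the question: rejoin words hyphenated at line ends.
    return text.replace("-\n", "")


def ocr_folder(pdf_dir, out_dir, dpi=500):
    """OCR every PDF in pdf_dir and write one .txt per PDF into out_dir."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in find_pdfs(pdf_dir):
        target = text_path_for(pdf_path, out_dir)
        target.write_text(ocr_pdf(pdf_path, dpi), encoding="utf-8")
        written.append(target)
    return written
```

A call such as `ocr_folder('/Users/CodingStark/scanned', 'extracted-text')` would then write one text file per PDF; the output folder name here is only an example. A lower `dpi` (for instance 300) usually runs faster and needs less disk space, at some possible cost in recognition quality.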

【Comments】: