Error while performing OCR using pytesseract

【Question title】: Error while performing OCR using pytesseract
【Posted】: 2020-06-11 14:08:34
【Question】:

I want to use pytesseract. Here is my code.

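import pytesseract
from pdf2image import convert_from_path

PDF_file = 'file.pdf'
# Render the PDF pages as images at 500 DPI (this step needs Poppler)
pages = convert_from_path(PDF_file, 500)
# Run OCR on the first page; image_to_string already returns a str
pageText = pytesseract.image_to_string(pages[0])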

As a result I get this error:

Traceback (most recent call last):
  File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 409, in pdfinfo_from_path
    proc = Popen(command, env=env, stdout=PIPE, stderr=PIPE)
  File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 854, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\subprocess.py", line 1307, in _execute_child
    hp, ht, pid, tid = _winapi.CreateProcess(executable, args,
FileNotFoundError: [WinError 2] The system cannot find the file specified

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\user\Desktop\projects\pdfparser\pdftest.py", line 13, in <module>
    pages = convert_from_path(PDF_file, 500)
  File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 89, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "C:\Users\user\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pdf2image\pdf2image.py", line 430, in pdfinfo_from_path
    raise PDFInfoNotInstalledError(
pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

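【Comments】:

The system cannot find file.pdf. Is it in the same directory you launch the script from?
Is poppler installed and in the PATH?
Yes, it is in the same directory.
But is poppler in the PATH?
@NicolasGervais No. I added it and now it works, thanks.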

Tags: python python-3.x ocr python-tesseract

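【Solution 1】:

As several comments have already pointed out, the error message

pdf2image.exceptions.PDFInfoNotInstalledError: Unable to get page count. Is poppler installed and in PATH?

tells you exactly what went wrong: Poppler is not installed, or at least not reachable through PATH. See the README for help with that.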

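You see, pdf2image is just a wrapper around Poppler's command-line utilities, mainly pdftoppm. The FileNotFoundError in the first traceback comes from Windows failing to launch the Poppler executable behind pdfinfo_from_path; it does not mean that file.pdf is missing. On Linux Poppler is usually installed by default, so you don't need to bother with it, but on Windows it is not.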
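To check this before calling pdf2image, you can look for the Poppler executables yourself. The following is only a minimal sketch using the standard library: the helper name missing_poppler_tools and its poppler_dir parameter are made up for illustration and are not part of pdf2image, and it assumes the two tools involved are pdfinfo (page count) and pdftoppm (rendering).

```python
import os
import shutil


def missing_poppler_tools(poppler_dir=None):
    """Return the names of the Poppler executables that cannot be found.

    poppler_dir is a hypothetical extra directory to search first, for
    example the bin folder of an unpacked Poppler build. When it is None,
    only the directories listed in PATH are searched.
    """
    search_path = os.environ.get("PATH", "")
    if poppler_dir:
        search_path = poppler_dir + os.pathsep + search_path
    # pdfinfo reports the page count, pdftoppm renders the pages
    tools = ("pdfinfo", "pdftoppm")
    return [tool for tool in tools
            if shutil.which(tool, path=search_path) is None]


missing = missing_poppler_tools()
if missing:
    print("Poppler tools not found:", ", ".join(missing))
else:
    print("Poppler looks reachable from PATH")
```

If the tools are reported missing, either add Poppler's bin folder to PATH, as the asker did, or pass that folder to convert_from_path through its poppler_path argument, the same keyword that is forwarded to pdfinfo_from_path in the traceback.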
