Text recognition of an image with Tesseract

[Question title]: Text recognition of an image with Tesseract
[Posted]: 2021-11-09 16:00:00
[Question]:

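I want to create a PDF file with recognized text from a scanned image.

However, I do not want the original image in the PDF file, only the text. The text should be visible so that it can be read; the font does not matter.

This Tesseract command does almost what I want, except that the text is invisible: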

tesseract -c textonly_pdf=1 test.tif test pdf

How can I make the text visible?
Is there another command-line or Python tool I could use to create the PDF file?

I am running Tesseract on Ubuntu.

[Comments]:

Tags: linux ubuntu pdf tesseract text-recognition

[Answer 1]:

Here is a snippet from a script I wrote in Python (on Windows) a year ago. It extracts the text into a data frame, which you can then save as CSV or another format.

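import cv2
import pytesseract as pya
from pytesseract import Output

# Path to the Tesseract executable (only needed if it is not on PATH)
pya.pytesseract.tesseract_cmd = r'D:\Programs\Tesseract_OCR\tesseract.exe'

imgcv = cv2.imread('foo.jpg')
# text_df holds one row per detected element: text, confidence, position and so on
text_df = pya.image_to_data(imgcv, output_type=Output.DATAFRAME)
# Keep only words recognized with a confidence above 50
# (this also drops the structural rows, whose conf is -1)
text_df = text_df[text_df.conf > 50]
# Mean confidence of the words that were kept
conf = text_df['conf'].mean()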
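
If plain readable text is the goal, the word-level table can be folded back into lines before it is written anywhere. The following is only a rough sketch of that step, not part of the original script: the helper name words_to_lines and the sample dictionary are made up for illustration, and the sample merely mimics the column names (page_num, block_num, par_num, line_num, word_num, conf, text) that pytesseract's dictionary output uses, so it runs without Tesseract installed.

```python
from collections import defaultdict


def words_to_lines(data, min_conf=50):
    """Rebuild text lines from a word-level OCR table (hypothetical helper).

    `data` is assumed to be a dict of equal-length lists keyed like
    pytesseract's image_to_data(..., output_type=Output.DICT) result.
    """
    lines = defaultdict(list)
    for i, word in enumerate(data["text"]):
        # conf may arrive as int, float or str depending on the version
        conf = float(data["conf"][i])
        if conf <= min_conf or not str(word).strip():
            continue  # skip low-confidence words and empty structural rows
        key = (
            data["page_num"][i],
            data["block_num"][i],
            data["par_num"][i],
            data["line_num"][i],
        )
        lines[key].append((data["word_num"][i], str(word)))
    # Order lines by their position key and words by word_num within a line
    return [
        " ".join(w for _, w in sorted(words))
        for _, words in sorted(lines.items())
    ]


# Made-up sample data in the assumed layout
sample = {
    "page_num": [1, 1, 1, 1, 1],
    "block_num": [1, 1, 1, 1, 1],
    "par_num": [1, 1, 1, 1, 1],
    "line_num": [0, 1, 1, 2, 2],
    "word_num": [0, 1, 2, 1, 2],
    "conf": [-1, 96, 91, 30, 88],
    "text": ["", "Hello", "world", "n0ise", "again"],
}

for line in words_to_lines(sample):
    print(line)
```

With real input, the dictionary would come from the image_to_data call instead of the sample, and the resulting lines could be written to a text file or handed to whatever PDF library you prefer.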

[Comments]: