Extract text from multiple images to CSV using OCR

【Question title】: Extract text from multiple images to CSV using OCR
【Posted】: 2021-02-28 14:16:34
【Question description】:

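I want to extract the text from several thousand images and put it into a CSV file. Can anyone tell me how to do this? The images are saved on my desktop.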

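【Question comments】:

Welcome to StackOverflow. Please take the tour to learn what is on-topic and what is not, and see how to ask questions! Thanks :)
@alexzander Why not start here? tesseract-ocr.github.io/tessdoc/…
Excuse me? Is there something wrong with my code?
@alexzander No, sorry for the misunderstanding; that was a reply to your comment on my comment. The OP (a member since today) has shown nothing at all of what they have tried. That is why I pointed out the rules.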

Tags: python image ocr screen-scraping tesseract

【Solution 1】:

Sure.

Install the pytesseract module with this command:

pip install pytesseract

Install the tesseract engine executable from one of these URLs:

tesseract cmd 32 bit

or

tesseract cmd 64 bit
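The script further down needs the full path to the installed executable. As a minimal sketch, assuming the installer may or may not have added tesseract to your PATH, the standard library can look it up for you. The helper name `find_tesseract` and its `fallback` parameter are hypothetical and only used for illustration.

```python
import shutil


def find_tesseract(fallback=None):
    """Return the tesseract executable found on PATH, or `fallback` if none is found."""
    found = shutil.which("tesseract")
    return found if found is not None else fallback


# Example: prints the detected path, or None when tesseract is not on PATH.
print(find_tesseract())
```

If this prints None, tesseract is not on your PATH and you would have to supply the install location yourself.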

Create a Python script named images_to_csv.py and paste in this code:

import pytesseract
from PIL import Image # pip install Pillow

# set tesseract_cmd to the path of your tesseract engine executable
# (wherever you installed tesseract from the URLs above)

# IMPORTANT: this should end with '...\tesseract.exe'
pytesseract.pytesseract.tesseract_cmd = r"<path_to_your_tesseract_cmd>"

# and start doing it

# your saved images on the desktop
list_with_many_images = [
    "path1",
    "path2",
    # ...
    "pathN",
]

# create a function that returns the text
def image_to_str(path):
    """ return a string from image """
    return pytesseract.image_to_string(Image.open(path))

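# now the actual work: OCR each image and write one CSV line per image
# NOTE: nothing is quoted here, so OCR text that contains commas or
# line breaks will spill into extra columns and rows
with open("images_content.csv", "w", encoding="utf-8") as file:
    file.write("ImagePath, ImageText\n")
    for image_path in list_with_many_images:
        text = image_to_str(image_path)
        line = f"{image_path}, {text}\n"
        file.write(line)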

This is only a starting point.
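With thousands of images, typing every path into list_with_many_images by hand is not practical. A minimal sketch of building that list from a folder instead, using only the standard library. The helper name `collect_image_paths` and the set of extensions are assumptions for illustration; adjust them to the formats you actually have.

```python
from pathlib import Path

# Assumed set of image extensions; extend or trim as needed.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def collect_image_paths(folder, extensions=IMAGE_EXTENSIONS):
    """Return a sorted list of image file paths (as strings) directly inside `folder`."""
    folder = Path(folder)
    return sorted(
        str(path)
        for path in folder.iterdir()
        if path.is_file() and path.suffix.lower() in extensions
    )


# Usage, assuming the images sit directly in a "Desktop" folder under your home directory:
# list_with_many_images = collect_image_paths(Path.home() / "Desktop")
```

This only looks at files directly inside the folder, not in subfolders, and matches extensions case-insensitively.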

If you would rather use the csv module, go ahead; it takes care of quoting commas and line breaks in the OCR text.
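A minimal sketch of the csv-module route, not the answerer's own code. The helper name `write_images_csv` is hypothetical, and the OCR function is passed in as a parameter so that the `image_to_str` function from the script above can be plugged in unchanged.

```python
import csv


def write_images_csv(image_paths, csv_path, ocr_func):
    """Write one (ImagePath, ImageText) row per image, letting csv quote fields as needed."""
    # newline="" lets the csv module control line endings itself.
    with open(csv_path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["ImagePath", "ImageText"])
        for image_path in image_paths:
            text = ocr_func(image_path)
            # Strip leading/trailing whitespace; inner line breaks are kept and quoted.
            writer.writerow([image_path, text.strip()])


# Usage with the names from the script above:
# write_images_csv(list_with_many_images, "images_content.csv", image_to_str)
```

Because each field is quoted when necessary, multi-line OCR text stays in a single cell when the file is read back with csv.reader or opened in a spreadsheet.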

Enjoy.

【Comments】: