Python Tesseract Cyrillic characters problem: answers

【Question title】: Python Tesseract Cyrillic characters problem
【Posted】: 2020-08-14 18:41:29
【Question description】:

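I'm trying to write a script that uses tesseract to highlight specific words in an image. My approach works for most languages, but not for languages written in non-Latin scripts, such as Russian (Cyrillic) or Greek.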

For example, using this image, when I extract the text with tesseract's image_to_string, it prints correctly (see below).

Extracted using image_to_string

But when I process the image through tesseract's data["text"] to highlight the words I want, the text I get back contains no Cyrillic characters (see below).

Example 1 data["text"]

Example 2 data["text"]

I know tesseract already encodes the characters. I tried encoding them once more, but got the same result. Maybe my approach is wrong?

Here is my code:

import cv2
import pytesseract

# Point pytesseract at the Tesseract executable (Windows install path)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

image = cv2.imread("test_russian.png")

target_word = ["длинной"]

# Process image: morph and invert
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 1))
processed = cv2.morphologyEx(gray_image, cv2.MORPH_OPEN, kernel, iterations=1)
inverted = 255 - processed

# Extract text
words_string = pytesseract.image_to_string(inverted, lang='rus', config='--psm 6')
print(f"Text extracted using image_to_string: \n {words_string}")

# Copy image to get data
image_copy = image.copy()
data = pytesseract.image_to_data(inverted, output_type=pytesseract.Output.DICT)

# Search for word
for word in target_word:
    print(f"\n from target word {word} and lowered {word.lower()} \n")
# Indices of OCR tokens whose lowercase form is one of the target words
word_occurences = [i for i, text in enumerate(data["text"]) if text.lower() in target_word]

print("Text from data['text']: ")
for i, word in enumerate(data["text"]):
    print(f"I : {i} and word: {word}")

for occ in word_occurences:
    print(f"Occ: {occ}")
    w = data["width"][occ]
    h = data["height"][occ]
    l = data["left"][occ]
    t = data["top"][occ]
    p1 = (l + w, t + h)
    p2 = (l, t + h)

    image_copy = cv2.line(image_copy, p1, p2, color=(0, 60, 255), thickness=2)

# Resize images
image_copy = cv2.resize(image_copy, (920, 640))
gray_image = cv2.resize(gray_image, (920, 640))
inverted_image = cv2.resize(inverted, (920, 640))

# Show and save image
cv2.imshow("processed and inverted", inverted_image)
cv2.imshow("gray", gray_image)
cv2.imshow("identified text", image_copy)

cv2.imwrite("identified_text.png", image_copy)

cv2.waitKey(0)

【Question discussion】:

Tags: python tesseract python-tesseract

【Solution 1】:

I ran into this problem with non-Latin text. I don't know exactly why this fixes it, but I changed my procedure to:

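1. First OCR the file using a language different from the file's actual language. For example, process a file containing Greek text as if it contained Russian text.
2. Repeat the same process with the original language.

In other words, running OCR on the file with a language other than the original one, and then switching to the original language, solved the problem for me. My code is as follows: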

def FileExtrationOCR(Rawpdffile="Rawpdf.pdf",
                     Rawimagefile="Rawimage.jpg",
                     pagenotoextract=0,
                     OCRfile="Searchable.pdf",
                     lang=None):
    ...

    # Searchable PDF creation section
    pdf = pytesseract.image_to_pdf_or_hocr(Rawimagefile, extension='pdf', lang=lang)
    pdf_dict = pytesseract.image_to_data(Rawimagefile, output_type=pytesseract.Output.DICT, lang=lang)
    pdf_bytes = pytesseract.image_to_data(Rawimagefile, output_type=pytesseract.Output.BYTES, lang=lang)
    
    ....
   
a = FileExtrationOCR(Rawpdffile="Arm.pdf", pagenotoextract=0, lang="rus")

The first time, I created "a" without passing the "lang" argument; then I passed it as shown above.
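
The `lang` argument in this answer lines up with a detail in the question's code: there, `image_to_string` receives `lang='rus'`, while `image_to_data` is called without it. The following is a minimal sketch under the assumption (not confirmed by the asker) that the missing `lang` is what produces the non-Cyrillic tokens. The helper names `find_word_boxes` and `ocr_word_boxes` and the sample dictionary are hypothetical and used only for illustration; `image_to_data`, its `lang` and `config` parameters, and `Output.DICT` are the same calls already used above.

```python
def find_word_boxes(data, targets):
    """Return (left, top, width, height) for every OCR token matching a target word.

    `data` is a dict shaped like the result of
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    """
    wanted = {t.lower() for t in targets}
    boxes = []
    for i, text in enumerate(data["text"]):
        # Tokens may be empty strings (block/line entries), so strip before comparing
        if text.strip().lower() in wanted:
            boxes.append((data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes


def ocr_word_boxes(image, targets, lang="rus", config="--psm 6"):
    """Run Tesseract with an explicit language, then locate the target words."""
    import pytesseract  # imported lazily so find_word_boxes works without Tesseract

    data = pytesseract.image_to_data(
        image, lang=lang, config=config,
        output_type=pytesseract.Output.DICT,
    )
    return find_word_boxes(data, targets)


# Hand-made sample standing in for real OCR output (illustrative values only)
sample = {
    "text": ["", "Очень", "длинной", "строки"],
    "left": [0, 10, 80, 170],
    "top": [0, 12, 12, 12],
    "width": [300, 60, 80, 70],
    "height": [40, 20, 20, 20],
}
print(find_word_boxes(sample, ["длинной"]))  # [(80, 12, 80, 20)]
```

With a real image, a call such as `ocr_word_boxes(inverted, target_word)` would stand in for the `image_to_data` call and the list comprehension in the question. Whether this alone fixes the output also depends on the `rus` language data being installed for Tesseract.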

【Discussion】:

【Solution 2】:

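This is most likely an encoding issue. Have you tried writing the text to a file? You can save the file with a different encoding: write the text to a .txt file and choose to save it as UTF-8, which usually displays Cyrillic characters correctly. If that doesn't help, try an online converter that supports a wider range of encodings and see what it gives you.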
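
A minimal sketch of the "save as UTF-8" suggestion, using only the standard library. The helper name `save_text_utf8`, the file name `ocr_output.txt`, and the sample string are placeholders for illustration; in the question's script the text would come from `image_to_string`. The encoding is passed explicitly because the default used by Python when writing files depends on the platform and is not always UTF-8.

```python
import tempfile
from pathlib import Path


def save_text_utf8(text, path):
    """Write `text` to `path` as UTF-8 and return what reads back from disk."""
    target = Path(path)
    target.write_text(text, encoding="utf-8")
    return target.read_text(encoding="utf-8")


# Stand-in for the string returned by image_to_string
words_string = "Пример длинной строки"

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "ocr_output.txt"
    round_trip = save_text_utf8(words_string, out_path)
    raw_bytes = out_path.read_bytes()  # the UTF-8 bytes actually stored on disk

print(round_trip == words_string)  # True
```

If the file shows Cyrillic correctly but `data["text"]` still does not, the text was probably recognized as Latin characters in the first place, and changing the encoding will not recover it.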

【Discussion】: