【Posted】: 2020-08-14 18:41:29
【Question】:
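I'm trying to write a script that uses tesseract to highlight specific words in an image. My approach works for most languages, except those written in non-Latin scripts such as Cyrillic (Russian) or Greek.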
For example, using this image, the text prints correctly when I extract it with tesseract's image_to_string (see below):
Extracted using image_to_string
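But when I process the image with image_to_data and read data["text"] to highlight the target text, the text I get back contains no Cyrillic characters (see below).
I know tesseract already encodes the characters; I tried encoding them again but got the same result. Maybe my approach is wrong?
Here is my code: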
import cv2
import pytesseract

# Point pytesseract at the Tesseract binary (Windows install path)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
image = cv2.imread("test_russian.png")
target_word = ["длинной"]
# Process image: morph and invert
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 1))
processed = cv2.morphologyEx(gray_image, cv2.MORPH_OPEN, kernel, iterations=1)
inverted = 255 - processed
# Extract text
words_string = pytesseract.image_to_string(inverted, lang='rus', config='--psm 6')
print(f"Text extracted using image_to_string: \n {words_string}")
# Copy image to get data
image_copy = image.copy()
data = pytesseract.image_to_data(inverted, output_type=pytesseract.Output.DICT)
# Search for word
for word in target_word:
    print(f"\n from target word {word} and lowered {word.lower()} \n")

# Indices of recognized words that match one of the (lowercase) target words
word_occurrences = [i for i, text in enumerate(data["text"]) if text.lower() in target_word]

print("Text from data['text']: ")
for i, word in enumerate(data["text"]):
    print(f"I : {i} and word: {word}")

# Underline each occurrence along the bottom edge of its bounding box
for occ in word_occurrences:
    print(f"Occ: {occ}")
    w = data["width"][occ]
    h = data["height"][occ]
    l = data["left"][occ]
    t = data["top"][occ]
    p1 = (l + w, t + h)
    p2 = (l, t + h)
    image_copy = cv2.line(image_copy, p1, p2, color=(0, 60, 255), thickness=2)
# Resize images
image_copy = cv2.resize(image_copy, (920, 640))
gray_image = cv2.resize(gray_image, (920, 640))
inverted_image = cv2.resize(inverted, (920, 640))
# Show and save image
cv2.imshow("processed and inverted", inverted_image)
cv2.imshow("gray", gray_image)
cv2.imshow("identified text", image_copy)
cv2.imwrite("identified_text.png", image_copy)
cv2.waitKey(0)
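One thing that may be worth checking in the script above: image_to_string is called with lang='rus', while image_to_data is called without a lang argument. pytesseract's image_to_data accepts the same lang and config parameters, and when lang is omitted tesseract falls back to its default language (normally English), which would likely explain Latin-only output in data["text"]. The sketch below isolates the matching step so it can be checked without running OCR. The helper names find_word_boxes and underline_points and the sample dictionary are hypothetical; the dictionary only mimics the shape of the Output.DICT result.

```python
def find_word_boxes(data, targets):
    """Return (index, left, top, width, height) for each OCR word matching a target.

    `data` is assumed to have the shape of
    pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT):
    parallel lists under "text", "left", "top", "width" and "height".
    """
    # casefold() gives case-insensitive matching that also works for Cyrillic
    wanted = {t.casefold() for t in targets}
    boxes = []
    for i, text in enumerate(data["text"]):
        if text.strip().casefold() in wanted:
            boxes.append((i, data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes


def underline_points(left, top, width, height):
    """End points of a line drawn along the bottom edge of a word box."""
    return (left, top + height), (left + width, top + height)


# Hypothetical sample (not real OCR output) mimicking image_to_data for one line
sample = {
    "text": ["", "Очень", "длинной", "строки"],
    "left": [0, 10, 80, 170],
    "top": [0, 12, 12, 12],
    "width": [0, 60, 80, 70],
    "height": [0, 20, 20, 20],
}

boxes = find_word_boxes(sample, ["Длинной"])
print(boxes)
for _, l, t, w, h in boxes:
    print(underline_points(l, t, w, h))
```

In the original script, the corresponding change would be to call pytesseract.image_to_data(inverted, lang='rus', config='--psm 6', output_type=pytesseract.Output.DICT), so that both calls use the same language model and page segmentation mode.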
【Comments】:
Tags: python tesseract python-tesseract