Image to text - remove non-ASCII chars in Python 2.7 (answers)

【Question title】: image to text - remove non-ascii chars in python 2.7
【Posted】: 2014-07-24 15:43:53
【Question description】:

I am using pytesser to OCR small images and get strings from them:

from PIL import Image
from pytesser import image_to_string

image = Image.open(ImagePath)
text = image_to_string(image)
print text

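However, pytesser sometimes recognizes and returns non-ASCII characters. The problem appears when I then try to print what was just recognized: in Python 2.7 (which is what I am using), the program crashes.

Is there a way to make pytesser not return any non-ASCII characters? Perhaps something can be changed in Tesseract OCR itself?

Alternatively, is there a way to test a string for non-ASCII characters (without crashing the program) and then simply not print that line?

Some people will suggest using Python 3.4, but from my research pytesser does not seem to work with it: Pytesser in Python 3.4: name 'image_to_string' is not defined?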

【Question discussion】:

Tags: python image-processing ocr tesseract python-tesseract

【Solution 1】:

I would go with Unidecode. The library converts non-ASCII characters to their closest ASCII representation.

from unidecode import unidecode  # import the function, not just the module

image = Image.open(ImagePath)
text = image_to_string(image)
print unidecode(text)

It should work perfectly!
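To illustrate the "closest ASCII representation" idea without installing anything, here is a rough standard-library sketch. It is not Unidecode and covers far fewer characters: it relies on Unicode NFKD decomposition, so it only handles characters that decompose into an ASCII base (accented Latin letters, ligatures) and silently drops everything else. The function name approximate_ascii is made up for this example, and it assumes the OCR output is UTF-8 encoded bytes.

```python
import unicodedata


def approximate_ascii(raw, encoding="utf-8"):
    """Return an ASCII-only approximation of OCR output.

    Hypothetical helper. Assumes `raw` is UTF-8 bytes (or already text).
    """
    # Decode bytes to text; undecodable bytes become U+FFFD and are dropped below.
    text = raw.decode(encoding, "replace") if isinstance(raw, bytes) else raw
    # NFKD splits e.g. an accented letter into base letter + combining accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop everything that is still outside ASCII (accents, unmapped symbols).
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

A character with no decomposition, such as "ß", simply disappears here, whereas Unidecode maps many such characters to a readable ASCII spelling, so Unidecode remains the better choice when it is available.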

【Discussion】:

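Alternatively, if the user wants to remove the Unicode characters, they can follow this thread: stackoverflow.com/questions/15321138/…
With a plain import unidecode this gave a TypeError: 'module' object is not callable. A small change fixed it: from unidecode import unidecode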
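For the asker's second option (checking a string for non-ASCII characters and then skipping or stripping them), a minimal standard-library sketch might look like this. The helper names to_text, is_ascii and strip_non_ascii are invented for illustration, and the code assumes the OCR output is UTF-8 encoded bytes.

```python
def to_text(raw, encoding="utf-8"):
    """Decode OCR output to text; invalid bytes become U+FFFD instead of raising.

    Hypothetical helper. Assumes `raw` is UTF-8 bytes (or already text).
    """
    if isinstance(raw, bytes):
        return raw.decode(encoding, "replace")
    return raw


def is_ascii(raw):
    """Return True if every character is in the ASCII range."""
    return all(ord(ch) < 128 for ch in to_text(raw))


def strip_non_ascii(raw):
    """Return the text with every non-ASCII character removed."""
    return "".join(ch for ch in to_text(raw) if ord(ch) < 128)
```

With these, the calling code can either print a line only when is_ascii(text) is true, or always print strip_non_ascii(text). Neither function raises on unexpected bytes, so the check itself cannot crash the program.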

【Solution 2】:

Is there a way to make pytesser not return any non-ASCII characters?

You can use the tessedit_char_whitelist option to restrict the characters that tesseract is allowed to recognize.

For example:

import string

# Allow only digits, ASCII letters, and the characters "_", "-" and "."
char_whitelist = string.digits
char_whitelist += string.ascii_lowercase
char_whitelist += string.ascii_uppercase
image = Image.open(ImagePath)
text = image_to_string(image,
    config="-c tessedit_char_whitelist=%s_-." % char_whitelist)
print text

See also: https://github.com/tesseract-ocr/tesseract/wiki/FAQ-Old#how-do-i-recognize-only-digits
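As a belt-and-braces variant of the whitelist idea, the same character set can also be applied to the returned string, in case the engine still emits something outside it. This is only a sketch: whitelist_config and keep_whitelisted are hypothetical helper names, and the filter assumes the text has already been decoded.

```python
import string

# Same character set as in the answer: digits, ASCII letters, plus "_-."
WHITELIST = string.digits + string.ascii_letters + "_-."


def whitelist_config(chars=WHITELIST):
    """Build the config string to pass as image_to_string(..., config=...)."""
    return "-c tessedit_char_whitelist=%s" % chars


def keep_whitelisted(text, chars=WHITELIST, keep_whitespace=True):
    """Drop every character of `text` that is not in the whitelist.

    Whitespace is kept by default so that words and lines stay separated.
    """
    allowed = set(chars)
    return "".join(
        ch for ch in text
        if ch in allowed or (keep_whitespace and ch.isspace())
    )
```

The config string would be passed to the OCR call as in the example above, and keep_whitelisted would then be run on the result before printing.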

【Discussion】: