[Posted]: 2015-02-24 15:10:44
[Question]:
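I am working on OCR of printed text, focusing in particular on preprocessing steps that improve the results of the Tesseract engine. Adaptive thresholding, noise removal, text deskewing, and similar steps have already given me good results, but Tesseract still seems to fail on input where commercial products return decent results.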
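For readers unfamiliar with the first of those steps, here is a minimal, dependency-free sketch of mean-based adaptive thresholding. It is not the code I actually run (in practice a library routine such as OpenCV's `cv2.adaptiveThreshold` would be the usual choice); the function name `adaptive_threshold` and the `window` and `offset` parameters are illustrative assumptions. Each pixel is compared against the mean of its local neighbourhood, computed with an integral image.

```python
def adaptive_threshold(gray, window=15, offset=10):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel becomes 0 (ink) when it is darker than the mean of its local
    window minus `offset`; otherwise it becomes 255 (background).
    `window` and `offset` are illustrative defaults, not tuned values.
    """
    h = len(gray)
    w = len(gray[0]) if h else 0

    # Integral image with an extra zero row and column, so that any
    # rectangle sum can be read with four lookups.
    integral = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += gray[y][x]
            integral[y + 1][x + 1] = integral[y][x + 1] + row_sum

    r = window // 2
    out = [[255] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - r), min(h, y + r + 1)
        for x in range(w):
            x0, x1 = max(0, x - r), min(w, x + r + 1)
            area = (y1 - y0) * (x1 - x0)
            total = (integral[y1][x1] - integral[y0][x1]
                     - integral[y1][x0] + integral[y0][x0])
            # Same test as gray < mean - offset, kept in integers.
            if gray[y][x] * area < total - offset * area:
                out[y][x] = 0
    return out


if __name__ == "__main__":
    # Synthetic example: brightness ramps from left to right, with two
    # dark strokes at columns 3 and 16.
    img = [[100 + 5 * x - (60 if x in (3, 16) else 0) for x in range(20)]
           for _ in range(5)]
    for row in adaptive_threshold(img, window=5, offset=10):
        print("".join("#" if v == 0 else "." for v in row))
```

In the synthetic example the stroke at column 16 (value 120) is brighter than the background at column 0 (value 100), so no single global threshold can separate ink from background, while the local comparison can.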
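I used the following test image; below are the results from Tesseract 3.04 compared with two commercial OCR APIs. All three services were given the same binary image, which contains some slightly blurred text.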
Tesseract
Careers in Technology Consulting
Networking Lunch
21 m 2014, 11:00 - 14:30
Definingthecorporatellstmtegy, Wammmwdngdeal, creating
uniquebwinessisighnwilgbigdam-doesflismflxemmyouafioy?
Findoutmoreabanhowitfeektomkasatedlflogymbyjoiningour
for further mm please visit mAeloittexom/weers
ABBYY Fine Reader Online
Careers in Technology Consulting
Networking Lunch
21 November 2014,1140-14:30
Defining the corporate IT strategy, planning a multHnKon <Mar outsourcing deal, creating unique business insights using big data-doesthis sound Ifce something you enjoy?
Find out more about hour it feels to work as a technology consultant by joining our exclusive networking lunch,
For further information please visit wrwMuleloittexom/carcert
OCR Web Service
Careers in Technology Consulting Networking Lunch 21 November 2014, 11;00 —14:30
Defining the corporate IT strategy, planning a muiti-indlimi dollar outsourcing deal, creating unique business insights using big data—does this sound like something you enjoy?
Find out more about how it feels to work as a tedmology consultant by joining our exclusive networking lunch,
For further information' please visit wwwdeloitte,com/careers
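Now I wonder whether the large gap between Tesseract and the other two products comes from the different engines (ABBYY certainly uses its own engine; I am not sure about OCR Web Service), or whether there are further preprocessing steps I could apply before running Tesseract. Do you have any suggestions?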
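To put a number on that gap, one option is the character error rate: the Levenshtein edit distance between an engine's output and a reference transcription, divided by the reference length. The sketch below is only an illustration. The helper names are hypothetical, and the reference string is my own assumed reading of the date line, since I have no verified ground truth for the image.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance normalised by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # ASSUMPTION: reference text inferred from the outputs, not verified.
    reference = "21 November 2014, 11:00 - 14:30"
    tesseract = "21 m 2014, 11:00 - 14:30"  # line taken from the Tesseract output
    print(levenshtein(reference, tesseract))
    print(round(char_error_rate(reference, tesseract), 3))
```

Running the same comparison over every line and every engine would give one score per engine, which makes it easier to tell whether a new preprocessing step actually helps.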
[Discussion]:
Tags: image-processing ocr tesseract motion-blur