Preserving indentation with Tesseract OCR 4.x

【Question title】: Preserving indentation with Tesseract OCR 4.x
【Posted】: 2020-04-22 05:17:29
【Question description】:

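I am struggling with Tesseract OCR. I have an image of a blood test report that contains an indented table. Although Tesseract recognizes the characters well, the layout is not preserved in the final output. For example, look at the indented rows under "Emocromo con formula" (English: blood count with formula). I want to preserve that indentation.

I read other related discussions and found the preserve_interword_spaces=1 option. The result is slightly better but, as you can see, it is not perfect.

Any suggestions?

Update:

I tried Tesseract v5.0 and the result is the same.

Code:

Tesseract version is 4.0.0.20190314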

from PIL import Image
import pytesseract

# Preserve interword spaces is set to 1, oem = 1 is LSTM, 
# PSM = 1 is Automatic page segmentation with OSD - Orientation and script detection

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'

# default_config = r'-c -l eng+ita'

extracted_text = pytesseract.image_to_string(Image.open('referto-1.jpg'), config=custom_config)

print(extracted_text)

# saving to a txt file

with open("referto.txt", "w") as text_file:
    text_file.write(extracted_text)

Comparison of results:

GitHub:

I have created a GitHub repository in case you want to try it yourself.

Thank you for your help and your time.

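【Question comments】:

"Preserving original text indentation/structure with Tesseract": Tesseract cannot preserve the original structure. Edit your question and explain what you want to do with the OCRed data.
@stovfl Save a txt or pdf with the same structure as the original file. For example, look at the indented rows under "Emocromo con formula" (English: blood count with formula). I want to preserve that indentation.
"Save a txt or pdf with the same structure": I assume you want an out-of-the-box solution? In general, you need the coords of every character or group of characters, and of every graphic and line/grid element. Add the output of Creating Snapshots to your GitHub repo.
@stovfl "I assume you want an out-of-the-box solution?" Preferably, if there is one. Saving to PDF is easy and I have done it; saving to a txt file with the same indentation, however, is not as easy as I thought.
"Preferably, if there is one": I don't know of one. "to Pdf ... I have done it": how did you get the indent/tab values? "to a txt file": it depends; plain text can only use \t and <space>. The text viewer decides whether a tab expands to 2, 4 or 8 <spaces>. A table stays undistorted only in a monospaced font. This means a table may look fine in one text viewer and broken in another.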
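The remark about tab expansion can be checked with Python's built-in `str.expandtabs`, which imitates what viewers with different tab widths would display. This is only a small sketch: the two rows and the helper name `render` are made up for illustration.

```python
# Hypothetical table rows, label and value separated by a single tab
# (invented for illustration, not taken from the OCR output).
rows = ["Glucosio\t95", "Bilirubina totale\t0.60"]

def render(rows, tabsize):
    """Expand tabs the way a viewer with the given tab width would."""
    return [row.expandtabs(tabsize) for row in rows]

for tabsize in (2, 4, 8):
    print(f"tab width {tabsize}:")
    for line in render(rows, tabsize):
        # The value column starts at a different position for each tab width,
        # and the two rows do not line up with each other.
        print("  " + line)
```

With labels of different lengths a single tab never lines the values up, which is why padding with spaces and viewing in a monospaced font is the safer choice for plain text.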

Tags: python computer-vision ocr tesseract python-tesseract

【Solution 1】:

The image_to_data() function provides much more information. For each word, it returns the word's bounding rectangle. You can use that.
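For orientation, with `output_type=Output.DICT` the result is a dictionary of parallel lists, one entry per detected element. The sketch below uses a hand-written sample in place of a real OCR call so that it runs without Tesseract installed. The key names follow Tesseract's TSV columns, while the values and the helper `word_boxes` are assumptions made for illustration.

```python
# Hand-written sample mimicking the shape of
# pytesseract.image_to_data(..., output_type=Output.DICT).
# The values are invented; they are not real OCR output.
sample = {
    "block_num": [1, 1, 1],
    "par_num": [1, 1, 1],
    "line_num": [1, 1, 2],
    "left": [40, 400, 70],
    "top": [100, 100, 130],
    "width": [80, 20, 90],
    "height": [18, 18, 18],
    "conf": [96, 93, 91],
    "text": ["Glucosio", "95", "u-Colore"],
}

def word_boxes(data):
    """Yield (text, left, top, width, height) for every non-empty word."""
    for i, text in enumerate(data["text"]):
        if text.strip():
            yield (text, data["left"][i], data["top"][i],
                   data["width"][i], data["height"][i])

for box in word_boxes(sample):
    print(box)
```

The `left` and `width` values are what make it possible to estimate a character width and turn pixel positions into text columns.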

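Tesseract automatically segments the image into blocks. You can then sort the blocks by their vertical position and, for each block, compute the mean character width (which depends on the font recognized in that block). Then, for each word in the block, check whether it is close to the previous word; if it is not, add spaces accordingly. I am using pandas to simplify the calculations, but it is not required. Don't forget that the result should be displayed in a monospaced font.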

import pytesseract
from pytesseract import Output
from PIL import Image
import pandas as pd

custom_config = r'-c preserve_interword_spaces=1 --oem 1 --psm 1 -l eng+ita'
d = pytesseract.image_to_data(Image.open(r'referto-2.jpg'), config=custom_config, output_type=Output.DICT)
df = pd.DataFrame(d)

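# clean up blanks; conf may be a string or a number depending on the pytesseract version
df1 = df[(pd.to_numeric(df.conf, errors='coerce') != -1) & (df.text.str.strip() != '')]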
# sort blocks vertically
sorted_blocks = df1.groupby('block_num').first().sort_values('top').index.tolist()
for block in sorted_blocks:
    curr = df1[df1['block_num']==block]
    sel = curr[curr.text.str.len()>3]
    char_w = (sel.width/sel.text.str.len()).mean()
    prev_par, prev_line, prev_left = 0, 0, 0
    text = ''
    for ix, ln in curr.iterrows():
        # add new line when necessary
        if prev_par != ln['par_num']:
            text += '\n'
            prev_par = ln['par_num']
            prev_line = ln['line_num']
            prev_left = 0
        elif prev_line != ln['line_num']:
            text += '\n'
            prev_line = ln['line_num']
            prev_left = 0

        added = 0  # num of spaces that should be added
        if ln['left']/char_w > prev_left + 1:
            added = int((ln['left'])/char_w) - prev_left
            text += ' ' * added 
        text += ln['text'] + ' '
        prev_left += len(ln['text']) + added + 1
    text += '\n'
    print(text)

This code produces the following output:

    ssseeess+ SERVIZIO SANITARIO REGIONALE                          Pagina 2 di3 
   seoeeeees EMILIA-RROMAGNA 
     ©2888   800 
     ©9868  6 006   :       pe   ‘  ‘        " 
     «ee @@e@ecee Azienda Unita Sanitaria Locale di Modena 
     Seat se  ces Amends Ospedaliero-Universitaria Policlinico di Modena 
         Dipartimento  interaziendale ad attivita integrata di Medicina di Laboratorio e Anatomia Patologica 
                                                  Direttore dr. T.Trenti 
                                           Ospedale Civile S.Agostino-Estense 
                                             S.C. Medicina  di Laboratorio 
                                           S.S. Patologia  Clinica - Corelab 
                            Sistema di Gestione per la Qualita certificato UNI EN ISO 9001:2015 
                                              Responsabile dr.ssa M.Varani 
        Richiesta (CDA):   49/073914                                    Data di accettazione: 18/12/2018 
                                                                        Data di check-in:    18/12/2018 10:27:06 
                                                                        Referto del          18/12/2018 16:39:53 
                                                                        Provenienza:         D4-cp sassuolo 

                                                           Sig. 
                                                           Data di Nascita: 
                                                           Domicilio: 
          ANALISI                                              RISULTATO  __UNITA'DI MISURA VALORI DI RIFERIMENTO 
       Glucosio                                                     95     mg/dl            (70  - 110 ) 
       Creatinina                                                 1.03     mg/dl            ( 0.50 - 1.40 ) 
       eGFR  Filtrato glomerulare stimato                         >60      ml/min           Cut-off per rischio di  I.R. 
             7                                                                              <60. Il calcolo é€ riferito 
       Equazione  CKD-EPI                                                                   ad una superfice corporea 
                                                                                            Standard  (1,73 mq)x In Caso 
                                                                                            di etnia afroamericana 
                                                                                            moltiplicare per  il fattore 
                                                                                            1,159. 
       Colesterolo                                                212   *  mg/dl            < 200 v.desiderabile 
       Trigliceridi                                                106     mg/dl            < 180 v.desiderabile 
       Bilirubina totale                                          0.60     mg/dl            ( 0.16 - 1.10 ) 
       Bilirubina diretta                                         0.10     mg/dl            ( 0.01 - 0.3 ) 
       GOT  - AST                                                   17     U/L              (1-37) 
       GPT  - ALT                                                   ay     U/L              (1-   40 ) 
       Gamma-GT                                                     15     U/L              (1-55) 
       Sodio                                                       142     mEq/L            ( 136 - 146 ) 
       Potassio                                                    4.3     mEq/L            (3.5  - 5.3) 
       Vitamina B12                                               342      pg/ml            ( 200 - 960 ) 
       TSH                                                        5.47  *  ulU/ml           (0.35  - 4.94 ) 
       FT4                                                         9.7     pg/ml            (7  = 15) 
       Urine chimico fisico morfologico 
          u-Colore                                     giallo paglierino 
          u-Peso specifico                                       1.012                      ( 1.010 - 1.027  ) 
          u-pH                                                     5.5                      (5.5  - 6.5) 
          u-Glucosio                                           assente     mg/dl            assente 
          u-Proteine                                           assente     mg/dl            (0  -10 ) 
          u-Emoglobina                                         assente     mg/dl            assente 
          u-Corpi chetonici                                    assente     mg/dl            assente 
          u-Bilirubina                                         assente     mg/dl            assente 
          u-Urobilinogeno                                         0.20     mg/dl            (0-   1.0 ) 
          sedimento                                    non significativo 
                                                                                          Il Laureato: 
                                                                                                     Dott. CRISTINA ROTA 
       Per ogni informazione o chiarimento sugli aspetti medici, puo rivolgersi al suo medico curante 
       Referto firmato elettronicamente secondo le norme vigenti: Legge 15 marzo 1997, n. 59; D.P.R. 10 novembre 1997, n.513; 
       D.P.C.M. 8 febbraio 1999; D.P.R 28 dicembre 2000, n.445; D.L. 23 gennaio 2002, n.10. 
       Certificato rilasciato da: Infocamere S.C.p.A. (http://www.card.infocamere. it) 
       i! Laureato: Dr. CRISTINA ROTA 
       1! documento informatico originale 6 conservato presso Parer - Polo Archivistico della Regione Emilia-Romagna
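
The spacing step of the script above can also be looked at in isolation. The following is a simplified sketch, not the exact logic of the answer: the function name `layout_line`, the sample pixel positions and the character width of 10 px are assumptions for illustration. Each word is placed at the text column obtained by dividing its left edge by the mean character width.

```python
def layout_line(words, char_w):
    """Build one text line from (left_px, text) pairs.

    Each word is moved to column int(left_px / char_w) by padding with
    spaces; words that would overlap the previous one are simply appended.
    """
    line = ""
    for left, text in sorted(words):
        target_col = int(left / char_w)
        # Pad only if the word starts to the right of the current line end.
        if target_col > len(line):
            line += " " * (target_col - len(line))
        line += text + " "
    return line.rstrip()

# Hypothetical words of one table row: (left edge in pixels, text).
row = [(70, "u-Colore"), (400, "giallo"), (470, "paglierino")]
print(repr(layout_line(row, char_w=10)))
```

Because every column position is derived from pixel coordinates, rows of the same table line up only if they share the same character width, which is why the script estimates it per block.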

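【Comments】:

Great. This is perfect!
Could you tell me how to create a data frame (CSV file) with the same structure as this text file? Great solution.
The result is a plain text string without any structure. You can save it as is, e.g. with open('output.txt', 'w') as fout: fout.write(text). There is no need for a data frame or CSV here.
How can I save it as an .xlsx file and keep the formatting?
As I said in the previous comment, the result is a multi-line string without any structure. Saving it in .xlsx format means either dumping the text into a single cell (or line by line), or writing some logic to parse the text, which is beyond the scope of the current question.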