[Question title]: How to set the FSM configuration for the Textricator PDF OCR reader?
[Posted]: 2021-07-19 08:38:30
[Question]:

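I am trying to use a PDF document parser called Textricator. It can parse PDFs with some common OCR libraries (itext5, itext7, pdfbox) using 3 different methods. The available methods are text, table, and form: text is for plain raw OCR recognition, table is for reading structured tabular data, and form is for parsing less structured forms using a finite-state machine (FSM).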

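However, I cannot get the form parser to work. Maybe I simply do not understand how to organize the many configuration states. The documentation lacks a simple form example, and someone recently posted an attempt to read a very basic table with the form method but could not get it to work. I tried it too, without success.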

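Q: Can anyone help me configure the state machine in the YML file?
(This is for parsing the demo file from one of that repo's issues, shown in the reproduced screenshot below.)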



The YML configuration file:


extractor: "pdf.pdfbox"

header:
  default: 100
footer:
   default: 600

maxRowDistance: 2

rootRecordType: item
recordTypes:
  item:
    label: "item"
    valueTypes:
      - item
      - date
      - description
      - order_number
      - quantity
      - price

valueTypes:
  item:
    label: "Item"
  date:
    label: "Date"
  description:
    label: "Description"
  order_number:
    label: "OrderNo"
  quantity:
    label: "Qty"
  price:
    label: "Price"
 
initialState: "INIT"

states:
  INIT:
    transitions:
      -
        condition: item
        nextState: item

  item:
    startRecord: true
    transitions:
      -
        condition: date
        nextState: date  

  date:
    include: true
    transitions:
      -
        condition: description
        nextState: description  

  description:
    include: true
    transitions:
      -
        condition: description
        nextState: description     
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity

  order_number:
    include: true
    transitions:
      -
        condition: order_number
        nextState: order_number
      -
        condition: quantity
        nextState: quantity

  quantity:
    include: true
    transitions:
      -
        condition: price
        nextState: price

  price:
    include: true
    transitions:
      -
        condition: end
        nextState: end

  end:
    include: false
    transitions:
      -
        condition: any
        nextState: end

conditions:

  item:         '73 < ulx < 110 and text =~ /(\\d)*/'
  date:         '110 < ulx < 181 and text =~ /([0-9\-]*)/'
  description:  '193 < ulx < 366'
#  order_number: '12 <= uly_rel <= 16 and text =~ ^.+/((\d{6})\-)((\d{2}))/'
  order_number: '12 <= uly_rel <= 16 and text =~ ^.+((\d{6})\-)((\d{2}))'
  quantity:     '393 < ulx < 459'
  price:        '459 < ulx < 523'

  end:          'text =~ /(Footer)/'
  any: "1 = 1"

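You may wonder why I insist on using the form processor for this simple example. The reason is that in my real documents the description field will have a more complex sub-structure of sub-items. AFAIK, that can only(?) be handled effectively by a state machine.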

But perhaps this is not the right tool for the job? If so, what other options are there?


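Update (2021-05-18):

The author of Textricator has since updated the libraries used and the documentation, and has corrected several working examples and user issues. Thanks to user mweber, I now have a perfectly working parser and no longer need to use awk to handle weird columns.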

[Comments]:

    Tags: itext ocr pdfbox text-extraction


    [Answer 1]:

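    Since Textricator is, imo, a hidden gem for parsing PDFs, I was happy to see somebody using it, and I posted a configuration that works with the example document to the GitHub issue: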

    extractor: "pdf.pdfbox"
    
    header:
      default: 100
    footer:
      default: 600
    
    maxRowDistance: 2
    
    rootRecordType: item
    recordTypes:
      item:
        label: "item"
        valueTypes:
          - item
          - date
          - description
          - order_number
          - quantity
          - price
    
    valueTypes:
      item:
        label: "Item"
      date:
        label: "Date"
      description:
        label: "Description"
      order_number:
        label: "OrderNo"
      quantity:
        label: "Qty"
      price:
        label: "Price"
    
    initialState: "INIT"
    
    states:
      INIT:
        include: false
        transitions:
          -
            condition: item
            nextState: item
          - condition: any
            nextState: INIT
    
      item:
        startRecord: true
        transitions:
          -
            condition: date
            nextState: date  
    
      date:
        include: true
        transitions:
          -
            condition: description
            nextState: description  
    
      description:
        include: true
        transitions:
          -
            condition: description
            nextState: description     
          -
            condition: order_number
            nextState: order_number
          -
            condition: quantity
            nextState: quantity
          -
            condition: item
            nextState: item
    
      order_number:
        include: true
        transitions:
          -
            condition: order_number
            nextState: order_number
          -
            condition: quantity
            nextState: quantity
    
      quantity:
        include: true
        transitions:
          - 
            condition: price
            nextState: price
    
      price:
        include: true
        transitions:
          -
            condition: end
            nextState: end
          - 
            condition: description
            nextState: description
          -
            condition: item
            nextState: item
    
      end:
        include: false
        transitions:
          -
            condition: any
            nextState: end
    
    conditions:
    
      item:         '73 < ulx < 110 and text =~ /(\\d)*/'
      date:         '110 < ulx < 181 and text =~ /([0-9\\-]*)/'
      description:  '193 < ulx < 366'
      order_number: '12 <= uly_rel <= 16 and text =~ /^.+(([0-9]{6})\\-)(([0-9]{2}))/'
      quantity:     '393 < ulx < 459'
      price:        '459 < ulx < 523'
    
      end:          'text =~ /(Footer)/'
      any: "1 = 1"
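
Comparing this configuration line by line with the one in the question, the changes are limited to three states and two conditions. The fragment below isolates them. The comments on what each change achieves are my reading of the state machine, not something stated by the answer's author.

```yaml
# Only the parts that differ from the question's config; all other keys are unchanged.
states:
  INIT:
    include: false
    transitions:
      - condition: item
        nextState: item
      # Added: self-loop, presumably so text before the first item is skipped
      - condition: any
        nextState: INIT

  description:
    include: true
    transitions:
      - condition: description
        nextState: description
      - condition: order_number
        nextState: order_number
      - condition: quantity
        nextState: quantity
      # Added: a way back to item, presumably to start the next record
      - condition: item
        nextState: item

  price:
    include: true
    transitions:
      - condition: end
        nextState: end
      # Added: description and item can now follow price
      - condition: description
        nextState: description
      - condition: item
        nextState: item

conditions:
  # Changed: the hyphen escape is doubled (\\-) where the question had \-
  date:         '110 < ulx < 181 and text =~ /([0-9\\-]*)/'
  # Changed: the regex is now wrapped in /.../ and uses [0-9] instead of \d
  order_number: '12 <= uly_rel <= 16 and text =~ /^.+(([0-9]{6})\\-)(([0-9]{2}))/'
```

In the question's version, price could only move to end and INIT had no fallback transition, so it seems likely the machine could neither skip leading text nor begin a second row.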
    

    [Comments]:

    • Great! The software author also added filter: 'quantity > 0' under recordTypes (indented by 4 spaces), and type: number under quantity in valueTypes.
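
A sketch of where those two additions might sit, assuming that the 4-space indentation puts filter next to label inside the item record type. The placement is inferred from the comment above and has not been checked against the Textricator documentation. The remaining keys are copied from the answer's configuration.

```yaml
recordTypes:
  item:
    label: "item"
    # Assumed placement: 4 spaces deep, at the same level as label
    filter: 'quantity > 0'
    valueTypes:
      - item
      - date
      - description
      - order_number
      - quantity
      - price

valueTypes:
  quantity:
    label: "Qty"
    # Declares quantity as numeric, as described in the comment
    type: number
```

If the filter behaves as its expression suggests, records whose quantity is not greater than 0 would be left out of the output.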