[Title]: Train spacy model using custom input
[Posted]: 2019-09-12 17:53:28
[Question]:

This is my first attempt at spacy. I have spacy training data in the following format.

[
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"Michael",
                "tag":"-",
                "ner":"U-PER"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"Irwin",
                "tag":"-",
                "ner":"U-PER"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"Jordan",
                "tag":"-",
                "ner":"U-PER"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"is",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"an",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"American",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"scientist",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"Professor",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"at",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"the",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"University",
                "tag":"-",
                "ner":"U-ORG"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"of",
                "tag":"-",
                "ner":"U-ORG"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"California",
                "tag":"-",
                "ner":"U-ORG"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"Berkeley",
                "tag":"-",
                "ner":"U-LOC"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"and",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"a",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"researcher",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"in",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"machine",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"learning",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"statistics",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"and",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"artificial",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"intelligence",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"",
                "tag":"",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  }
]

All of the examples of training a spacy model that I have seen so far (https://spacy.io/usage/training#spacy-train-cli) work with the following type of input.

Could someone give an example of training spacy with the first form of input?

[Discussion]:

    Tags: python-3.x spacy


    [Solution 1]:

    I recently updated the IOB/NER converters and created a set of example inputs accepted by spacy convert -c iob, along with the corresponding training data output in this format:

    https://github.com/explosion/spaCy/tree/8ebc3711dc1ec065c39aeb6017d9ace129a28d3f/examples/training/ner_example_data

    The updated converters will ship in the next release, but if you want to try them sooner, you can install the master branch from source.
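For reference, the end-to-end CLI usage looks roughly like this. This is a sketch, not from the answer itself: the file names are placeholders, and it assumes a spaCy version that includes the updated converter (2.2+ or a source install of master):

```shell
# Convert IOB/NER-formatted data into spaCy's JSON training format.
# The converted file will have the "paragraphs"/"sentences"/"tokens"
# structure shown in the question.
python -m spacy convert train_data.iob ./converted -c iob

# Train from the converted file (v2 CLI: lang, output dir, train, dev).
# Reusing the training file as the dev set here is only for illustration;
# a real run should hold out separate dev data.
python -m spacy train en ./model_out ./converted/train_data.json ./converted/train_data.json
```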

    [Comments]:

    • Thanks. I can see that github.com/explosion/spaCy/blob/… is the required form, and github.com/explosion/spaCy/blob/… uses it. I'll give it a try.
    • Do you have a timeline for the next release?
    • One more thing: I can read in the comments of github.com/explosion/spaCy/blob/… that it was "developed and tested against spaCy 2.0.6". The version I have installed is 2.1.8. Shouldn't it already be included?
    • Everything is being upgraded for 2.2, which should be out soon, but I don't know the exact timing. That training script is unrelated to the format converters run from spacy convert. The spacy training format has not changed; what changed are the converters from other common IOB/NER formats to the spacy training format used with spacy train. If you look at the details of the training loops, TRAIN_DATA in the NER multitask script is in a different format from TRAIN_DATA in the train_ner.py script. (This is confusing and inconsistent. Hopefully there will be a better training format in the future.)
    • There is a format that organizes text into "paragraphs", with annotations as tags on each word; it is usually called the spacy training format because it is used with the CLI training command python -m spacy train. The other format (the second example in the original question) is typically a list of (text_string, annotation_dict) tuples, and it is used in many of the example scripts because it is much more compact for span annotations such as entities, which do not require a label on every word.