[Title]: Train spacy model using custom input
[Posted]: 2019-09-12 17:53:28
[Question]:

This is my first attempt at spacy. I have spacy training data in the following format.

[
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"Michael",
                "tag":"-",
                "ner":"U-PER"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"Irwin",
                "tag":"-",
                "ner":"U-PER"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"Jordan",
                "tag":"-",
                "ner":"U-PER"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"is",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"an",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"American",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"scientist",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"Professor",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"at",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"the",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"University",
                "tag":"-",
                "ner":"U-ORG"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"of",
                "tag":"-",
                "ner":"U-ORG"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"California",
                "tag":"-",
                "ner":"U-ORG"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"Berkeley",
                "tag":"-",
                "ner":"U-LOC"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"and",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"a",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"researcher",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"in",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"machine",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"learning",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"statistics",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"and",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"artificial",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"intelligence",
                "tag":"-",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  },
  {
    "id":0,
    "paragraphs":[
      {
        "sentences":[
          {
            "tokens":[
              {
                "orth":"",
                "tag":"",
                "ner":"O"
              }
            ]
          }
        ]
      }
    ]
  }
]

All of the examples of training a spacy model that I have seen so far (https://spacy.io/usage/training#spacy-train-cli) work with the following type of input.

Could someone give an example of training spacy with the first form of input?

[Discussion]:

    Tags: python-3.x spacy


    [Solution 1]:

    I recently updated the IOB/NER converters and created a set of example inputs accepted by spacy convert -c iob, along with the corresponding training data output in this format:

    https://github.com/explosion/spaCy/tree/8ebc3711dc1ec065c39aeb6017d9ace129a28d3f/examples/training/ner_example_data

    The updated converters will ship in the next release, but if you want to try them sooner, you can install the master branch from source.
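For reference, the end-to-end CLI usage looks roughly like this. This is a sketch, not from the answer itself: the file names are placeholders, and it assumes a spaCy version that includes the updated converter (2.2+ or a source install of master):

```shell
# Convert IOB/NER-formatted data into spaCy's JSON training format.
# The converted file will have the "paragraphs"/"sentences"/"tokens"
# structure shown in the question.
python -m spacy convert train_data.iob ./converted -c iob

# Train from the converted file (v2 CLI: lang, output dir, train, dev).
# Reusing the training file as the dev set here is only for illustration;
# a real run should hold out separate dev data.
python -m spacy train en ./model_out ./converted/train_data.json ./converted/train_data.json
```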

    [Comments]:

    • Thanks. I can see that github.com/explosion/spaCy/blob/… is the required form, and github.com/explosion/spaCy/blob/… uses it. I'll give it a try.
    • Do you have a timeline for the next release?
    • One more thing: I can read in the comments of github.com/explosion/spaCy/blob/… that it was "developed and tested against spaCy 2.0.6". The version I have installed is 2.1.8. Shouldn't it already be included?
    • Everything is being upgraded for 2.2, which should be out soon, but I don't know the exact timing. That training script is unrelated to the format converters run from spacy convert. The spacy training format has not changed; what changed are the converters from other common IOB/NER formats to the spacy training format used with spacy train. If you look at the details of the training loops, TRAIN_DATA in the NER multitask script is in a different format from TRAIN_DATA in the train_ner.py script. (This is confusing and inconsistent. Hopefully there will be a better training format in the future.)
    • There is a format that organizes text into "paragraphs", with annotations as tags on each word; it is usually called the spacy training format because it is used with the CLI training command python -m spacy train. The other format (the second example in the original question) is typically a list of (text_string, annotation_dict) tuples, and it is used in many of the example scripts because it is much more compact for span annotations such as entities, which do not require a label on every word.