Open NLP Name Finder Training: Answers

【Question title】: Open NLP Name Finder Training
【Posted】: 2013-05-20 11:43:08
【Question】:

Following the online manual (http://opennlp.apache.org/documentation/1.5.2-incubating/manual/opennlp.html), I am building a 15k-line training data file named en-ner-person.train.

My question is: should the training file include the complete report, or only the lines that contain names, such as <START:person> John Smith <END>?

For example, do I use the whole report in the training data:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
A nonexecutive director has many similar responsibilities as an executive director.
However, there are no voting rights with this position.
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .

Or do I include only these two lines in the training file:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr . <START:person> Vinken <END> is chairman of Elsevier N.V. , the Dutch publishing group .

【Comments】:

Tags: opennlp

【Answer 1】:

You should use the whole report. That helps the system learn when not to tag an entity, which reduces false positives.

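You can measure this with the evaluation tool. Hold out some sentences from your corpus for testing, say 1/10 of the total, and train the model on the other 9/10. Try training one model on the whole report and another on only the sentences that contain names. The results are reported as precision and recall.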
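To make the two numbers concrete: precision is the share of predicted names that are correct, and recall is the share of true names that were found. Here is a minimal sketch in plain Java, assuming you already have counts of true positives, false positives, and false negatives. The class `NerScores` and the counts in `main` are made up for illustration and are not part of OpenNLP.

```java
// Hypothetical helper, not part of OpenNLP: scores computed from raw counts.
final class NerScores {
    private NerScores() {}

    /** Share of predicted names that are correct; 0 when nothing was predicted. */
    static double precision(int truePositives, int falsePositives) {
        int predicted = truePositives + falsePositives;
        return predicted == 0 ? 0.0 : (double) truePositives / predicted;
    }

    /** Share of true names that were found; 0 when there are no true names. */
    static double recall(int truePositives, int falseNegatives) {
        int actual = truePositives + falseNegatives;
        return actual == 0 ? 0.0 : (double) truePositives / actual;
    }

    /** Harmonic mean of precision and recall. */
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }

    public static void main(String[] args) {
        // Made-up counts for illustration only.
        double p = precision(80, 20);
        double r = recall(80, 40);
        System.out.printf("precision=%.3f recall=%.3f f1=%.3f%n", p, r, f1(p, r));
    }
}
```

A model trained only on sentences with names would be expected to show up here as more false positives, which lowers precision.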

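Remember to keep the whole report in the test sample, not only the sentences with names. Otherwise you cannot accurately measure how the model performs on sentences without names.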
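A rough sketch of that setup in plain Java, assuming one sentence per line and holding out the last tenth of the corpus. The class `HoldOutSplit` and its method names are hypothetical and not part of OpenNLP, and the sketch does no special handling of blank lines. The point is that only the training side is ever filtered; the test side stays as it is.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the comparison experiment; not part of OpenNLP.
final class HoldOutSplit {
    final List<String> train;
    final List<String> test;

    private HoldOutSplit(List<String> train, List<String> test) {
        this.train = train;
        this.test = test;
    }

    /** Holds out the last 1/parts of the sentences for testing, unfiltered. */
    static HoldOutSplit holdOutTail(List<String> sentences, int parts) {
        int testSize = sentences.size() / parts;
        int cut = sentences.size() - testSize;
        return new HoldOutSplit(
                new ArrayList<>(sentences.subList(0, cut)),
                new ArrayList<>(sentences.subList(cut, sentences.size())));
    }

    /** Training variant that keeps only sentences containing a name tag. */
    static List<String> namesOnly(List<String> sentences) {
        List<String> kept = new ArrayList<>();
        for (String sentence : sentences) {
            if (sentence.contains("<START:")) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Toy corpus: every other sentence contains a tagged name.
        List<String> corpus = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            corpus.add(i % 2 == 0
                    ? "<START:person> Pierre Vinken <END> will join the board ."
                    : "There are no voting rights with this position .");
        }
        HoldOutSplit split = holdOutTail(corpus, 10);
        System.out.println("full training set: " + split.train.size());
        System.out.println("names-only training set: " + namesOnly(split.train).size());
        System.out.println("test set (never filtered): " + split.test.size());
    }
}
```

You would train one model on `train`, another on `namesOnly(train)`, and evaluate both against the same unfiltered `test` list.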

【Comments】:

【Answer 2】:

I would include everything, even if not all of it ends up contributing to the weights in the trained model.

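What is or is not used from the training file depends on the feature generators used to train the model. If you get to the point of actually tuning the feature generators, you at least won't need to rebuild your training file if it already contains everything.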

This example feature generator from the documentation (section "Custom Feature Generation") also happens to be the default one used in the name finder code:

import opennlp.tools.util.featuregen.*;

AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
         new AdaptiveFeatureGenerator[]{
           // token text, looking 2 tokens back and 2 tokens ahead
           new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
           // token class, over the same +/-2 window
           new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
           new OutcomePriorFeatureGenerator(),
           new PreviousMapFeatureGenerator(),
           new BigramNameFeatureGenerator(),
           new SentenceFeatureGenerator(true, false)
           });

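I can't fully explain that code, and I have not found good documentation on it or dug into the source to understand it, but the WindowFeatureGenerators there take into account the tokens and the tokens' classes (as far as I can tell, a surface-shape class such as capitalized or numeric) at +/-2 positions before and after the token under examination.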
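To illustrate the +/-2 window idea only, here is a simplified sketch in plain Java. It is not OpenNLP's WindowFeatureGenerator code; the class `TokenWindowSketch` and the `w[offset]=token` feature naming are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: a simplified token window, not OpenNLP's implementation.
final class TokenWindowSketch {
    private TokenWindowSketch() {}

    /**
     * Collects the lowercased tokens from (index - prev) to (index + next),
     * skipping positions that fall outside the token array.
     */
    static List<String> windowFeatures(String[] tokens, int index, int prev, int next) {
        List<String> features = new ArrayList<>();
        for (int offset = -prev; offset <= next; offset++) {
            int pos = index + offset;
            if (pos >= 0 && pos < tokens.length) {
                features.add("w[" + offset + "]=" + tokens[pos].toLowerCase(Locale.ROOT));
            }
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"Mr", ".", "Vinken", "is", "chairman"};
        // Features seen when the token under examination is "Vinken".
        System.out.println(windowFeatures(tokens, 2, 2, 2));
    }
}
```

The takeaway is that the tokens around a name are part of what the model sees, so changing what surrounds the names changes the features.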

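So tokens in sentences that contain no entity can still affect sentences that do. By cropping out the extra sentences, you may end up training the model on unnatural patterns, such as a sentence that ends with a name followed by a sentence that starts with the same name: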

The car fell on <START:person> Pierre Vinken <END> . <START:person> Pierre Vinken <END> is the chairman.

【Comments】: