Answers: Cross Validation with Spacy for Named Entity Recognition

[Question title]: Cross Validation with Spacy for Named Entity Recognition
[Posted]: 2019-12-09 20:05:33
[Question description]:

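I am trying to train a custom NER model on 50 billion samples. I am using minibatch with 20 iterations for modelling. I would like to understand whether I should use cross validation to get a more accurate accuracy estimate. If yes, where should the cross validation step go? If not, how should I split and distribute my training and test data? I am using annotations with 6 custom entities, and it is hard to keep track of the percentage of annotated labels in each of the training and test sets and to distribute them evenly.

Here is the code I use for training: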

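import random

import spacy
from spacy.util import minibatch, compounding


def train_spacy(data, iterations):
    # NOTE: this uses the spaCy v2 training API (create_pipe, begin_training)
    TRAIN_DATA = data

    # create blank Language class
    nlp = spacy.blank('en')

    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')

    # Add LABELS
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])

    # Get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']

    # only train NER
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            # reshuffle (in place) and reset the losses on every pass
            random.shuffle(TRAIN_DATA)
            losses = {}
            # the batch sizes are an assumption; the original post did not show them
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer,
                           drop=0.20, losses=losses)
            print('Losses', losses)

    return nlp


if __name__ == "__main__":

    # Train formatted data
    # `data` is the list of (text, {'entities': [...]}) tuples prepared elsewhere
    model = train_spacy(data, 10)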

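I think the cross validation step should go somewhere inside the for loop over the iterations, but I am not sure. Can someone explain how to use cross validation with Spacy NER, or is it not needed at all?

[Question discussion]:

Tags: python-3.x machine-learning nlp spacy

[Solution 1]: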

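Ideally, you would split off part of your training dataset as a "dev set" and use the scores over all entities in that set to tune your hyperparameters.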
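As a rough illustration of scoring on a dev set, the sketch below computes micro-averaged, exact-match precision, recall and F1 over entity spans. It is not spaCy's own scorer: `ner_prf`, `score_on_dev` and `toy_predict` are hypothetical names introduced here, and the dev data is made up, in the same (text, {"entities": [...]}) format as the question.

```python
def ner_prf(gold, predicted):
    """Micro-averaged exact-match precision, recall and F1 over entity spans.

    Both arguments are lists (one item per text) of (start, end, label) spans.
    """
    tp = fp = fn = 0
    for gold_ents, pred_ents in zip(gold, predicted):
        gold_set = set(map(tuple, gold_ents))
        pred_set = set(map(tuple, pred_ents))
        tp += len(gold_set & pred_set)   # spans predicted exactly right
        fp += len(pred_set - gold_set)   # predicted spans not in the gold data
        fn += len(gold_set - pred_set)   # gold spans the model missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def score_on_dev(predict, dev_data):
    """Score `predict` (a callable: text -> list of spans) on a dev set."""
    gold = [annotations["entities"] for _, annotations in dev_data]
    predicted = [predict(text) for text, _ in dev_data]
    return ner_prf(gold, predicted)


# Made-up dev set with one text and two gold entities
dev_data = [
    ("Alice visited Paris", {"entities": [(0, 5, "PERSON"), (14, 19, "CITY")]}),
]


def toy_predict(text):
    # Stand-in for a trained model: one correct span, one wrong span
    return [(0, 5, "PERSON"), (6, 13, "CITY")]


precision, recall, f1 = score_on_dev(toy_predict, dev_data)
print(precision, recall, f1)
```

With a trained spaCy model `nlp`, the `predict` callable could be something like `lambda text: [(e.start_char, e.end_char, e.label_) for e in nlp(text).ents]`. One would then retrain with different settings (for example the dropout rate or the number of iterations) and keep whichever gives the best dev-set F1.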

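If you pick that portion at random (making sure it is not biased towards dates or names), you can expect the distribution of entities to be roughly the same in both parts. It is best not to over-engineer this split and to simply take a truly random sample.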
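A minimal sketch of such a random split, using only the standard library, might look like the following. The helper names `split_train_dev` and `label_shares`, the `dev_fraction` and `seed` parameters, and the toy data are all assumptions made for illustration. The second helper only reports each label's share, so the two parts can be compared without engineering the split itself.

```python
import random
from collections import Counter


def split_train_dev(data, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the data and split off a random dev set."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


def label_shares(examples):
    """Return each entity label's share of all annotated entities."""
    counts = Counter(
        label
        for _, annotations in examples
        for _start, _end, label in annotations.get("entities", [])
    )
    total = sum(counts.values())
    if not total:
        return {}
    return {label: n / total for label, n in counts.items()}


# Made-up data in the (text, {"entities": [...]}) format used in the question
toy_data = (
    [("Alice visited Paris",
      {"entities": [(0, 5, "PERSON"), (14, 19, "CITY")]})] * 50
    + [("Bob stayed home", {"entities": [(0, 3, "PERSON")]})] * 50
)

train, dev = split_train_dev(toy_data, dev_fraction=0.2, seed=0)
print(len(train), len(dev))
print("train:", label_shares(train))
print("dev:  ", label_shares(dev))
```

If the printed shares differ a lot between the two parts, that usually points to a dev set that is too small rather than to a need for a hand-crafted split.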

[Discussion]: