[Posted]: 2019-12-09 20:05:33
[Question]:
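I am trying to train a custom NER model on 50 billion samples. I train with minibatches for 20 iterations. I would like to know whether I should use cross-validation to get a more accurate estimate of out-of-sample accuracy. If so, where should the cross-validation step go? If not, how should I split and distribute my training and test data? I am using annotations with 6 custom entities, and it is hard to track the percentage of each annotated label in the training and test sets and keep it evenly distributed.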
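To make the label-percentage problem concrete, here is a minimal sketch of how the share of each entity label in a split could be measured. It assumes the data is a list of `(text, {'entities': [(start, end, label), ...]})` tuples, which is the format the training code iterates over; `label_share` and the toy `sample` are hypothetical names used only for illustration.

```python
from collections import Counter


def label_share(examples):
    """Return each entity label's share of all annotated entities."""
    counts = Counter(
        label
        for _, annotations in examples
        for _start, _end, label in annotations.get("entities", [])
    )
    total = sum(counts.values())
    # An empty split has no entities, so there is nothing to normalise
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Hypothetical toy data in the (text, {"entities": [...]}) format
sample = [
    ("Alice works at Acme", {"entities": [(0, 5, "PERSON"), (15, 19, "ORG")]}),
    ("Bob lives in Paris", {"entities": [(0, 3, "PERSON"), (13, 18, "GPE")]}),
]
print(label_share(sample))
```

Running the same function on the training split and on the test split and comparing the two dictionaries would show whether any of the 6 entities is over- or under-represented in either one.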
Here is the code I use for training:
import spacy
from spacy.util import minibatch, compounding


def train_spacy(data, iterations):
    TRAIN_DATA = data
    # create blank Language class
    nlp = spacy.blank('en')
    # create the built-in pipeline components and add them to the pipeline
    # nlp.create_pipe works for built-ins that are registered with spaCy
    if 'ner' not in nlp.pipe_names:
        ner = nlp.create_pipe('ner')
        nlp.add_pipe(ner, last=True)
    else:
        ner = nlp.get_pipe('ner')
    # Add LABELS
    for _, annotations in TRAIN_DATA:
        for ent in annotations.get('entities'):
            ner.add_label(ent[2])
    # Get names of other pipes to disable them during training
    other_pipes = [pipe for pipe in nlp.pipe_names if pipe != 'ner']
    # only train NER
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for itn in range(iterations):
            print("Starting iteration " + str(itn))
            # reset the loss accumulator for this iteration
            losses = {}
            # the original snippet does not show how `batches` was built;
            # the compounding batch sizes here are an assumption
            batches = minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001))
            for batch in batches:
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer,
                           drop=0.20, losses=losses)
            print('Losses', losses)
    return nlp


if __name__ == "__main__":
    # Train formatted data
    # `data` is assumed to be loaded beforehand as (text, {'entities': [...]}) tuples
    model = train_spacy(data, 10)
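I think the cross-validation step should go somewhere inside the for loop over the iterations, but I am not sure. Can someone explain how to do cross-validation with spaCy NER, or is it not needed at all?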
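For reference, in standard k-fold cross-validation the folds wrap the whole training run instead of sitting inside the iteration loop: the data is split into k parts, and a fresh model is trained on k-1 of them and scored on the held-out part. The sketch below is plain Python and not a spaCy API. `k_fold_splits`, `cross_validate`, `train_fn`, and `score_fn` are hypothetical names, and the scoring function is left to the caller.

```python
import random


def k_fold_splits(examples, k=5, seed=0):
    """Yield (train, test) lists; every example lands in exactly one test fold."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled examples into k folds of near-equal size
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        yield train, test


def cross_validate(examples, train_fn, score_fn, k=5):
    """Train a fresh model per fold and return the list of held-out scores."""
    scores = []
    for train, test in k_fold_splits(examples, k=k):
        model = train_fn(train)  # e.g. lambda d: train_spacy(d, 20)
        scores.append(score_fn(model, test))
    return scores
```

With this shape, the training function stays unchanged and is simply called once per fold. The cost is k full training runs, which matters at this data size.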
[Discussion]:
Tags: python-3.x machine-learning nlp spacy