【Question title】: How to tag named entities to prepare training data for custom named entity recognition with spaCy?
【Posted】: 2019-06-25 16:53:45
【Question description】:

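I want to train spaCy's named entity recognizer on my custom dataset. I have prepared a Python dictionary whose keys are entity types and whose values are lists of entity names, but I haven't found a way to tag the text in the correct format.

I tried plain string matching (find) and regular expressions (search, compile), but didn't get what I want.

For example, my sentence and the dictionary I'm using are (this is just a sample):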

sentence = "Machine learning and data mining often employ the same methods and overlap significantly."

dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
 'DM': ['data mining']}

lowered = sentence.lower()  # the dictionary values are lowercase
for k, v in dic.items():
  for val in v:
    if val in lowered:
      # right now I'm just printing the key, val and starting index
      print(k, val, lowered.index(val))

output:
MLDM machine learning and data mining 0
ML machine learning 0
DM data mining 21

expected output: MLDM 0 32

so I can further prepare training data to train spaCy NER:
[{"content":"machine learning and data mining often employ the same methods and overlap significantly.","entities":[[0,32,"MLDM"]]}]

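【Comments on the question】:

  • I don't know Python, but convert the content to lowercase before checking the sentence. After the first index greater than -1, break out of the loop and build the result object from the index and the string length. That way you should get the result you want.
  • Thanks for the comment, @Michael. I think I tried what you describe, but it doesn't work when the same sentence contains more than one or two entities.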

Tags: regex python-3.x spacy named-entity-recognition


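【Solution 1】:

You can build a regex from all the values in dic that matches each of them as a whole word and, on a match, look up the key associated with the matched value. I assume the values are unique across the dictionary, that they may contain spaces, and that they otherwise contain only "word" characters (no special characters such as +, ( or )).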

import re

sentence = "Machine learning and data mining often employ the same methods and overlap significantly."

dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
 'DM': ['data mining']}

def get_key(val):
    # Return the dictionary key whose value list contains val (case-insensitive)
    for k, v in dic.items():
        if val.lower() in map(str.lower, v):
            return k
    return ''

# Flatten the lists in values and sort the list by length in descending order
l=sorted([v for x in dic.values() for v in x], key=len, reverse=True)
# Build the alternation based regex with \b to match each item as a whole word 
rx=r'\b(?:{})\b'.format("|".join(l))
for m in re.finditer(rx, sentence, re.I): # Search case insensitively
    key = get_key(m.group())
    if key:
        print("{} {}".format(key, m.start()))
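
The loop above prints only the label and the start offset, while the question asks for start, end and label in the "content"/"entities" layout. A minimal sketch of that last step, under the same assumptions as the answer (unique values, word characters and spaces only), could look like the following. The helper name build_training_example is hypothetical and not part of spaCy or the original answer.

```python
import re

def build_training_example(sentence, entity_dict):
    """Return {"content": sentence, "entities": [[start, end, label], ...]}."""
    # Map each lowercased phrase to its label (assumes phrases are unique across labels)
    label_of = {phrase.lower(): label
                for label, phrases in entity_dict.items()
                for phrase in phrases}
    # Longest phrases first, so the alternation prefers the longest match at a position
    phrases = sorted(label_of, key=len, reverse=True)
    pattern = re.compile(
        r'\b(?:{})\b'.format("|".join(map(re.escape, phrases))), re.I)
    # finditer yields non-overlapping matches, so the resulting spans never overlap
    entities = [[m.start(), m.end(), label_of[m.group().lower()]]
                for m in pattern.finditer(sentence)]
    return {"content": sentence, "entities": entities}

sentence = "Machine learning and data mining often employ the same methods and overlap significantly."
dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
       'DM': ['data mining']}

example = build_training_example(sentence, dic)
print(example["entities"])  # [[0, 32, 'MLDM']]
```

Because the longest phrase wins, "machine learning" and "data mining" are not reported separately when they occur inside the longer MLDM phrase. If your training script expects spaCy v2-style tuples of the form (text, {"entities": [(start, end, label)]}), the returned dictionary can be reshaped accordingly.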


【Comments】:

  • Thanks @Wiktor, this is exactly what I was hoping for. You saved my day.