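[Posted]: 2019-06-25 16:53:45
[Question]:
I want to train spaCy's named entity recognizer on my own custom dataset. I have prepared a Python dictionary whose keys are entity types and whose values are lists of entity names, but I have no way of tagging the text with character offsets in the right format.
I tried plain string matching (find) and regular expressions (search, compile), but did not get what I want.
For example, this is a sample sentence and the dictionary I am using: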
sentence = ("Machine learning and data mining often employ the same methods "
            "and overlap significantly.")
dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
       'DM': ['data mining']}
text = sentence.lower()  # the dictionary values are lowercase
for k, v in dic.items():
    for val in v:
        if val in text:
            # right now I'm just printing the key, value and starting index
            print(k, val, text.index(val))
output:
MLDM machine learning and data mining 0
ML machine learning 0
DM data mining 21
expected output: MLDM 0 32
so that I can then prepare training data for the spaCy NER model:
[{"content":"machine learning and data mining often employ the same methods and overlap significantly.","entities":[[0,32,"MLDM"]]}]
[Comments]:
-
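I don't know Python, but convert the content to lowercase before checking the sentence. At the first index greater than -1, break out of the loop and build the result object from the index and the string length. That way you should get the result you want.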
-
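Thanks for the comment, @Michael. I think I tried what you are suggesting, but it does not work either when I have more than one or two entities in the same sentence.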
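One possible way to handle several entities in one sentence is to collect every dictionary match first and then keep only the longest non-overlapping spans. The following is a minimal sketch under that assumption, not a confirmed solution from this thread: `find_entities` is a hypothetical helper name, matching is case-insensitive, and the `\b` word boundaries assume the entity names start and end with word characters.

```python
import re

def find_entities(sentence, dic):
    """Return non-overlapping [start, end, label] spans, preferring longer matches."""
    candidates = []
    for label, names in dic.items():
        for name in names:
            # Case-insensitive whole-word match; finditer reports every occurrence.
            pattern = r"\b" + re.escape(name) + r"\b"
            for m in re.finditer(pattern, sentence, flags=re.IGNORECASE):
                candidates.append((m.start(), m.end(), label))
    # Consider the longest spans first, breaking ties by leftmost position.
    candidates.sort(key=lambda span: (-(span[1] - span[0]), span[0]))
    taken = []
    for start, end, label in candidates:
        # Keep a span only if it does not overlap one that was already kept.
        if all(end <= s or start >= e for s, e, _ in taken):
            taken.append((start, end, label))
    return [[s, e, label] for s, e, label in sorted(taken)]

sentence = ("Machine learning and data mining often employ the same methods "
            "and overlap significantly.")
dic = {'MLDM': ['machine learning and data mining'], 'ML': ['machine learning'],
       'DM': ['data mining']}

record = {"content": sentence, "entities": find_entities(sentence, dic)}
print(record["entities"])  # [[0, 32, 'MLDM']]
```

Because shorter names such as ML and DM are dropped only when they fall inside a longer match, a sentence like "Data mining is not machine learning." would still yield both a DM span and an ML span.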
Tags: regex python-3.x spacy named-entity-recognition