【Question Title】: How to tokenize new vocab in spaCy?
【Posted】: 2020-09-20 18:10:05
【Question Description】:

I'm using spaCy to take advantage of its dependency parsing, but I'm having trouble getting the spaCy tokenizer to tokenize new vocabulary I've added. Here is my code:

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_md")

# Look up the new term in the vocab (this creates a lexeme entry,
# but does not change how the tokenizer splits text)
nlp.vocab['bone morphogenetic protein (BMP)-2']

# Replace the default tokenizer with a plain whitespace tokenizer
nlp.tokenizer = Tokenizer(nlp.vocab)

text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'

doc = nlp(text)

print([(token.text,token.tag_) for token in doc])

Output:

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone', 'NN'), ('morphogenetic', 'JJ'), ('protein', 'NN'), ('(BMP)-2', 'NNP'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNP'), ('for', 'IN'), ('BMP', 'NNP'), ('receptor', 'NN'), ('type', 'NN'), ('IB', 'NNP'), ('(BMPRIB).', 'NN')]

Desired output:

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone morphogenetic protein (BMP)-2', 'NN'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NN'), ('for', 'IN'), ('BMP receptor type IB', 'NNP'), ('(', '('), ('BMPRIB', 'NNP'), (')', ')'), ('.', '.')]

How can I get spaCy to tokenize the new vocabulary I've added?

【Question Discussion】:

    Tags: python tokenize spacy vocabulary


    【Solution 1】:

    See whether Doc.retokenize() helps you:

    import spacy
    nlp = spacy.load("en_core_web_md")
    text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'
    
    doc = nlp(text)
    
    with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[6:11])
    
    print([(token.text,token.tag_) for token in doc])
    
    [('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone morphogenetic protein (BMP)-2', 'NN'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNP'), ('for', 'IN'), ('BMP', 'NNP'), ('receptor', 'NN'), ('type', 'NN'), ('IB', 'NNP'), ('(', '-LRB-'), ('BMPRIB', 'NNP'), (')', '-RRB-'), ('.', '.')]
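    The merge span doc[6:11] above is hardcoded for this one sentence. If you keep a list of multi-word terms, the spans can be located by comparing runs of tokens against each term's whitespace-split words. A minimal pure-Python sketch of that lookup (find_merge_spans is a hypothetical helper, not a spaCy API, and it assumes each term's whitespace-split pieces line up with the doc's tokens):

```python
def find_merge_spans(tokens, phrases):
    """Return (start, end) index pairs where a multi-word phrase
    appears as a contiguous run of tokens."""
    spans = []
    for phrase in phrases:
        words = phrase.split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == words:
                spans.append((i, i + n))
    return spans

# Token texts as a whitespace tokenizer would produce them
tokens = ['This', 'study', 'describes', 'the', 'distributions', 'of',
          'bone', 'morphogenetic', 'protein', '(BMP)-2', 'as', 'well', 'as']
print(find_merge_spans(tokens, ['bone morphogenetic protein (BMP)-2']))
# [(6, 10)]
```

    Each (start, end) pair can then be fed to retokenizer.merge(doc[start:end]). Terms whose punctuation the default tokenizer splits further (like the parentheses here) need their word lists adjusted to match the actual token boundaries.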
    

    【Discussion】:

    【Solution 2】:

    I found a solution in nlp.tokenizer.tokens_from_list: I split my sentence into a list of words myself, and then spaCy tokenizes it exactly as given. (Note that tokens_from_list was removed in spaCy v3, where you can instead construct a Doc(nlp.vocab, words=...) directly.)

    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Make the pipeline accept pre-tokenized input: each "text" is a list of words
    nlp.tokenizer = nlp.tokenizer.tokens_from_list

    for doc in nlp.pipe([['This', 'study', 'describes', 'the', 'distributions', 'of', 'bone morphogenetic protein (BMP)-2', 'as', 'well', 'as', 'mRNAs', 'for', 'BMP receptor type IB', '(', 'BMPRIB', ')', '.']]):
        for token in doc:
            print(token, '//', token.dep_)
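    Building that word list by hand is tedious. A small pre-tokenizer can split on whitespace while keeping known multi-word terms intact; a sketch in plain Python (protect_phrases is a hypothetical helper, not part of spaCy, and it assumes the terms appear verbatim in the text; punctuation attached to other words is not split off):

```python
def protect_phrases(text, phrases):
    """Whitespace-tokenize `text`, but keep each phrase in `phrases`
    together as a single token. Longer phrases are matched first."""
    tokens = text.split()
    for phrase in sorted(phrases, key=len, reverse=True):
        words = phrase.split()
        n = len(words)
        merged = []
        i = 0
        while i < len(tokens):
            if tokens[i:i + n] == words:
                merged.append(phrase)  # keep the whole term as one token
                i += n
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

text = 'distributions of bone morphogenetic protein (BMP)-2 as well'
print(protect_phrases(text, ['bone morphogenetic protein (BMP)-2']))
# ['distributions', 'of', 'bone morphogenetic protein (BMP)-2', 'as', 'well']
```

    The resulting lists can be passed to nlp.pipe exactly as in the answer above.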
    

    【Discussion】:
