【Question Title】: How to tokenize new vocab in spaCy?
【Posted】: 2020-09-20 18:10:05
【Question Description】:

I'm using spaCy to take advantage of its dependency parsing, but I'm having trouble getting the spaCy tokenizer to tokenize new vocabulary I've added. Here is my code:

import spacy
from spacy.tokenizer import Tokenizer

nlp = spacy.load("en_core_web_md")

# Look up the new term in the vocab (this creates a lexeme entry,
# but does not change how the tokenizer splits text)
nlp.vocab['bone morphogenetic protein (BMP)-2']

# Replace the default tokenizer with a plain whitespace tokenizer
nlp.tokenizer = Tokenizer(nlp.vocab)

text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'

doc = nlp(text)

print([(token.text,token.tag_) for token in doc])

Output:

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone', 'NN'), ('morphogenetic', 'JJ'), ('protein', 'NN'), ('(BMP)-2', 'NNP'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNP'), ('for', 'IN'), ('BMP', 'NNP'), ('receptor', 'NN'), ('type', 'NN'), ('IB', 'NNP'), ('(BMPRIB).', 'NN')]

Desired output:

[('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone morphogenetic protein (BMP)-2', 'NN'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NN'), ('for', 'IN'), ('BMP receptor type IB', 'NNP'), ('(', '('), ('BMPRIB', 'NNP'), (')', ')'), ('.', '.')]

How can I get spaCy to tokenize the new vocabulary I've added?

【Question Discussion】:

    Tags: python tokenize spacy vocabulary


    【Solution 1】:

    See whether Doc.retokenize() helps you:

    import spacy
    nlp = spacy.load("en_core_web_md")
    text = 'This study describes the distributions of bone morphogenetic protein (BMP)-2 as well as mRNAs for BMP receptor type IB (BMPRIB).'
    
    doc = nlp(text)
    
    with doc.retokenize() as retokenizer:
        retokenizer.merge(doc[6:11])
    
    print([(token.text,token.tag_) for token in doc])
    
    [('This', 'DT'), ('study', 'NN'), ('describes', 'VBZ'), ('the', 'DT'), ('distributions', 'NNS'), ('of', 'IN'), ('bone morphogenetic protein (BMP)-2', 'NN'), ('as', 'RB'), ('well', 'RB'), ('as', 'IN'), ('mRNAs', 'NNP'), ('for', 'IN'), ('BMP', 'NNP'), ('receptor', 'NN'), ('type', 'NN'), ('IB', 'NNP'), ('(', '-LRB-'), ('BMPRIB', 'NNP'), (')', '-RRB-'), ('.', '.')]
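    The merge span doc[6:11] above is hardcoded for this one sentence. If you keep a list of multi-word terms, the spans can be located by comparing runs of tokens against each term's whitespace-split words. A minimal pure-Python sketch of that lookup (find_merge_spans is a hypothetical helper, not a spaCy API, and it assumes each term's whitespace-split pieces line up with the doc's tokens):

```python
def find_merge_spans(tokens, phrases):
    """Return (start, end) index pairs where a multi-word phrase
    appears as a contiguous run of tokens."""
    spans = []
    for phrase in phrases:
        words = phrase.split()
        n = len(words)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == words:
                spans.append((i, i + n))
    return spans

# Token texts as a whitespace tokenizer would produce them
tokens = ['This', 'study', 'describes', 'the', 'distributions', 'of',
          'bone', 'morphogenetic', 'protein', '(BMP)-2', 'as', 'well', 'as']
print(find_merge_spans(tokens, ['bone morphogenetic protein (BMP)-2']))
# [(6, 10)]
```

    Each (start, end) pair can then be fed to retokenizer.merge(doc[start:end]). Terms whose punctuation the default tokenizer splits further (like the parentheses here) need their word lists adjusted to match the actual token boundaries.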
    

    【Discussion】:

    【Solution 2】:

    I found a solution in nlp.tokenizer.tokens_from_list: I split my sentence into a list of words myself, and then spaCy tokenizes it exactly as given. (Note that tokens_from_list was removed in spaCy v3, where you can instead construct a Doc(nlp.vocab, words=...) directly.)

    import spacy

    nlp = spacy.load("en_core_web_sm")

    # Make the pipeline accept pre-tokenized input: each "text" is a list of words
    nlp.tokenizer = nlp.tokenizer.tokens_from_list

    for doc in nlp.pipe([['This', 'study', 'describes', 'the', 'distributions', 'of', 'bone morphogenetic protein (BMP)-2', 'as', 'well', 'as', 'mRNAs', 'for', 'BMP receptor type IB', '(', 'BMPRIB', ')', '.']]):
        for token in doc:
            print(token, '//', token.dep_)
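    Building that word list by hand is tedious. A small pre-tokenizer can split on whitespace while keeping known multi-word terms intact; a sketch in plain Python (protect_phrases is a hypothetical helper, not part of spaCy, and it assumes the terms appear verbatim in the text; punctuation attached to other words is not split off):

```python
def protect_phrases(text, phrases):
    """Whitespace-tokenize `text`, but keep each phrase in `phrases`
    together as a single token. Longer phrases are matched first."""
    tokens = text.split()
    for phrase in sorted(phrases, key=len, reverse=True):
        words = phrase.split()
        n = len(words)
        merged = []
        i = 0
        while i < len(tokens):
            if tokens[i:i + n] == words:
                merged.append(phrase)  # keep the whole term as one token
                i += n
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

text = 'distributions of bone morphogenetic protein (BMP)-2 as well'
print(protect_phrases(text, ['bone morphogenetic protein (BMP)-2']))
# ['distributions', 'of', 'bone morphogenetic protein (BMP)-2', 'as', 'well']
```

    The resulting lists can be passed to nlp.pipe exactly as in the answer above.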
    

    【Discussion】:
