【Title】: spaCy default English tokenizer changes when re-assigned
【Posted】: 2021-04-23 09:42:55
【Question】:

When you re-assign the tokenizer of spaCy's (v3.0.5) English model en_core_web_sm, the behavior of its own default tokenizer changes.

You would expect nothing to change, but it fails silently. Why is that?

Code to reproduce:

import spacy

text = "don't you're i'm we're he's"

# No tokenizer assignment, everything is fine
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
[t.lemma_ for t in doc]
>>> ['do', "n't", 'you', 'be', 'I', 'be', 'we', 'be', 'he', 'be']

# Default Tokenizer assignment: tokenization and therefore lemmatization fails
nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab)
doc = nlp(text)
[t.lemma_ for t in doc]
>>> ["don't", "you're", "i'm", "we're", "he's"]

【Question comments】:

  • I think you should try: tokenizer = nlp.Defaults.create_tokenizer(nlp.vocab)
  • AttributeError: type object 'EnglishDefaults' has no attribute 'create_tokenizer' @NirElbaz

Tags: python python-3.x spacy spacy-3


【Solution 1】:

To create a tokenizer that really behaves like the default one, you have to pass all of the language defaults to the Tokenizer class, not just the vocab:

import spacy
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.load('en_core_web_sm')

# Rebuild the exception rules and the prefix/suffix/infix regexes from the language defaults
rules = nlp.Defaults.tokenizer_exceptions
infix_re = compile_infix_regex(nlp.Defaults.infixes)
prefix_re = compile_prefix_regex(nlp.Defaults.prefixes)
suffix_re = compile_suffix_regex(nlp.Defaults.suffixes)

tokenizer = spacy.tokenizer.Tokenizer(
    nlp.vocab,
    rules=rules,
    prefix_search=prefix_re.search,
    suffix_search=suffix_re.search,
    infix_finditer=infix_re.finditer,
)
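
For a quick end-to-end check (a minimal sketch based on the question's example, not part of the original answer), assign the rebuilt tokenizer back to the pipeline and re-run the lemmatization; the output should then match the untouched pipeline from the question:

import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import compile_prefix_regex, compile_suffix_regex, compile_infix_regex

nlp = spacy.load('en_core_web_sm')

# Replace the pipeline's tokenizer with one rebuilt from the same defaults
nlp.tokenizer = Tokenizer(
    nlp.vocab,
    rules=nlp.Defaults.tokenizer_exceptions,
    prefix_search=compile_prefix_regex(nlp.Defaults.prefixes).search,
    suffix_search=compile_suffix_regex(nlp.Defaults.suffixes).search,
    infix_finditer=compile_infix_regex(nlp.Defaults.infixes).finditer,
)

doc = nlp("don't you're i'm we're he's")
print([t.lemma_ for t in doc])
# Expected to match the untouched pipeline in the question:
# ['do', "n't", 'you', 'be', 'I', 'be', 'we', 'be', 'he', 'be']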

【Discussion】:
