【Question Title】: Add some custom words to tokenizer in Spacy
【Posted】: 2019-11-04 01:46:16
【Question】:

I have a sentence and want the tokenizer to produce the expected tokens shown below.

Sentence: "[x] works for [y] in [z]."
Tokens: ["[", "x", "]", "works", "for", "[", "y", "]", "in", "[", "z", "]", "."]
Expected: ["[x]", "works", "for", "[y]", "in", "[z]", "."]

How can I achieve this with a custom tokenizer?
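
For reference, a minimal sketch reproducing the default behaviour (assuming the en_core_web_sm model is installed):

import spacy

nlp = spacy.load('en_core_web_sm')
# The default tokenizer treats '[' and ']' as prefix/suffix punctuation,
# so each bracket comes out as its own token
print([t.text for t in nlp("[x] works for [y] in [z].")])
# ['[', 'x', ']', 'works', 'for', '[', 'y', ']', 'in', '[', 'z', ']', '.']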

【Comments】:

  • Welcome to Stack Overflow. Please read the help pages on how to ask a good question, so that community users can answer it and you avoid downvotes.

Tags: python tokenize spacy


【Solution 1】:

You can remove [ and ] from the tokenizer's prefixes and suffixes, so that the brackets are no longer split off from the adjacent tokens:

import spacy

nlp = spacy.load('en_core_web_sm')

# Rebuild the prefix regex without the escaped opening bracket,
# so '[' is no longer split off the start of a token
prefixes = list(nlp.Defaults.prefixes)
prefixes.remove('\\[')
prefix_regex = spacy.util.compile_prefix_regex(prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search

# Likewise rebuild the suffix regex without the escaped closing bracket,
# so ']' is no longer split off the end of a token
suffixes = list(nlp.Defaults.suffixes)
suffixes.remove('\\]')
suffix_regex = spacy.util.compile_suffix_regex(suffixes)
nlp.tokenizer.suffix_search = suffix_regex.search

doc = nlp("[x] works for [y] in [z].")
print([t.text for t in doc])
# ['[x]', 'works', 'for', '[y]', 'in', '[z]', '.']

The relevant documentation is here:

https://spacy.io/usage/linguistic-features#native-tokenizer-additions
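
If you would rather leave the tokenizer rules untouched, an alternative is to merge the bracketed tokens after tokenization with a Matcher and Doc.retokenize(). A minimal sketch (the rule name "BRACKETED" is arbitrary; matcher.add is shown with the spaCy v3 signature, while older v2 releases use matcher.add("BRACKETED", None, pattern)):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')

# Match the three-token sequence '[', <alphabetic token>, ']'
pattern = [{"ORTH": "["}, {"IS_ALPHA": True}, {"ORTH": "]"}]
matcher = Matcher(nlp.vocab)
matcher.add("BRACKETED", [pattern])

doc = nlp("[x] works for [y] in [z].")
# Merge each matched span back into a single token
with doc.retokenize() as retokenizer:
    for match_id, start, end in matcher(doc):
        retokenizer.merge(doc[start:end])

print([t.text for t in doc])
# ['[x]', 'works', 'for', '[y]', 'in', '[z]', '.']

This keeps the default tokenization rules intact for any other brackets in your text, at the cost of an extra post-processing pass over each document.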

【Discussion】:
