Make Spacy tokenizer not split on / (answers)

【Question title】: Make Spacy tokenizer not split on /
【Posted】: 2022-11-02 00:35:04
【Question】:

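How can I modify the English tokenizer so that it does not split tokens on the '/' character?

For example, the following string should be a single token: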


import spacy

nlp = spacy.load('en_core_web_md')
doc = nlp("12/AB/568793")

for t in doc:
    print(f"[{t.pos_} {t.text}]")

# produces
#[NUM 12]
#[SYM /]
#[ADJ AB/568793]

【Comments】:

What is nlp?

Tags: python nlp spacy

【Solution 1】:

This approach is a variation on removing a rule, as described under "Modifying existing rule sets" in the spaCy documentation:


import spacy

nlp = spacy.load('en_core_web_md')
infixes = nlp.Defaults.infixes
# there seems to be just one rule that splits on /'s
assert len([x for x in infixes if '/' in x]) == 1
# remove that rule; then rebuild the tokenizer's infix matcher
infixes = [x for x in infixes if '/' not in x]
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

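【Comments】:

This answer is a good starting point, but it is not technically correct: the rule that contains '/' also covers other characters, including '=', '<', and '>'. Simply deleting it also disables infix splitting on those characters. So I suggest modifying the rule rather than deleting it. Because of the comment length limit, I posted the detailed code as a separate answer.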
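
To make the point about side effects concrete, here is a minimal sketch that uses only Python's `re` module, so it runs without a spaCy model. `SLASH_RULE` and `OTHER_RULE` are simplified, ASCII-only stand-ins for spaCy's infix rules (the real rules use a much larger alphabetic character class), and `split_infix` is a hypothetical helper, not part of spaCy.

```python
import re

# Assumption: ASCII-only approximations of two spaCy infix rules.
SLASH_RULE = r"(?<=[A-Za-z0-9])[:<>=/](?=[A-Za-z])"
OTHER_RULE = r"(?<=[0-9])[+\-*^](?=[0-9-])"


def split_infix(text, rules):
    """Hypothetical helper: split text around every match of any rule."""
    if not rules:
        return [text]
    pattern = re.compile("|".join(rules))
    pieces, start = [], 0
    for match in pattern.finditer(text):
        if match.start() > start:
            pieces.append(text[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


all_rules = [OTHER_RULE, SLASH_RULE]
# Same filter as in the answer: drop every rule that mentions '/'.
without_rule = [r for r in all_rules if "/" not in r]

for text in ["12/AB/568793", "A=B"]:
    print(text, split_infix(text, all_rules), split_infix(text, without_rule))
```

With the whole rule dropped, "12/AB/568793" stays in one piece, but "A=B" is no longer split either, which is the side effect described above.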

【Solution 2】:

@Dave's answer is a good starting point, but I think the right approach is to modify the rule rather than delete it.

import spacy

nlp = spacy.load('en_core_web_md')
infixes = nlp.Defaults.infixes
# pick the (single) infix rule that mentions '/'
rule_slash = [x for x in infixes if '/' in x][0]
print(rule_slash)  # check the rule

You will see that the rule also covers other characters, including '=', '<', and '>'.

So we remove only the '/' from the rule:

rule_slash_new = rule_slash.replace('/', '')
# replace the old rule with the new rule
infixes = [rule_slash_new if r == rule_slash else r for r in infixes]
nlp.tokenizer.infix_finditer = spacy.util.compile_infix_regex(infixes).finditer

This way the tokenizer still splits correctly in cases such as "A=B" or "A>B".
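
As a quick sanity check of that claim, the sketch below applies the same `.replace('/', '')` edit to a simplified pattern and compares the matches before and after. The value assigned to `rule_slash` here is an assumption: an ASCII-only approximation of what `print(rule_slash)` shows, not the exact string from spaCy.

```python
import re

# Assumption: ASCII-only approximation of spaCy's infix rule containing '/'.
rule_slash = r"(?<=[A-Za-z0-9])[:<>=/](?=[A-Za-z])"
# Same edit as in the answer: drop only the '/' from the character class.
rule_slash_new = rule_slash.replace("/", "")

old = re.compile(rule_slash)
new = re.compile(rule_slash_new)

for text in ["12/AB/568793", "A=B", "A>B"]:
    old_hits = [m.group() for m in old.finditer(text)]
    new_hits = [m.group() for m in new.finditer(text)]
    print(f"{text!r}: old rule splits on {old_hits}, new rule splits on {new_hits}")
```

Note that `str.replace` removes every '/' in the pattern string. That is harmless in this simplified pattern, where the only '/' sits inside the character class, but it is worth confirming against the printed rule before relying on it.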

【Comments】: