Updating spacy.tokenizer._get_regex_pattern(nlp.tokenizer.token_match) so that hashtags are tokenized as a single token: answers

【Question title】: Updating spacy.tokenizer._get_regex_pattern(nlp.tokenizer.token_match) so that hashtags are tokenized as a single token
【Posted】: 2021-06-18 15:02:45
【Question】:

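This is my first time using spaCy, and I am trying to learn how to edit the tokenizer on one of the pretrained models (en_core_web_md) so that when tweets are tokenized, the entire hashtag becomes a single token (for example, I want one token '#hashtagText'; the default is two tokens, '#' and 'hashtagText').

I know I am not the first person to run into this. I have tried implementing the advice found elsewhere online, but after applying those methods the output stays the same (#hashtagText is two tokens). These articles show the approaches I have tried.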

https://the-fintech-guy.medium.com/spacy-handling-of-hashtags-and-dollartags-ed1e661f203c

https://towardsdatascience.com/pre-processing-should-extract-context-specific-features-4d01f6669a7e

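As the code below shows, my troubleshooting steps were:

Save the default token matching regex (default_token_matching_regex).
Save the regex that nlp (the pretrained model) is using before any updates (nlp_token_matching_regex_pre_update).

Note: I originally suspected these would be the same, but they are not. See the output below.

Append the regex I need (#\w+) to the pattern nlp is currently using, and save the combination as updated_token_matching_regex.
Update the regex nlp is using with the variable created above (updated_token_matching_regex).
Save the new regex nlp is using to verify that it was updated correctly (nlp_token_matching_regex_post_update).

See the code below: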

import spacy
import en_core_web_md
import re

nlp = en_core_web_md.load()

# spaCy's default token matching regex.
default_token_matching_regex = spacy.tokenizer._get_regex_pattern(nlp.Defaults.token_match)

# Verify what regex nlp is using before changing anything.
nlp_token_matching_regex_pre_update = spacy.tokenizer._get_regex_pattern(nlp.tokenizer.token_match)

# Create a new regex that combines the default regex and a term to treat hashtags as a single token.
# Raw f-string so that \w is passed to the regex engine unchanged.
updated_token_matching_regex = rf"({nlp_token_matching_regex_pre_update}|#\w+)"

# Update the token matching regex used by nlp with the regex created in the line above.
nlp.tokenizer.token_match = re.compile(updated_token_matching_regex).match

# Verify that nlp is now using the updated regex.
nlp_token_matching_regex_post_update = spacy.tokenizer._get_regex_pattern(nlp.tokenizer.token_match)

# Now let's try again
s = "2020 can't get any worse #ihate2020 @bestfriend <https://t.co>"
doc = nlp(s)

# Let's look at the lemmas and is stopword of each token
print(f"Token\t\tLemma\t\tStopword")
print("="*40)
for token in doc:
    print(f"{token}\t\t{token.lemma_}\t\t{token.is_stop}")

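As you can see above, even with '#\w+' added, the tokenization is not what it should be. See below for printouts of all the troubleshooting variables.

Since I feel I have shown above that I did correctly update the regex nlp is using, the only remaining problem I can think of is that the regex itself is wrong. I tested the regex on its own and it seems to behave as expected, see below:

Can anyone spot the error that causes nlp to tokenize #hashTagText as two tokens even after its nlp.tokenizer.token_match regex has been updated to treat it as a single token?

Thanks!!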
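The original test output is not reproduced here. A minimal sketch of such a standalone check, using only Python's re module and the sample tweet from the code above, might look like this (the variable names are illustrative, not from the original post):

```python
import re

# Same pattern as in the question; a raw string keeps \w intact for the regex engine.
hashtag_pattern = re.compile(r"#\w+")

sample = "2020 can't get any worse #ihate2020 @bestfriend <https://t.co>"

# findall returns every non-overlapping match in the string.
hashtags = hashtag_pattern.findall(sample)
print(hashtags)  # ['#ihate2020']
```

This only shows that the pattern finds the hashtag in isolation; it says nothing about when the tokenizer consults it.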

【Comments】:

Tags: python nlp spacy tokenize hashtag

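【Answer 1】:

I am not sure this is the best solution, but I did find a way to make it work. See below:

spaCy provides the chart below, which shows the order in which things are processed during tokenization.

Using the tokenizer.explain() method, I could see that the hashtag was being split off because of a prefix rule. Viewing the tokenizer.explain() output is simple: just run the following code, where first_tweet is any string.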

tweet_doc = nlp.tokenizer.explain(first_tweet)
for token in tweet_doc:
    print(token)

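Next, referring to the chart above, we see that prefix rules are the first thing applied in the tokenization process.

This means that even though I updated the token_match rule with a regex that allows '#Text' to stay a single token, it does not matter, because by the time the token_match rule is evaluated the prefix rule has already split the '#' from the text.

Since this is a Twitter project, I will never want '#' treated as a prefix. So my solution was to remove '#' from the list of prefixes considered, which is done with the following code: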

default_prefixes = list(nlp.Defaults.prefixes)
default_prefixes.remove('#')
prefix_regex = spacy.util.compile_prefix_regex(default_prefixes)
nlp.tokenizer.prefix_search = prefix_regex.search

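That's it! I hope this solution helps someone else.

Final thoughts:

spaCy was recently updated to version 3.0. I am curious whether earlier versions of the pretrained spaCy models did not include '#' in the prefix list. That is the only explanation I can come up with for why the code in the previously published articles no longer seems to work as intended. If anyone can explain in detail why my solution seems so much more complicated than the ones in the articles I linked earlier, I would certainly love to learn.

Cheers.

-Braden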
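To check the effect of removing the prefix, a sketch along the following lines could be used. It assumes a blank English pipeline (spacy.blank("en")) in place of en_core_web_md, on the assumption that both use the same default English tokenizer rules; the sample text and variable names are illustrative.

```python
import spacy
from spacy.util import compile_prefix_regex

# Assumption: a blank English pipeline stands in for en_core_web_md here.
nlp = spacy.blank("en")
text = "#ihate2020"

# explain() returns (rule_name, token_text) pairs showing which rule produced each token.
before = nlp.tokenizer.explain(text)
print(before)

# Rebuild the prefix regex without '#', as in the answer above.
prefixes = [p for p in nlp.Defaults.prefixes if p != "#"]
nlp.tokenizer.prefix_search = compile_prefix_regex(prefixes).search

after = [t.text for t in nlp(text)]
print(after)
```

With default settings the first printout should show '#' split off by a prefix rule, and the second should show the hashtag as one token.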

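【Comments】:

The "exceptions" in the chart refer to the tokenizer exceptions in tokenizer.rules, not to tokenizer.token_match, which is checked before any of the other patterns.

【Answer 2】:

The default token_match for English is None (as of v2.3.0, now that the URL pattern lives in url_match), so you can simply overwrite it with your new pattern: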

import re
import spacy
nlp = spacy.blank("en")
nlp.tokenizer.token_match = re.compile(r"^#\w+$").match
assert [t.text for t in nlp("#asdf1234")] == ["#asdf1234"]

The example in your question ends up with the pattern (None|#\w+), which is not exactly what you want, but it seems to work fine for this example with v2.3.5 and v3.0.5:

Token           Lemma           Stopword
========================================
2020            2020            False
ca              ca              True
n't             n't             True
get             get             True
any             any             True
worse           bad             False
#ihate2020      #ihate2020      False
@bestfriend     @bestfriend     False
<               <               False
https://t.co    https://t.co    False
>               >               False
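Why (None|#\w+) is "not exactly what you want" can be seen with plain re, without spaCy. The sketch below assumes the pre-update lookup returned None, which is consistent with the pattern quoted above; the names loose_match and strict_match are illustrative.

```python
import re

# Assumption: the question's pre-update lookup yielded None because token_match was None.
pattern_pre_update = None

# Formatting None into the string produces the literal text "None" as an alternative.
combined = rf"({pattern_pre_update}|#\w+)"
print(combined)  # (None|#\w+)

loose_match = re.compile(combined).match          # pattern built as in the question
strict_match = re.compile(r"^#\w+$").match        # anchored pattern from this answer

print(bool(loose_match("#ihate2020")))    # True
print(bool(loose_match("Nonetheless")))   # True: the literal "None" alternative matches
print(bool(loose_match("#tag,")))         # True: .match only anchors at the start
print(bool(strict_match("#ihate2020")))   # True
print(bool(strict_match("#tag,")))        # False: trailing comma is rejected
```

The anchored pattern only accepts a substring that is a hashtag from start to end, which is presumably why the answer's version uses ^ and $.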

【Comments】: