spacy - pattern matching: answers

【Question title】: spacy - pattern matching
【Posted】: 2020-12-05 22:40:25
【Question description】:

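I'd like to see how spaCy pattern matching can be used to find product categories referenced in text. I'm clearly not building the pattern correctly.

I want to identify CAT-POS-2299 as a product. I've tried a few variations. How would you do this, ideally with a more generic pattern such as CAT-???-???

Maybe I should be using something else?

Code: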

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

matcher.add("Product", None,
            [{"LOWER": "CAT"},{"LOWER":"-"},{"LOWER":"POS"},{"LOWER":"-"},{"IS_DIGIT":True}]
           )

doc = nlp(" We have a new product CAT-POS-2299 that will be available to users soon.")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

【Question comments】:

Tags: python spacy

【Solution 1】:

If you check how the input string is tokenized, you'll see that POS-2299 comes out as a single token:

print([t.text for t in doc])
[' ', 'We', 'have', 'a', 'new', 'product', 'CAT', '-', 'POS-2299', 'that', 'will', 'be', 'available', 'to', 'users', 'soon', '.']

So, if you want to match the word CAT case-insensitively, then a - token, then a token made of ASCII letters only followed by - and one or more digits, you can use

matcher.add("Product", None, [{"TEXT": {"REGEX": "(?i)CAT"}},{"TEXT":"-"},{"TEXT": {"REGEX": r"(?i)[A-Z]+-\d+"}}])
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(match_id, string_id, start, end, span.text)

# => 16898055450696666743 Product 6 9 CAT-POS-2299
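For reference, here is the same pattern as a self-contained sketch. It makes two assumptions that are not in the snippet above: it uses `spacy.blank("en")` (tokenizer only, so no model download is needed), and it passes the patterns to `matcher.add` as a list, which is the call form I believe newer spaCy releases expect instead of the `None` placeholder. The leading space in the text is dropped, so the token offsets differ from the output above.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline is enough here, since only the
# tokenizer matters for the Matcher.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    {"TEXT": {"REGEX": r"(?i)CAT"}},         # the CAT token, any case
    {"TEXT": "-"},                           # the hyphen split off by the tokenizer
    {"TEXT": {"REGEX": r"(?i)[A-Z]+-\d+"}},  # e.g. POS-2299, kept as one token
]
# Assumption: list-of-patterns call form (newer spaCy releases).
matcher.add("Product", [pattern])

doc = nlp("We have a new product CAT-POS-2299 that will be available to users soon.")
found = [doc[start:end].text for _, start, end in matcher(doc)]
print(found)
```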

Since you want to make the pattern more generic, I think using the REGEX operator makes sense.
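If "something else" is an option, one alternative is to run a regular expression over the whole text and map each hit back to tokens with `Doc.char_span`. That way the pattern does not depend on where the tokenizer happens to split the hyphens. This is only a sketch: the `CAT-<part>-<part>` shape, the `PRODUCT` label and the names `product_re` and `products` are my assumptions about what a generic product code looks like, not something from your data.

```python
import re
import spacy

# Assumption: tokenizer-only pipeline, no trained model required.
nlp = spacy.blank("en")
doc = nlp("We have a new product CAT-POS-2299 that will be available to users soon.")

# Hypothetical generic shape: CAT-<letters/digits>-<letters/digits>
product_re = re.compile(r"\bCAT-[A-Z0-9]+-[A-Z0-9]+\b", re.IGNORECASE)

products = []
for m in product_re.finditer(doc.text):
    # char_span returns None when the character offsets do not line up
    # with token boundaries, so those hits are skipped.
    span = doc.char_span(m.start(), m.end(), label="PRODUCT")
    if span is not None:
        products.append(span)

product_texts = [span.text for span in products]
print(product_texts)
```

The resulting spans carry the `PRODUCT` label, so they can be inspected or post-processed like any other `Span`.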

Notes:

(?i)CAT - matches CAT case-insensitively
(?i)[A-Z]+-\d+ - matches one or more letters ([A-Z], case-insensitive because of (?i)), then -, then one or more digits (\d+).
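One caveat about the two expressions above: as far as I know, the REGEX operator searches inside the token text rather than requiring a full match, so an unanchored `(?i)CAT` would also accept a token like `category`. The plain `re` sketch below shows the difference on a hypothetical token list; adding `^` and `$` is my suggestion and not part of the original pattern.

```python
import re

# Hypothetical token texts; the first three come from the tokenization shown above.
tokens = ["CAT", "-", "POS-2299", "category", "XPOS-2299b"]

loose_cat = re.compile(r"(?i)CAT")
strict_cat = re.compile(r"(?i)^CAT$")
loose_code = re.compile(r"(?i)[A-Z]+-\d+")
strict_code = re.compile(r"(?i)^[A-Z]+-\d+$")

# search() succeeds if the pattern occurs anywhere in the token text.
loose_cat_hits = [t for t in tokens if loose_cat.search(t)]
strict_cat_hits = [t for t in tokens if strict_cat.search(t)]
loose_code_hits = [t for t in tokens if loose_code.search(t)]
strict_code_hits = [t for t in tokens if strict_code.search(t)]

print(loose_cat_hits)    # ['CAT', 'category']
print(strict_cat_hits)   # ['CAT']
print(loose_code_hits)   # ['POS-2299', 'XPOS-2299b']
print(strict_code_hits)  # ['POS-2299']
```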

【Comments】:

Thanks, Wiktor! I didn't even think to look at the tokenization.