spaCy: finding text before a specified word

【Question title】: Spacy find text before specified word
【Posted】: 2021-05-07 00:24:44
【Question】:

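I am working with the sentence Hall is a Tony Award winner and Grammy nominee and want to use the spaCy rule-based Matcher to extract the award that was won (Tony Award), but I can't seem to tell spaCy to look for the words that come before winner. Is that possible? If so, how?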

import csv

import en_core_web_sm
from spacy.matcher import Matcher
from tqdm import tqdm

nlp = en_core_web_sm.load()

awards_lexical = [
            {'TEXT': {'REGEX': r'\s*'}, 'OP': '*'},
            {'IS_PUNCT': True, 'OP': '*'},
            {'TEXT': {'REGEX': '^(winner|recipient)$'}},
            {'OP': '+'},
            ]
def matching(doc, pattern):
    result = []
    for sent in doc.sents:
        matcher = Matcher(nlp.vocab) 
        matcher.add("matching", [pattern])  # list-of-patterns form (spaCy >= 2.2.2, required in v3)

        matches = matcher(nlp(str(sent))) 
        if len(matches)>0:
            match = matches[-1]
            span = sent[match[1]:match[2]] 
            result.append(span.text)

    return result

csv_reader = csv.reader(open('Matheus_Schmitz_hw02_bios.csv', encoding='utf-8'))
limit = 500
count = 0

# Mode 'w' already truncates the file, so no separate open/close is needed
with open('hw2_lexical.jl', 'w') as hw2_lexical:
    for (idx, (url, bio)) in tqdm(enumerate(csv_reader), total=limit):
        count += 1
        result = {}
        result['url'] = url
        result['awards'] = matching(nlp(bio), awards_lexical)        
        hw2_lexical.write(str(result)+'\n')
        if count>=limit:
            break
print(count)

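From my code, I expected spaCy to include any text before the chosen word, but every variation I have come up with only gives me the text from winner|won|awarded onwards, not the text before it, which is where the award name most commonly is.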

【Comments】:

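Your code is unclear; could you please create an MCVE (minimal complete verifiable example)? What is matching? Also, if you want to match something before winner, why not use winner in the pattern you are using? Note that won|awarded|award-winning seem to come before the award name, don't they?
Also, do you want to "add" the winner condition to the existing awards_lexical rule, or are you considering adding another pattern here? What is the winner pattern? How do you define it (for extraction)?
I would define the winner pattern as one or more capitalized words followed by winner or recipient. That already gives me a hint, since I had not thought of filtering on capitalized words!
Sorry, I forgot to reply, but yes! After adapting it to my full code, it works well.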

Tags: python regex nlp spacy

【Solution 1】:

Your idea seems to work: you can extract one or more capitalized words (matched here as proper nouns) followed by winner or recipient using

import spacy
from spacy.matcher import Matcher

text= "Hall is a Tony Award winner and Grammy nominee"
nlp = spacy.load("en_core_web_lg")
matcher = Matcher(nlp.vocab)
# List-of-patterns form of Matcher.add (spaCy >= 2.2.2, required in v3)
matcher.add("Winner", [[{'POS': 'PROPN', 'OP':'+'}, {'TEXT': {'REGEX': '(?i)^(?:winner|recipient)$'}}]])
doc = nlp(text)
matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
for span in spacy.util.filter_spans(spans):
    print(span.text)
# => Tony Award winner

The (?i)^(?:winner|recipient)$ regex, used as the right-hand token in the pattern, matches a whole winner or recipient token case-insensitively.
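To see what the anchors and the (?i) flag do, the same regex can be tried outside spaCy with the standard re module. This is only a sketch: is_trigger is a hypothetical helper name, and it assumes (as I understand the Matcher's REGEX operator) that the regex is applied to each token's text with search semantics, which is why ^ and $ are needed to force a whole-token match.

```python
import re

# Same regex as the right-hand token of the pattern
TRIGGER = re.compile(r"(?i)^(?:winner|recipient)$")


def is_trigger(token_text: str) -> bool:
    """Return True if the whole token is 'winner' or 'recipient', ignoring case."""
    # search() plus ^...$ anchors means the entire token text must match
    return TRIGGER.search(token_text) is not None


for text in ["winner", "Winner", "RECIPIENT", "winners", "award-winner"]:
    print(text, is_trigger(text))
```

Without the anchors, tokens such as winners would also pass, because search looks for the alternation anywhere inside the token text.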

【Comments】:

Thank you very much! Once adapted to my code, it works perfectly :)
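Since the question asks for the award name only (Tony Award) rather than the whole match, one option is to drop the trailing trigger token from each matched span. A minimal sketch under stated assumptions: it uses a blank English pipeline and an IS_TITLE check for "capitalized word" in place of the POS-based pattern, so no trained model is needed, and extract_awards is a hypothetical helper name, not part of spaCy.

```python
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

# Assumption: a blank pipeline (tokenizer only) is enough, because the
# pattern relies on lexical attributes rather than POS tags.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [
    {"IS_TITLE": True, "OP": "+"},               # one or more capitalized tokens
    {"LOWER": {"IN": ["winner", "recipient"]}},  # trigger word, case-insensitive
]
matcher.add("AWARD", [pattern])


def extract_awards(text):
    """Return award names that directly precede 'winner' or 'recipient'."""
    doc = nlp(text)
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    # filter_spans keeps the longest of any overlapping matches;
    # span[:-1] then drops the trailing trigger token.
    return [span[:-1].text for span in filter_spans(spans)]


awards = extract_awards("Hall is a Tony Award winner and Grammy nominee")
print(awards)
```

IS_TITLE is looser than a proper-noun tag: a capitalized sentence-initial word directly before the trigger would also be picked up, so the POS-based pattern may be the better choice when a trained model is available.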