【Question title】: spaCy: remove only org and person names
【Posted】: 2021-10-07 01:28:40
【Question description】:

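I wrote the following function, which removes all named entities from a text. How can I modify it to remove only organization and person names? I don't want the 6 in $6 below to be removed. Thanks.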

import spacy
sp = spacy.load('en_core_web_sm')
def NER_removal(text):
    document = sp(text)
    
    text_no_namedentities = []
    
    ents = [e.text for e in document.ents]
    for item in document:
        if item.text in ents:
            pass
        else:
            text_no_namedentities.append(item.text)
    return (" ".join(text_no_namedentities))


NER_removal("John loves to play at Sofi stadium at 6.00 PM and he earns $6")
'loves to play at stadium at 6.00 PM and he earns $'

【Question comments】:

    Tags: python pandas nlp spacy


    【Solution 1】:

    I think item.ent_type_ would be useful here.

    import spacy
    sp = spacy.load('en_core_web_sm')
    def NER_removal(text):
        document = sp(text)
        text_no_namedentities = []
        # define ent types not to remove
        ent_types_to_stay = ["MONEY"]
        ents = [e.text for e in document.ents]
        for item in document:
            # add condition to leave defined ent types
            if all((item.text in ents, item.ent_type_ not in ent_types_to_stay)):
                pass
            else:
                text_no_namedentities.append(item.text)
        return (" ".join(text_no_namedentities))
    
    print(NER_removal("John loves to play at Sofi stadium at 6.00 PM and he earns $6"))
    # loves to play at Sofi stadium at 6.00 PM and he earns $ 6
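Since the question asks to drop only organizations and people, another option is to list the labels to remove and test each token's ent_type_ against that list, instead of listing the labels to keep. The following is a minimal sketch, not a definitive implementation. It assumes the en_core_web_sm label names PERSON and ORG; the names LABELS_TO_REMOVE, drop_labeled_tokens, _load_model and ner_removal are hypothetical. Which spans actually get tagged depends on the model version, so no output is shown.

```python
from functools import lru_cache

# Assumed label names (as used by en_core_web_sm); adjust for other models.
LABELS_TO_REMOVE = frozenset({"PERSON", "ORG"})


def drop_labeled_tokens(tokens, labels=LABELS_TO_REMOVE):
    """Join the text of every token whose entity label is not in `labels`."""
    # ent_type_ is an empty string for tokens outside any entity,
    # so ordinary words are always kept.
    kept = [tok.text for tok in tokens if tok.ent_type_ not in labels]
    return " ".join(kept)


@lru_cache(maxsize=1)
def _load_model():
    # Imported lazily so drop_labeled_tokens works on any token-like objects.
    import spacy
    return spacy.load("en_core_web_sm")


def ner_removal(text):
    """Remove PERSON and ORG tokens from `text` (hypothetical wrapper)."""
    return drop_labeled_tokens(_load_model()(text))
```

Checking the label per token also handles multi-token entities, which comparing item.text against whole entity strings does not.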
    

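    【Comments】:

    • How can we avoid the extra space? For example, the original sentence has $6, but our final output is $ 6.
    • You would have to write another condition for that. spaCy's tokenizer splits $ and 6 into separate tokens in your case, so they end up separated when you call " ".join().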
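One way to avoid the extra space, sketched here under assumptions, is to join on each token's text_with_ws (the token text plus whatever whitespace followed it in the input) instead of calling " ".join() on token.text. The helper name join_preserving_whitespace and the default label set are hypothetical.

```python
def join_preserving_whitespace(tokens, labels=frozenset({"PERSON", "ORG"})):
    """Rebuild the text from the kept tokens, reusing the original spacing."""
    # text_with_ws carries the trailing whitespace from the input, so "$" and "6",
    # which had no space between them, are joined back together as "$6".
    kept = (tok.text_with_ws for tok in tokens if tok.ent_type_ not in labels)
    return "".join(kept).strip()
```

It would be called on a processed document, e.g. join_preserving_whitespace(sp(text)) with the sp pipeline loaded above. A removed token takes its trailing whitespace with it, so a removed name directly before punctuation can still leave a stray space.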