【Question title】: How can I fix this n-gram extractor in Python?
【Posted】: 2020-02-01 08:06:04
【Question】:

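I wrote an n-gram extractor that pulls organization names out of text. However, the program only returns the first letter of the first word followed by the last word. For example, if the phrase "Sprint International Corporation" appears in the text, the program returns "s corporation" as the n-gram. Do you know what I'm doing wrong? The code and output are posted below. Thanks.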

Here is the code for the n-gram extractor.

def org_ngram(classified_text):
    orgs = [c for c in classified_text if (c[1]=="ORGANIZATION")]
    #print(orgs)

    combined_orgs = []
    prev_org = False
    new_org = ("", "ORGANIZATION")
    for i in range(len(classified_text)):
        if classified_text[i][1] != "ORGANIZATION":
            prev_org = False
        else:
            if prev_org:
                new_org = new_org[0] + " " + classified_text[i][0].lower()
            else:
                combined_orgs.append(new_org)
                new_org = classified_text[i][0].lower()
            prev_org = True

    combined_orgs.append(new_org)
    combined_orgs = combined_orgs[1:]
    return combined_orgs

Here is the text I analyzed and the program I used to analyze it.

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

st = StanfordNERTagger('C:\\path\\english.all.3class.distsim.crf.ser.gz',
                       'C:\\Users\\path\\stanford-ner.jar',
                       encoding='utf-8')

text = "Trump met with representatives from Sprint International Corporation, Nike Inc, and Wal-Mart Company regarding the trade war."

tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
orgs = org_ngram(classified_text)

print(orgs)

This is the current output.

['s corporation', 'n inc', 'w company']

This is what I want the output to look like.

['sprint international corporation', 'nike inc', 'wal-mart company']

【Comments】:

    Tags: python nlp nltk n-gram


    【Solution 1】:

    First, avoid StanfordNERTagger; it is going to be deprecated soon. Use the CoreNLP interface instead, as described in "Stanford Parser and NLTK":

    >>> from nltk.parse import CoreNLPParser
    
    # Lexical Parser
    >>> parser = CoreNLPParser(url='http://localhost:9000')
    
    >>> ner_tagger = CoreNLPParser(url='http://localhost:9000', tagtype='ner')
    >>> list(ner_tagger.tag(('Rami Eid is studying at Stony Brook University in NY'.split())))
    [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'STATE_OR_PROVINCE')]
    

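    Once you have the list of (token, NER tag) tuples, the task is to collect runs of consecutive tokens that share the same tag. You can try the following solution, taken from an earlier answer: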

    from nltk import pos_tag
    from nltk.chunk import conlltags2tree
    from nltk.tree import Tree
    
    def stanfordNE2BIO(tagged_sent):
        bio_tagged_sent = []
        prev_tag = "O"
        for token, tag in tagged_sent:
            if tag == "O": #O
                bio_tagged_sent.append((token, tag))
                prev_tag = tag
                continue
            if tag != "O" and prev_tag == "O": # Begin NE
                bio_tagged_sent.append((token, "B-"+tag))
                prev_tag = tag
            elif prev_tag != "O" and prev_tag == tag: # Inside NE
                bio_tagged_sent.append((token, "I-"+tag))
                prev_tag = tag
            elif prev_tag != "O" and prev_tag != tag: # Adjacent NE
                bio_tagged_sent.append((token, "B-"+tag))
                prev_tag = tag
    
        return bio_tagged_sent
    
    
    def stanfordNE2tree(ne_tagged_sent):
        bio_tagged_sent = stanfordNE2BIO(ne_tagged_sent)
        sent_tokens, sent_ne_tags = zip(*bio_tagged_sent)
        sent_pos_tags = [pos for token, pos in pos_tag(sent_tokens)]
    
        sent_conlltags = [(token, pos, ne) for token, pos, ne in zip(sent_tokens, sent_pos_tags, sent_ne_tags)]
        ne_tree = conlltags2tree(sent_conlltags)
        return ne_tree
    
    def extract_ner(ne_tagged_sent):
        ne_tree = stanfordNE2tree(ne_tagged_sent)
    
        ne_in_sent = []
        for subtree in ne_tree:
            if isinstance(subtree, Tree):  # subtree is a named-entity chunk, i.e. NE != "O"
                ne_label = subtree.label()
                ne_string = " ".join([token for token, pos in subtree.leaves()])
                ne_in_sent.append((ne_string, ne_label))
        return ne_in_sent
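To make the intermediate representation concrete, here is a minimal sketch that reproduces only the BIO conversion step in plain Python, so it runs without NLTK. The name `to_bio` is hypothetical and the sentence is shortened for illustration; the branching is meant to mirror `stanfordNE2BIO` above, not replace it.

```python
def to_bio(tagged_sent):
    """Prefix each non-"O" tag with B- (entity start) or I- (continuation)."""
    bio = []
    prev_tag = "O"
    for token, tag in tagged_sent:
        if tag == "O":
            bio.append((token, tag))          # outside any entity
        elif tag == prev_tag:
            bio.append((token, "I-" + tag))   # same tag as previous token
        else:
            bio.append((token, "B-" + tag))   # new entity begins here
        prev_tag = tag
    return bio


# Hand-written tags used only for illustration.
tagged = [("Rami", "PERSON"), ("Eid", "PERSON"), ("is", "O"),
          ("Stony", "ORGANIZATION"), ("Brook", "ORGANIZATION"),
          ("University", "ORGANIZATION")]

for token, tag in to_bio(tagged):
    print(token, tag)
```

The B-/I- prefixes are what let `conlltags2tree` decide where one entity ends and the next begins, including two different entities that sit directly next to each other.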
    

    Then:

    ne_tagged_sent = [('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), 
    ('studying', 'O'), ('at', 'O'), ('Stony', 'ORGANIZATION'), 
    ('Brook', 'ORGANIZATION'), ('University', 'ORGANIZATION'), 
    ('in', 'O'), ('NY', 'LOCATION')]
    
    print(extract_ner(ne_tagged_sent))
    

    [out]:

    [('Rami Eid', 'PERSON'), ('Stony Brook University', 'ORGANIZATION'), ('NY', 'LOCATION')]
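As for why the function in the question returns 's corporation': `new_org` starts out as a tuple, but after the first ORGANIZATION token it is reassigned to a plain string. From then on `new_org[0]` is the first character of that string rather than the whole name, so every later token gets appended to a single letter. A minimal sketch of a fix, assuming the same `(token, tag)` input format, collects the tokens of each run in a list and joins them. The tagged list below is written by hand to stand in for the tagger output, which may tokenize or label the sentence differently.

```python
def org_ngram(classified_text):
    """Merge runs of consecutive ORGANIZATION tokens into lowercase phrases."""
    combined_orgs = []
    current = []  # tokens of the organization currently being built
    for token, tag in classified_text:
        if tag == "ORGANIZATION":
            current.append(token.lower())
        elif current:
            # A non-organization token ends the current run.
            combined_orgs.append(" ".join(current))
            current = []
    if current:  # flush a run that reaches the end of the input
        combined_orgs.append(" ".join(current))
    return combined_orgs


# Hand-written tags (an assumption, not real tagger output).
classified_text = [
    ("Trump", "PERSON"), ("met", "O"), ("with", "O"),
    ("representatives", "O"), ("from", "O"),
    ("Sprint", "ORGANIZATION"), ("International", "ORGANIZATION"),
    ("Corporation", "ORGANIZATION"), (",", "O"),
    ("Nike", "ORGANIZATION"), ("Inc", "ORGANIZATION"), (",", "O"),
    ("and", "O"),
    ("Wal-Mart", "ORGANIZATION"), ("Company", "ORGANIZATION"),
    ("regarding", "O"), ("the", "O"), ("trade", "O"), ("war", "O"),
    (".", "O"),
]

print(org_ngram(classified_text))
# ['sprint international corporation', 'nike inc', 'wal-mart company']
```

Building each name in a list also removes the need for the placeholder tuple and the `combined_orgs[1:]` slice in the original. Note that, like the original, this sketch merges two distinct organizations into one phrase if they appear back to back with no other token between them.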
    

    【Discussion】:
