Named Entity Recognition with Regular Expression: NLTK (answers)

【Question title】: Named Entity Recognition with Regular Expression: NLTK
【Posted】: 2014-08-15 10:02:21
【Question】:

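I have been playing with the NLTK toolkit. I run into this problem a lot and have searched online for a solution, but found no satisfying answer, so I am putting my question here.

Quite often the NER does not tag consecutive NNPs as one NE. I think editing the NER to use RegexpTagger could also improve it.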

Example:

Input:

Barack Obama is a great person.

Output:

Tree('S', [Tree('PERSON', [('Barack', 'NNP')]), Tree('ORGANIZATION', [('Obama', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), ('great', 'JJ'), ('person', 'NN'), ('.', '.')])

Whereas

Input:

Former Vice President Dick Cheney told conservative radio host Laura Ingraham that he "was honored" to be compared to Darth Vader while in office.

Output:

Tree('S', [('Former', 'JJ'), ('Vice', 'NNP'), ('President', 'NNP'), Tree('NE', [('Dick', 'NNP'), ('Cheney', 'NNP')]), ('told', 'VBD'), ('conservative', 'JJ'), ('radio', 'NN'), ('host', 'NN'), Tree('NE', [('Laura', 'NNP'), ('Ingraham', 'NNP')]), ('that', 'IN'), ('he', 'PRP'), ('``', '``'), ('was', 'VBD'), ('honored', 'VBN'), ("''", "''"), ('to', 'TO'), ('be', 'VB'), ('compared', 'VBN'), ('to', 'TO'), Tree('NE', [('Darth', 'NNP'), ('Vader', 'NNP')]), ('while', 'IN'), ('in', 'IN'), ('office', 'NN'), ('.', '.')])

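Here Vice/NNP, President/NNP, (Dick/NNP, Cheney/NNP) are extracted correctly.

So I think that if nltk.ne_chunk is run first and two consecutive trees are then both NNP, there is a high chance that they refer to one entity.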
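The idea of treating a run of consecutive NNP tags as one candidate entity can be sketched with the standard `re` module alone. This is not NLTK's RegexpTagger or RegexpParser; it is a minimal illustration that assumes the input is a list of `(token, tag)` pairs shaped like the output of `pos_tag`. The function name `nnp_runs` is made up for this sketch.

```python
import re

def nnp_runs(tagged):
    """Join each maximal run of consecutive NNP tokens into one candidate entity."""
    # One character per token, so regex match offsets are also token indices.
    signature = "".join("N" if tag == "NNP" else "x" for _, tag in tagged)
    return [
        " ".join(token for token, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"N+", signature)
    ]

# Tags copied from the start of the second example output in the question.
tagged = [
    ("Former", "JJ"), ("Vice", "NNP"), ("President", "NNP"),
    ("Dick", "NNP"), ("Cheney", "NNP"), ("told", "VBD"),
    ("conservative", "JJ"), ("radio", "NN"), ("host", "NN"),
    ("Laura", "NNP"), ("Ingraham", "NNP"), ("that", "IN"),
]
print(nnp_runs(tagged))  # ['Vice President Dick Cheney', 'Laura Ingraham']
```

Note that the title "Vice President" is merged into the name, which is one kind of flaw a pure tag-based rule can introduce.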

Any suggestion would be much appreciated. I am looking for flaws in my approach.

Thanks.

【Comments】:

Tags: regex nlp nltk named-entity-recognition

【Solution 1】:

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    # Tokenize, POS-tag, then NE-chunk; NE chunks come back as Tree nodes.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []  # merged entities, in order of first appearance
    current_chunk = []     # texts of the adjacent NE subtrees seen so far

    for i in chunked:
        if isinstance(i, Tree):
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            # A non-NE token ends the run: join the run into one entity.
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset even for a repeated entity, so it cannot leak into the next run.
            current_chunk = []

    # Flush a run that reaches the end of the sentence.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk

txt = "Barack Obama is a great person."
print(get_continuous_chunks(txt))

[out]:

['Barack Obama']

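But note that if the adjacent chunks are not supposed to be a single NE, you will be combining multiple NEs into one. I can't think of such an example off the top of my head, but I'm sure it happens. If the chunks are not adjacent, the script above works fine: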

>>> txt = "Barack Obama is the husband of Michelle Obama."  
>>> get_continuous_chunks(txt)
['Barack Obama', 'Michelle Obama']

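【Comments】:

Thanks for the nice code, but do you see any flaw in combining consecutive NNPs to give one named entity?
I can't think of an example off the top of my head, but I'm sure there will be consecutive NPs that are not supposed to be one NE.
Thanks for the answer. I think one class of possible examples would involve ditransitive verbs, e.g. "He quoted Michelle Barack Obama", although such cases must be very rare.
That sentence is a bit odd though ;P Maybe this is more natural: "He quoted Michelle and Barack Obama".
"Does Barack Obama do well?" returns "Does Barack Obama". How did you solve that?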

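【Solution 2】:

There is a bug in @alvas's answer: a fencepost error. Make sure to run that elif check once more outside the loop, so that you don't miss an NE that occurs at the end of the sentence. So: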

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if isinstance(i, Tree):
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Always start a fresh run, even if this entity was seen before.
            current_chunk = []
    # Same check once more after the loop: the last run may not be
    # followed by a non-NE token that would close it.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)
    return continuous_chunk

txt = "Barack Obama is a great person and so is Michelle Obama."
print(get_continuous_chunks(txt))
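One way to avoid the fencepost problem altogether is to let `itertools.groupby` find the runs, since it closes the last run by itself. The sketch below is an illustration rather than a drop-in replacement: `join_runs`, `is_chunk` and `text_of` are hypothetical names, and plain strings stand in for NE subtrees so the example runs without NLTK models.

```python
from itertools import groupby

def join_runs(items, is_chunk, text_of):
    """Join each maximal run of items for which is_chunk(item) is true."""
    merged = []
    for in_run, run in groupby(items, key=is_chunk):
        if in_run:
            name = " ".join(text_of(item) for item in run)
            if name not in merged:  # keep the first occurrence only
                merged.append(name)
    return merged

# Stand-in data: strings play the role of NE subtrees, None of ordinary tokens.
items = ["Barack", "Obama", None, None, "Michelle", "Obama"]
print(join_runs(items, lambda x: x is not None, str))
# ['Barack Obama', 'Michelle Obama']
```

With real chunker output, `is_chunk` would presumably be `lambda n: isinstance(n, Tree)` and `text_of` would join the tokens of `n.leaves()`.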

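【Comments】:

【Solution 3】:

@alvas great answer, it was really helpful. I have tried to capture your solution in a more functional way. It still needs improvement, though.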

    from nltk import ne_chunk, pos_tag, word_tokenize

    def conditions(tree_node):
        # Height 2 means a node whose children are all leaves, i.e. a flat chunk.
        # The label check skips the root 'S', which also has height 2 when a
        # sentence contains no NE at all (label() is the NLTK 3 accessor).
        return tree_node.height() == 2 and tree_node.label() != 'S'

    def continuous_entities(input_text, file_handle=None):
        # Note: Currently, the chunker categorizes only 2 'NNP' together.
        all_entities = []
        for line in input_text.split('\n'):
            chunked_data = ne_chunk(pos_tag(word_tokenize(line)))
            named_entities = [" ".join(token for token, pos in subtree.leaves())
                              for subtree in chunked_data.subtrees(filter=conditions)]

            # Dump all entities to file for now, we will see how to go about that
            if file_handle is not None and named_entities:
                file_handle.write('\n'.join(named_entities) + '\n')
            all_entities.extend(named_entities)
        # Return the entities of every line, not only those of the last one.
        return all_entities

Many more filtering conditions can be added through the conditions function.
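As a rough sketch of how several conditions could be stacked, small predicate builders can be combined into one filter. The helper names `all_of`, `has_height` and `has_label` are invented for this example; the only assumption about the nodes is that they expose `height()` and `label()`, as `nltk.tree.Tree` does in NLTK 3.

```python
def all_of(*predicates):
    """Build one predicate that passes only when every given predicate passes."""
    return lambda node: all(p(node) for p in predicates)

def has_height(n):
    """Predicate: the node's height() equals n."""
    return lambda node: node.height() == n

def has_label(*labels):
    """Predicate: the node's label() is one of the given labels."""
    return lambda node: node.label() in labels

# Hypothetical combined filter: flat chunks labelled PERSON or ORGANIZATION.
person_or_org = all_of(has_height(2), has_label("PERSON", "ORGANIZATION"))
```

The combined predicate can then be passed wherever a single filter function is expected, for example as the `filter` argument of `subtrees`.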

【Comments】: