Extracting nationalities and countries from text

【Question title】: Extracting nationalities and countries from text
【Posted】: 2016-10-19 14:02:11
【Question】:

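I want to use nltk to extract every mention of a country or nationality from a text. I used POS tagging to extract all tokens labeled GPE, but the results are not satisfactory.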

import nltk

abstract = "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO.Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with Graves' disease and TO (cases) and 147 patients with Graves' disease without TO (controls). "

# Tokenize, POS-tag, then run NLTK's named-entity chunker.
tokens = nltk.tokenize.wordpunct_tokenize(abstract)
tagged = nltk.pos_tag(tokens)
nes = nltk.ne_chunk(tagged)

# Keep only the chunks labeled GPE (geo-political entity).
places = []
for ne in nes:
    if isinstance(ne, nltk.tree.Tree) and ne.label() == 'GPE':
        places.append(' '.join(leaf[0] for leaf in ne.leaves()))

# Fall back to a placeholder only after the whole text has been scanned.
if not places:
    places.append("N/A")

The result is:

['Thyroid', 'Australian', 'Caucasian', 'Graves']

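Some of these are nationalities, and some are just nouns.

So am I doing something wrong, or is there another way to extract this kind of information?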

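【Question comments】:

You are not doing anything wrong. You performed entity extraction, then took the entity chunks and searched them for the GPE label. You are unhappy with the results because NLTK generally performs poorly at entity classification. Lookup tables for GPEs are available; they are comprehensive and very effective. Use them instead of relying on NLTK.
Thanks, can you give me an example of those lookup tables...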
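
To make the lookup-table suggestion concrete, here is a minimal sketch of a gazetteer lookup. The `GAZETTEER` dictionary and the `find_mentions` helper are hypothetical names made up for this illustration, and the six entries are a hand-written sample, not a real table; an actual table would need to cover every country and demonym.

```python
import re

# Illustrative, hand-written gazetteer (an assumption for this sketch);
# a real table would cover every country and demonym.
GAZETTEER = {
    "australia": ("Australia", "country"),
    "australian": ("Australia", "nationality"),
    "france": ("France", "country"),
    "french": ("France", "nationality"),
    "new zealand": ("New Zealand", "country"),
    "new zealander": ("New Zealand", "nationality"),
}

# Longest entries first so "new zealander" is tried before "new zealand".
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(key) for key in sorted(GAZETTEER, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_mentions(text):
    """Return (matched text, canonical country, kind) for each gazetteer hit."""
    mentions = []
    for match in _PATTERN.finditer(text):
        country, kind = GAZETTEER[match.group(1).lower()]
        mentions.append((match.group(1), country, kind))
    return mentions


sample = "pooled DNA from an Australian Caucasian discovery cohort"
print(find_mentions(sample))  # [('Australian', 'Australia', 'nationality')]
```

Because only listed names can match, words such as "Thyroid" or "Graves" are never returned. The trade-off is that case-insensitive matching can over-match ordinary words, and anything missing from the table is silently skipped.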

Tags: python nlp nltk pos-tagger

【Solution 1】:

You can use spaCy for NER. It gives better results than NLTK.

import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(u"Apple is opening its first big office in San Francisco and California.")
print([(ent.text, ent.label_) for ent in doc.ents])

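【Comments】:

【Solution 2】:

Here is geograpy, which uses NLTK to perform entity extraction. It stores all places and locations as a gazetteer, then performs a lookup on that gazetteer to fetch the relevant places and locations. See the documentation for more usage details: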

from geograpy import extraction

# Adjacent string literals are joined, so the text stays a single string.
text = (
    "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital "
    "inflammation that can lead to disfigurement and blindness. "
    "Multiple genetic loci have been associated with Graves' disease, but the genetic "
    "basis for TO is largely unknown. This study aimed to identify loci associated with "
    "TO in individuals with Graves' disease, using a genome-wide association scan "
    "(GWAS) for the first time to our knowledge in TO.Genome-wide association scan was "
    "performed on pooled DNA from an Australian Caucasian discovery cohort of 265 "
    "participants with Graves' disease and TO (cases) and 147 patients with Graves' "
    "disease without TO (controls)."
)

e = extraction.Extractor(text=text)
e.find_entities()
# places is a list attribute filled in by find_entities(), not a method.
print(e.places)

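【Comments】:

I actually tried to install geograpy but failed. That is why I rely on nltk.
Same problem here, I cannot install geograpy :(
Install NLTK before installing geograpy; alternatively, you can pip install geograpy-nltk
For geograpy, this worked for me: stackoverflow.com/questions/31172719/…
Old, but for python3 - pip3 install geograpy3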

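【Solution 3】:

So, after the fruitful comments, I dug deeper into different NER tools to find the best one for recognizing mentions of nationalities and countries, and found that spaCy has a NORP entity type that extracts nationalities effectively. https://spacy.io/docs/usage/entity-recognition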
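
As a rough sketch of how the NORP idea could be applied, the snippet below keeps only the NORP entities (nationalities and religious or political groups) and the GPE entities (countries, cities, states) from spaCy's output. The helper names `group_by_label` and `extract_with_spacy` are made up for this illustration, and the second one assumes spaCy and the `en_core_web_sm` model are installed.

```python
# spaCy label meanings: NORP covers nationalities and religious or political
# groups; GPE covers countries, cities and states.
WANTED_LABELS = ("NORP", "GPE")


def group_by_label(entities, wanted=WANTED_LABELS):
    """Group (text, label) pairs by label, keeping only the wanted labels."""
    grouped = {label: [] for label in wanted}
    for text, label in entities:
        # Skip unwanted labels and repeated mentions of the same text.
        if label in grouped and text not in grouped[label]:
            grouped[label].append(text)
    return grouped


def extract_with_spacy(text, model="en_core_web_sm"):
    """Run spaCy NER on text; assumes spaCy and the model are installed."""
    import spacy  # imported here so group_by_label works without spaCy

    nlp = spacy.load(model)
    doc = nlp(text)
    return group_by_label((ent.text, ent.label_) for ent in doc.ents)
```

Calling `extract_with_spacy(abstract)` on the abstract from the question returns a dictionary with one list per label. What ends up in each list depends on the model version, so no expected output is shown here.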

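【Comments】:

spaCy is great and very powerful. I would also suggest trying Alchemy API. For big data, though, spaCy is the better choice because it does not impose a transaction cost on every query and result.
As we know, spaCy tags locations as {GPE}. In my case, I have two locations tagged as GPE (e.g. India, Delhi). My goal now is to determine which one is the city and which is the country. Please comment @Renaud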
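
On the city-versus-country question: the GPE label alone does not separate the two, so one possible approach (a sketch, not something confirmed in this thread) is to check each GPE string against a list of country names. `COUNTRY_NAMES` and `split_gpe` are hypothetical names, and the set holds only a tiny illustrative sample.

```python
# Illustrative subset only (an assumption for this sketch); a real list
# would hold every country name and its common aliases.
COUNTRY_NAMES = {"india", "germany", "australia"}


def split_gpe(gpe_texts, country_names=COUNTRY_NAMES):
    """Split GPE strings into (countries, other places such as cities or states)."""
    countries, others = [], []
    for name in gpe_texts:
        if name.strip().lower() in country_names:
            countries.append(name)
        else:
            others.append(name)
    return countries, others


print(split_gpe(["India", "Delhi"]))  # (['India'], ['Delhi'])
```

Anything that is not a known country lands in the second list, so that list can contain cities, states, or names the country list simply does not cover.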

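【Solution 4】:

If you want to extract country names, what you need is an NER tagger, not a POS tagger.

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate elements in text and classify them into predefined categories such as names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc.

Check out the Stanford NER tagger!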

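from nltk.tag.stanford import StanfordNERTagger  # named NERTagger in older NLTK releases

# Paths to the Stanford NER model and jar; adjust them to your local download.
st = StanfordNERTagger('../ner-model.ser.gz', '../stanford-ner.jar')
# text is the input string to tag, e.g. the abstract from the question.
tagging = st.tag(text.split())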
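
The tagger returns one (token, tag) pair per token, so the location names still have to be pulled out of that list. Here is a minimal sketch, assuming the model emits a `LOCATION` tag as the standard English Stanford models do; `collect_locations` is a hypothetical helper and the sample list is hand-written, not real tagger output.

```python
from itertools import groupby


def collect_locations(tagged_tokens, label="LOCATION"):
    """Join runs of consecutive tokens carrying `label` into phrases."""
    phrases = []
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == label:
            phrases.append(" ".join(token for token, _ in run))
    return phrases


# Hand-written sample in the (token, tag) shape that st.tag() returns;
# it is not real tagger output.
sample = [("cohort", "O"), ("from", "O"), ("New", "LOCATION"),
          ("Zealand", "LOCATION"), ("and", "O"), ("Australia", "LOCATION")]
print(collect_locations(sample))  # ['New Zealand', 'Australia']
```

With the code above this would be called as `collect_locations(tagging)`. Note that two different locations that appear directly next to each other are merged into one phrase by this grouping.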

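【Comments】:

He is already doing entity extraction!! Perhaps without realizing it.
Your answer only gives him a list of classified words. You did not even give him a list of GPEs. Please edit your answer.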