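【Posted】: 2016-10-19 14:02:11
【Question】:
I want to use nltk to extract every mention of a country or nationality from a text. I POS-tag the tokens, run the named-entity chunker, and keep every chunk labeled GPE, but the results are not satisfactory.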
import nltk  # needs the NLTK POS tagger and NE chunker models installed

abstract = "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO.Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with Graves' disease and TO (cases) and 147 patients with Graves' disease without TO (controls). "

# Tokenize, POS-tag, then run the named-entity chunker.
sent = nltk.tokenize.wordpunct_tokenize(abstract)
pos_tag = nltk.pos_tag(sent)
nes = nltk.ne_chunk(pos_tag)

# Keep the text of every chunk labeled GPE (geo-political entity).
places = []
for ne in nes:
    if isinstance(ne, nltk.Tree) and ne.label() == 'GPE':
        places.append(u' '.join(i[0] for i in ne.leaves()))

# Fall back to a placeholder when no GPE chunk was found.
if len(places) == 0:
    places.append("N/A")
The result is:
['Thyroid', 'Australian', 'Caucasian', 'Graves']
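Some of these are nationalities, but others are just ordinary nouns.
So am I doing something wrong, or is there another way to extract this kind of information?
【Comments】: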
-
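You did nothing wrong. You ran entity extraction, then took the entity chunks and searched them for the GPE label. The reason you are unhappy with the results is that NLTK generally does a poor job of classifying entities. Lookup tables for GPEs are available; they are comprehensive and work very well. Use them instead of relying on NLTK.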
-
Thanks, could you give me an example of such lookup tables...
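The lookup-table idea from the comment above can be sketched roughly as follows. This is only an illustration, not the commenter's own method: the `GAZETTEER` dictionary and the `find_places` helper are hypothetical names, and the table holds a handful of hand-picked entries. A real table would need a much fuller list of country names and nationality adjectives from an external source.

```python
import re

# Hypothetical, deliberately tiny lookup table: surface form -> country.
# A real table would be loaded from a comprehensive external list.
GAZETTEER = {
    "australia": "Australia",
    "australian": "Australia",
    "china": "China",
    "chinese": "China",
    "united states": "United States",
    "american": "United States",
    "new zealand": "New Zealand",
}

# Try longer entries first so multi-word names are matched as a whole.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(GAZETTEER, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def find_places(text):
    """Return (matched text, country) pairs in order of appearance."""
    return [
        (m.group(0), GAZETTEER[m.group(0).lower()])
        for m in _PATTERN.finditer(text)
    ]


sample = ("pooled DNA from an Australian Caucasian discovery cohort "
          "of 265 participants with Graves' disease")
print(find_places(sample))  # [('Australian', 'Australia')]

# The same table can also filter the GPE candidates NLTK returned.
candidates = ['Thyroid', 'Australian', 'Caucasian', 'Graves']
print([c for c in candidates if c.lower() in GAZETTEER])  # ['Australian']
```

Matching on whole words against a fixed table trades recall for precision: 'Thyroid' and 'Graves' can no longer slip through, but any name missing from the table is silently skipped.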
Tags: python nlp nltk pos-tagger