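【Question title】: Extract Dates, Persons and Locations from Latin and English text
【Posted】: 2025-11-26 13:50:01
【Question】:

I have some texts that I have been experimenting with, each consisting of Latin content preceded by an English summary. I am trying to run NER on both texts to extract dates, locations and persons. I started with the English part, thinking it would be easier, and used chunking. The dates are not recognized, and not all entities are captured. Is there a way to customise the output to make it more accurate? Here is a sample of my code: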

text = 'Thursday, 3 September 1467. The Jew Azar Nifusi leases his fields called Ta Xellula and Gnien Hagem in the district of Dejr is-Safsaf for ten years to Nicolaus Delia and his son Lemus for the price of eight salme of wheat each harvest-time. The tenants also bind themselves to give Nifusi each year ten salme of brushwood and two salme of straw. On his part the Jew promised to build a surrounding wall for the fields at his own expense.'
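import nltk

# extract_entity_names is not shown in the original post; this is an assumed helper
# that collects the words of every 'NE' chunk produced with binary=True.
def extract_entity_names(tree):
    if not hasattr(tree, 'label'):
        return []  # a plain (word, tag) token, not a chunk
    if tree.label() == 'NE':
        return [' '.join(word for word, tag in tree)]
    return [name for child in tree for name in extract_entity_names(child)]
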
sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary = True)

entity_names = []
for tree in chunked_sentences:
    entity_names.extend(extract_entity_names(tree))
print(set(entity_names))

This is the output I get:

set(['Nicolaus Delia', 'Gnien Hagem', 'Dejr', 'Nifusi', 'Jew'])

I expected to extract at least the date, Jew, Azar Nifusi, Ta Xellula, Gnien Hagem, Dejr is-Safsaf, Nicolaus Delia and Lemus. Can anyone help?

【Discussion】:

    Tags: python nltk named-entity-recognition


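    【Solution 1】:

    With this line of code you can get the dates and the other information. The result is in tree format, but I assume you can extract the content into a cleaner format yourself afterwards.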

    from nltk import ne_chunk, pos_tag, word_tokenize
    ne_chunk(pos_tag(word_tokenize(text)))
    

    Output:

    Tree('S', [('Thursday', 'NNP'), (',', ','), ('3', 'CD'), ('September', 'NNP'), ('1467', 'CD'), ('.', '.'), ('The', 'DT'), Tree('ORGANIZATION', [('Jew', 'NNP'), ('Azar', 'NNP'), ('Nifusi', 'NNP')]), ('leases', 'VBZ'), ('his', 'PRP$'), ('fields', 'NNS'), ('called', 'VBD'), Tree('PERSON', [('Ta', 'NNP'), ('Xellula', 'NNP')]), ('and', 'CC'), Tree('PERSON', [('Gnien', 'NNP'), ('Hagem', 'NNP')]), ('in', 'IN'), ('the', 'DT'), ('district', 'NN'), ('of', 'IN'), Tree('GPE', [('Dejr', 'NNP')]), ('is-Safsaf', 'NN'), ('for', 'IN'), ('ten', 'JJ'), ('years', 'NNS'), ('to', 'TO'), Tree('PERSON', [('Nicolaus', 'NNP'), ('Delia', 'NNP')]), ('and', 'CC'), ('his', 'PRP$'), ('son', 'NN'), Tree('PERSON', [('Lemus', 'NNP')]), ('for', 'IN'), ('the', 'DT'), ('price', 'NN'), ('of', 'IN'), ('eight', 'CD'), ('salme', 'NNS'), ('of', 'IN'), ('wheat', 'NN'), ('each', 'DT'), ('harvest-time', 'NN'), ('.', '.'), ('The', 'DT'), ('tenants', 'NNS'), ('also', 'RB'), ('bind', 'VBP'), ('themselves', 'PRP'), ('to', 'TO'), ('give', 'VB'), Tree('PERSON', [('Nifusi', 'NNP')]), ('each', 'DT'), ('year', 'NN'), ('ten', 'RB'), ('salme', 'NN'), ('of', 'IN'), ('brushwood', 'NN'), ('and', 'CC'), ('two', 'CD'), ('salme', 'NN'), ('of', 'IN'), ('straw', 'NN'), ('.', '.'), ('On', 'IN'), ('his', 'PRP$'), ('part', 'NN'), ('the', 'DT'), Tree('ORGANIZATION', [('Jew', 'NNP')]), ('promised', 'VBD'), ('to', 'TO'), ('build', 'VB'), ('a', 'DT'), ('surrounding', 'VBG'), ('wall', 'NN'), ('for', 'IN'), ('the', 'DT'), ('fields', 'NNS'), ('at', 'IN'), ('his', 'PRP$'), ('own', 'JJ'), ('expense', 'NN'), ('.', '.')])
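As a rough illustration of getting from the tree to a cleaner format, the sketch below flattens the labelled chunks into (label, text) pairs. In the output above the date tokens are only POS-tagged, not chunked, so the sketch also picks the date out of the raw text with a regular expression. The names `Node`, `flatten_entities`, `find_dates` and `DATE_RE` are made up for this example and are not part of NLTK. `Node` is a small stand-in so the snippet runs without NLTK models; `nltk.Tree` is likewise a list of children with a `label()` method, so the same function should accept the real result of `ne_chunk`.

```python
import re


class Node(list):
    """Stand-in shaped like nltk.Tree: a list of children plus a label() method."""

    def __init__(self, label, children):
        super().__init__(children)
        self._label = label

    def label(self):
        return self._label


def flatten_entities(tree):
    """Return (label, text) pairs for each labelled chunk directly under the root."""
    entities = []
    for child in tree:
        # Chunks expose label(); plain tokens are (word, tag) tuples and do not.
        if hasattr(child, "label"):
            text = " ".join(word for word, _tag in child)
            entities.append((child.label(), text))
    return entities


# Assumed date shape: optional weekday, day number, English month name, year.
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_RE = re.compile(
    r"\b(?:(?:Mon|Tues|Wednes|Thurs|Fri|Satur|Sun)day,\s+)?"
    r"\d{1,2}\s+(?:" + MONTHS + r")\s+\d{3,4}\b"
)


def find_dates(text):
    """Return every substring of text that matches the assumed date shape."""
    return [match.group(0) for match in DATE_RE.finditer(text)]


# A hand-built fragment of the tree shown above.
tree = Node("S", [
    ("The", "DT"),
    Node("ORGANIZATION", [("Jew", "NNP"), ("Azar", "NNP"), ("Nifusi", "NNP")]),
    ("leases", "VBZ"),
    Node("PERSON", [("Ta", "NNP"), ("Xellula", "NNP")]),
    Node("GPE", [("Dejr", "NNP")]),
])
print(flatten_entities(tree))
print(find_dates("Thursday, 3 September 1467. The Jew Azar Nifusi leases his fields."))
```

The regular expression only covers dates written like the one in the sample summary; other layouts (or Latin dates) would need their own patterns.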
    

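    【Comments】:

    • In this case, isn't a tree created for every sentence? I tried using IOB tags together with this tutorial, nlpforhackers.io/named-entity-extraction, but the results were very weak.
    • Yes, a tree is created. However, you can always extract the information you need from the tree. What do you mean by weak results?
    • For example: Jew Azar Nifusi should be just Azar Nifusi, because that is a first name and a surname, not an organization. Or Ta' Xellula is a location, not a person. How would you clean up a tree like this and make it more accurate?
    • Have you considered using SpaCy? I usually use it for NER. I posted the NLTK results here because your question is about NLTK. For my thesis I used SpaCy and was very happy with the results.
    • Not yet... I will give it a try, maybe it will give me better results. I thought I was doing something wrong! Thanks for your help.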
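One possible way to do the clean-up asked about in the comments is a small rule-based pass over the extracted (label, text) pairs. The sketch below is only that: `DESCRIPTORS`, `GAZETTEER` and `clean_entities` are hypothetical names, the rule tables are filled in by hand from the sample text, and treating whatever follows a descriptor as a personal name is an assumption that may not hold for other documents.

```python
# Hypothetical rule tables built by hand from the sample summary; extend per corpus.
DESCRIPTORS = {"Jew"}  # leading words that describe a person rather than name them
GAZETTEER = {          # known place names and the label they should carry
    "Ta Xellula": "LOCATION",
    "Gnien Hagem": "LOCATION",
}


def clean_entities(entities):
    """Apply the hand-written rules to a list of (label, text) pairs."""
    cleaned = []
    for label, text in entities:
        words = text.split()
        # Drop leading descriptors, e.g. "Jew Azar Nifusi" -> "Azar Nifusi".
        while len(words) > 1 and words[0] in DESCRIPTORS:
            words = words[1:]
            label = "PERSON"  # assumption: a personal name follows the descriptor
        text = " ".join(words)
        # A gazetteer hit overrides whatever label the chunker assigned.
        label = GAZETTEER.get(text, label)
        cleaned.append((label, text))
    return cleaned


raw = [
    ("ORGANIZATION", "Jew Azar Nifusi"),
    ("PERSON", "Ta Xellula"),
    ("PERSON", "Nicolaus Delia"),
]
for label, text in clean_entities(raw):
    print(label, text)
```

A lone "Jew" chunk is left untouched by these rules, since the question lists it among the entities to keep. Hand-written tables like these do not generalise, which is one reason a different model, as suggested in the comments, may be worth trying.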