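【Posted】: 2025-11-26 13:50:01
【Question】:
I have some text I have been playing with, in which an English summary sits above the Latin content. I am trying to run NER on both texts to extract dates, locations, and people. I started with the English part, thinking it would be easier, and used chunking. The date is not recognized, and not all entities are captured. Is there a way to customize the output to make it more accurate? Here is a sample of my code: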
text = 'Thursday, 3 September 1467. The Jew Azar Nifusi leases his fields called Ta Xellula and Gnien Hagem in the district of Dejr is-Safsaf for ten years to Nicolaus Delia and his son Lemus for the price of eight salme of wheat each harvest-time. The tenants also bind themselves to give Nifusi each year ten salme of brushwood and two salme of straw. On his part the Jew promised to build a surrounding wall for the fields at his own expense.'
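import nltk

def extract_entity_names(tree):
    # Helper not shown in the original post; assumed here to collect the
    # text of every subtree that the chunker labelled 'NE'.
    return [' '.join(token for token, tag in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == 'NE']

sentences = nltk.sent_tokenize(text)
tokenized_sentences = [nltk.word_tokenize(sentence) for sentence in sentences]
tagged_sentences = [nltk.pos_tag(sentence) for sentence in tokenized_sentences]
# binary=True marks every entity with the single label 'NE' (no PERSON/GPE/... types)
chunked_sentences = nltk.ne_chunk_sents(tagged_sentences, binary=True)
entity_names = []
for tree in chunked_sentences:
    entity_names.extend(extract_entity_names(tree))
print(set(entity_names))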
This is the output I get:
set(['Nicolaus Delia', 'Gnien Hagem', 'Dejr', 'Nifusi', 'Jew'])
I expected to extract at least the date, Jew, Azar Nifusi, Ta Xellula, Gnien Hagem, Dejr is-Safsaf, Nicolaus Delia and Lemus. Any help, please?
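For the missing date specifically, one possible workaround is to match it with a regular expression instead of relying on the chunker. The following is only a minimal sketch using the standard `re` module: `find_dates`, `DATE_RE`, and the assumption that every date follows the "weekday, day month year" pattern of the sample text are illustrative choices, not part of NLTK and not something the summaries are known to guarantee.

```python
import re

# Assumption: dates look like "3 September 1467", optionally preceded by a
# weekday and a comma ("Thursday, 3 September 1467").
MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
WEEKDAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"
DATE_RE = re.compile(
    r"\b(?:(?:%s),\s+)?\d{1,2}\s+(?:%s)\s+\d{3,4}\b" % (WEEKDAYS, MONTHS)
)

def find_dates(text):
    """Return every substring of text that matches DATE_RE, in order."""
    return [match.group(0) for match in DATE_RE.finditer(text)]

sample = "Thursday, 3 September 1467. The Jew Azar Nifusi leases his fields."
print(find_dates(sample))  # ['Thursday, 3 September 1467']
```

The matched dates could then be merged with the chunker's entity names; dates written in any other layout would need additional patterns.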
【Discussion】:
Tags: python nltk named-entity-recognition