【Question title】: Why does Python NLTK not tag Spanish text correctly?
【Posted】: 2020-04-24 00:21:22
【Question description】:

I have the following code:

import nltk
from nltk import word_tokenize

sent = 'El gato está bajo la mesa de cristal.'
nltk.pos_tag(word_tokenize(sent), lang='spa')

But the output is not accurate at all:

[('El', 'NNP'),
 ('gato', 'NN'),
 ('está', 'NN'),
 ('bajo', 'NN'),
 ('la', 'FW'),
 ('mesa', 'FW'),
 ('de', 'FW'),
 ('cristal', 'NN'),
 ('.', '.')]

For example, está should be classified as a verb.

If I try an English sentence instead:

import nltk
from nltk import word_tokenize

sent = 'The cat is under the cristal table.'
nltk.pos_tag(word_tokenize(sent), lang='spa')

everything works fine:

[('The', 'DT'),
 ('cat', 'NN'),
 ('is', 'VBZ'),
 ('under', 'IN'),
 ('the', 'DT'),
 ('cristal', 'NN'),
 ('table', 'NN'),
 ('.', '.')]

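Note that I have already downloaded all the NLTK resources. What am I missing that keeps part-of-speech tagging from working on Spanish?

【Question discussion】:

  • NLTK has no Spanish model for POS tagging.

Tags: python machine-learning nlp nltk tokenize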


【Solution 1】:

I found the following solution:

import os
from nltk.tag import StanfordPOSTagger

# Local paths to the Stanford POS Tagger jar and its Spanish model
jar = 'D:/Downloads/stanford-postagger-full-2018-10-16/stanford-postagger-3.9.2.jar'
model = 'D:/Downloads/stanford-postagger-full-2018-10-16/models/spanish.tagger'

# NLTK uses the JAVAHOME environment variable to locate the Java binary
java_path = "C:/Program Files/Java/jre1.8.0_191/bin/java.exe"
os.environ['JAVAHOME'] = java_path

pos_tagger = StanfordPOSTagger(model, jar, encoding='utf8')
pos_tagger.tag('El gato está bajo la mesa de cristal'.split())

Result:

[('El', 'da0000'),
 ('gato', 'nc0s000'),
 ('está', 'vmip000'),
 ('bajo', 'sp000'),
 ('la', 'da0000'),
 ('mesa', 'nc0s000'),
 ('de', 'sp000'),
 ('cristal', 'nc0s000')]

【Discussion】:

  • How do you interpret the POS codes? Is there a list somewhere?
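
On the question of reading the POS codes: the tags in the result above look like EAGLES-style codes (the convention used by the AnCora corpus, which Stanford's Spanish models follow in simplified form), where the first letter gives the broad category and the remaining positions encode finer attributes. Below is a minimal sketch of a decoder for that first letter. The table and the `coarse_category` helper are illustrative assumptions, not part of NLTK or the Stanford tagger, and the full tag list should be checked against the tagger's own documentation.

```python
# Coarse categories keyed by the first letter of an EAGLES-style tag.
# Assumption: the tagger emits (simplified) EAGLES/AnCora tags.
EAGLES_CATEGORY = {
    "a": "adjective",
    "c": "conjunction",
    "d": "determiner",
    "f": "punctuation",
    "i": "interjection",
    "n": "noun",
    "p": "pronoun",
    "r": "adverb",
    "s": "adposition",
    "v": "verb",
    "w": "date",
    "z": "numeral",
}


def coarse_category(tag):
    """Map an EAGLES-style tag to a coarse part of speech (hypothetical helper)."""
    if not tag:
        return "unknown"
    return EAGLES_CATEGORY.get(tag[0].lower(), "unknown")


# Tagged tokens copied from the result shown in this answer.
tagged = [
    ("El", "da0000"),
    ("gato", "nc0s000"),
    ("está", "vmip000"),
    ("bajo", "sp000"),
]

for word, tag in tagged:
    print(word, tag, coarse_category(tag))
```

Only the first character is decoded here. The later positions (for example the `mip` in `vmip000`) carry subtype, mood, and tense information, which this sketch deliberately leaves untouched.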
【Solution 2】:

Try this:

import stanfordnlp
MODELS_DIR = '.'
stanfordnlp.download('es', MODELS_DIR)  # Download the Spanish models
# Build the pipeline and set the part-of-speech processor's batch size
nlp = stanfordnlp.Pipeline(processors='tokenize,pos', models_dir=MODELS_DIR,
                           treebank='es_ancora', use_gpu=True, pos_batch_size=3000)
doc = nlp("Tu frase en español.")  # Run the pipeline on the input text
doc.sentences[0].print_tokens()  # Look at the result

【Discussion】:
