KeyError("word '%s' not in vocabulary" % word) - Answers

【Question title】: KeyError("word '%s' not in vocabulary" % word)
【Posted】: 2019-09-19 06:32:08
【Question description】:

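I convert the labels predicted for my images into a list, all_tags, then split them and finally store them in word_list, where all the labels are kept in a sentence-like structure.

All I want to do is use Google's pretrained Word2Vec model (https://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/) to generate and print the Word2Vec values for all of my predicted labels. I imported and mapped the model's pretrained weights, but I get the error

KeyError: "word '['cliff'' not in vocabulary"

However, the word "cliff" can be found in the dictionary. Any insights would be greatly appreciated. Please see the code snippets below for reference.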

import os

import numpy as np
from PIL import Image
from imageai.Prediction import ImagePrediction

execution_path = os.getcwd()
TEST_PATH = '/home/guest/Documents/Aikomi'


prediction = ImagePrediction()
prediction.setModelTypeAsDenseNet()
prediction.setModelPath(os.path.join(execution_path, "/home/guest/Documents/Test1/ImageAI-master/imageai/Prediction/Weights/DenseNet.h5"))
prediction.loadModel()

pred_array = np.empty((0,6), dtype=object)

predictions, probabilities = prediction.predictImage(os.path.join(execution_path, "1.jpg"), result_count=5)

for img in os.listdir(TEST_PATH):
    if img.endswith('.jpg'):
        image = Image.open(os.path.join(TEST_PATH, img))
        image = image.convert("RGB")
        image = np.array(image, dtype=np.uint8)
        predictions, probabilities = prediction.predictImage(os.path.join(TEST_PATH, img), result_count=5)
        temprow = np.zeros((1,pred_array.shape[1]),dtype=object)
        temprow[0,0] = img
        for i in range(len(predictions)):
            temprow[0,i+1] = predictions[i]
        pred_array = np.append(pred_array, temprow, axis=0)


all_tags = list(pred_array[:,1:].reshape(1,-1))
_in_sent = ' '.join(list(map(str, all_tags)))


import gensim
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
import re
import random
import nltk
nltk.download('punkt')


word_list = _in_sent.split() 

from gensim.corpora.dictionary import Dictionary

# be sure to split sentence before feed into Dictionary
word_list_2 = [d.split() for d in word_list]
dictionary = Dictionary(word_list_2)
print("\n", dictionary, "\n")

corpus_bow = [dictionary.doc2bow(doc) for doc in word_list_2]

model = Word2Vec(word_list_2, min_count= 1)
model = gensim.models.KeyedVectors.load_word2vec_format('/home/guest/Downloads/Google.bin', binary=True)

print(*map(model.most_similar, word_list))

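【Question discussion】:

As @imtinan-azhar points out in their answer, your error is actually reporting that the word "['cliff'" is not in the word vectors, which is not surprising (even though the word "cliff" is present). Also, you should show the full error stack for any error you want solved, so that the exact line involved is highlighted. Finally, the line model = Word2Vec(word_list_2, min_count= 1) contributes nothing to your final result, because on the very next line you assign something different (a set of loaded Google vectors) to the model variable, discarding whatever Word2Vec model you had just created.

Tags: python machine-learning deep-learning nlp word2vec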

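【Solution 1】:

The answer is right there, in what you wrote. The message template is

KeyError("word '%s' not in vocabulary" % word)

and the error is

KeyError: "word '['cliff'' not in vocabulary"

The content of the variable word is printed between the two ' characters, so the word variable holds the string ['cliff' rather than the string cliff.

Remove punctuation such as ' and [ ] from the text before looking the words up.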
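
The cleanup described above can be sketched as follows. This is a minimal illustration, not the asker's actual data: the token values are made up to resemble what the error message shows, and the name `non_letters` is hypothetical. It assumes the cleanup runs after `word_list` has been produced by the split, since a later reassignment of `word_list` would undo it.

```python
import re

# Tokens as they might look after str()-ing an array and splitting on
# whitespace (illustrative values only, not the asker's real data).
word_list = ["['cliff'", "'valley'", "'lakeside'", "'seashore']"]

# Drop every character that is not an ASCII letter: brackets, quotes, commas.
non_letters = re.compile(r"[^a-zA-Z]")
cleaned = [non_letters.sub("", w) for w in word_list]

# Discard tokens that became empty after cleaning.
cleaned = [w for w in cleaned if w]

print(cleaned)
```

Note that this pattern also removes digits, underscores, and hyphens, so a label written with an underscore would be merged into a single word; whether that is acceptable depends on the labels the classifier returns.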

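【Discussion】:

But if I print all_tags, the word is clearly there.
I understand, but the word used to query the dictionary is not cliff; it has a [ and a ' in front, so it is ['cliff, and that is why the lookup fails. Clean the word variable so that it contains only letters and it will work.
Try this: add these lines to your code before querying the dictionary: regex = re.compile('[^a-zA-Z]') word_list = [regex.sub('',w) for w in word_list]
It shows Dictionary(522 unique tokens: ['cliff', 'valley', 'lakeside', 'seashore', 'promontory']...). After adding your snippet the result is the same. regex = re.compile('[^a-zA-Z]') word_list = [regex.sub('',w) for w in word_list] word_list = _in_sent.split() from gensim.corpora.dictionary import Dictionary
Could you print the variable word_list and show it here?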
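
Printing word_list would likely show where the stray characters come from. The sketch below reproduces the effect with a small, made-up stand-in for `pred_array` (the file names and labels are assumptions): calling `list()` on a `(1, N)` array yields a single element, the whole row, so `str()` renders the array's printed form, brackets and quotes included. Flattening to a plain Python list avoids the string round-trip altogether. It is offered as one possible alternative to the regex cleanup, not as the thread's accepted fix.

```python
import numpy as np

# Hypothetical stand-in for pred_array: column 0 is the file name,
# the remaining columns are predicted labels (values are made up).
pred_array = np.array(
    [["1.jpg", "cliff", "valley"],
     ["2.jpg", "lakeside", "seashore"]],
    dtype=object,
)

# What the question does: list() over a (1, N) array gives ONE element,
# so str() turns the entire row into its printed representation.
all_tags = list(pred_array[:, 1:].reshape(1, -1))
broken_tokens = " ".join(map(str, all_tags)).split()
print(broken_tokens[0])  # the first token still carries the leading "['"

# Alternative: flatten to a plain list of label strings, no str() of an array.
clean_tokens = [str(tag) for tag in pred_array[:, 1:].ravel().tolist()]
print(clean_tokens)
```

In the pasted snippet from the comments, the regex lines appear before `word_list = _in_sent.split()`, which would overwrite the cleaned list; that ordering may be why the result did not change.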