Word2Vec in Gensim using model.most_similar

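【Question title】: Word2Vec in Gensim using model.most_similar
【Posted】: 2017-09-07 10:34:23
【Question description】:

I am new to Word2Vec in Gensim. I want to build a Word2Vec model for a text (taken from Wikipedia: Machine learning) and find the words most similar to "machine learning".

My current code is as follows.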

# import modules & set up logging
from gensim.models import Word2Vec

sentences = "Machine learning is the subfield of computer science that, according to Arthur Samuel, gives computers the ability to learn without being explicitly programmed.[1][2][verify] Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term machine learning in 1959 while at IBM. Evolved from the study of pattern recognition and computational learning theory in artificial intelligence,[3] machine learning explores the study and construction of algorithms that can learn from and make predictions on data[4] – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions,[5]:2 through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders or malicious insiders working towards a data breach,[6] optical character recognition (OCR),[7] learning to rank, and computer vision."
# train word2vec on the sentences
model = Word2Vec(sentences, min_count=1)
vocab = list(model.wv.vocab.keys())
print(vocab[:10])

However, the vocabulary I get back consists of single characters.

['M', 'a', 'c', 'h', 'i', 'n', 'e', ' ', 'l', 'r']

Please help me get the most similar words using model.most_similar.

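【Question discussion】:

Tags: python gensim word2vec

【Solution 1】:

The Word2Vec class expects its sentences corpus to be an iterable source of individual items, where each item is a list of word tokens.

You are supplying a single string. When Word2Vec iterates over it, it gets individual characters. When it then tries to interpret each of those characters as a list of tokens, it still gets just that one character, so the only "words" it sees are single characters.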
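The behaviour described above can be reproduced in plain Python, without Gensim. This is only a minimal sketch; the short `text` value is a made-up stand-in for the Wikipedia paragraph.

```python
# A short stand-in for the Wikipedia paragraph (assumption for illustration).
text = "Machine learning is fun"

# Iterating over a str yields one character at a time.
as_characters = list(text)[:7]

# Iterating over a single character yields that same character again,
# so a "sentence" made of one character contains one single-character "word".
inner = [list(ch) for ch in text[:3]]

# Splitting on whitespace first gives word tokens; wrapping the result in a
# list produces a corpus that holds one tokenized "sentence".
corpus = [text.split()]

print(as_characters)
print(inner)
print(corpus[0][0])
```

The last line prints a whole word rather than a letter, which is the shape Word2Vec expects.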

At a minimum, you want your corpus constructed more like this:

sentences = [
    "Machine learning is the subfield of computer science that, according to Arthur Samuel, gives computers the ability to learn without being explicitly programmed.[1][2][verify] Samuel, an American pioneer in the field of computer gaming and artificial intelligence, coined the term machine learning in 1959 while at IBM. Evolved from the study of pattern recognition and computational learning theory in artificial intelligence,[3] machine learning explores the study and construction of algorithms that can learn from and make predictions on data[4] – such algorithms overcome following strictly static program instructions by making data-driven predictions or decisions,[5]:2 through building a model from sample inputs. Machine learning is employed in a range of computing tasks where designing and programming explicit algorithms with good performance is difficult or infeasible; example applications include email filtering, detection of network intruders or malicious insiders working towards a data breach,[6] optical character recognition (OCR),[7] learning to rank, and computer vision.".split(),
]

This is still just one "sentence", but it will be split into word tokens on whitespace.
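Going one step further than a single whitespace-split "sentence", the text could be broken into several sentences and cleaned before training. The sketch below is one possible approach, not the answer's own method: the `tokenize` and `train_and_query` helpers, the regular expressions, the lowercasing, and the shortened `sample` text are all assumptions made for illustration. `train_and_query` needs Gensim installed and is defined but not called here; it queries through `model.wv.most_similar`.

```python
import re


def tokenize(text):
    """Split raw text into sentences, then each sentence into lowercase tokens."""
    # Drop bracketed citation markers such as [1] or [verify].
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    raw_sentences = re.split(r"(?<=[.!?])\s+", text)
    corpus = []
    for raw in raw_sentences:
        # Keep alphanumeric runs, allowing internal hyphens (e.g. data-driven).
        tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", raw.lower())
        if tokens:
            corpus.append(tokens)
    return corpus


def train_and_query(corpus, word, topn=5):
    """Train a tiny model and return the neighbours of `word` (requires gensim)."""
    from gensim.models import Word2Vec

    model = Word2Vec(corpus, min_count=1)
    return model.wv.most_similar(word, topn=topn)


# Shortened sample text (assumption); the real input would be the full paragraph.
sample = (
    "Machine learning is the subfield of computer science.[1][2] "
    "Samuel coined the term machine learning in 1959 while at IBM."
)
corpus = tokenize(sample)
print(len(corpus))
print(corpus[0][:3])
```

With Gensim available, `train_and_query(corpus, "learning")` would return a list of (word, similarity) pairs, though on a corpus this small the rankings carry little meaning.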

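Note also that useful word2vec results require large, varied text samples; toy-sized examples generally won't show the word similarities or relative word arrangements that word2vec is known for creating.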

【Discussion】:

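Thanks for the answer. By splitting into words, the model treats "machine learning" as "machine" and "learning", doesn't it? One solution is to use bigrams. But then I might miss three-word phrases? So what can I do? Is there any phrase support for this?
By the way, I tried the split() approach. However, I still get character-level output :(
Yes, a simple tokenization will separate "machine" and "learning". How you preprocess your data to decide which tokens are passed to Word2Vec is up to you. Gensim includes a Phrases class that can promote some paired tokens to bigrams based on statistical frequency; it can be applied in multiple passes to create larger n-grams. But that is separate from the Word2Vec operation.
sentences needs to be a list of items, where each item is itself a list of strings. The code I provided should work; if you still see only a single-character vocabulary, are you sure you did the same thing? If you did it right, printing sentences[0][0] should return "Machine".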
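The bigram idea raised in the comments above can be illustrated with a small sketch. `merge_frequent_bigrams` is a hypothetical, simplified stand-in that joins adjacent pairs purely by raw count; Gensim's Phrases class uses a statistical score instead, so its output may differ. `gensim_bigrams` shows the corresponding Phrases call, requires Gensim, and is defined but not called here. The `min_count` and `threshold` values are arbitrary choices for a toy corpus.

```python
from collections import Counter


def merge_frequent_bigrams(corpus, min_count=2):
    """Join adjacent token pairs seen at least `min_count` times with '_'."""
    pair_counts = Counter(
        pair for sentence in corpus for pair in zip(sentence, sentence[1:])
    )
    merged_corpus = []
    for sentence in corpus:
        merged, i = [], 0
        while i < len(sentence):
            pair = tuple(sentence[i:i + 2])
            if len(pair) == 2 and pair_counts[pair] >= min_count:
                merged.append("_".join(pair))
                i += 2  # both tokens were consumed by the merged phrase
            else:
                merged.append(sentence[i])
                i += 1
        merged_corpus.append(merged)
    return merged_corpus


def gensim_bigrams(corpus):
    """Same idea via gensim's Phrases class (requires gensim)."""
    from gensim.models.phrases import Phrases

    bigram = Phrases(corpus, min_count=1, threshold=1)
    return [bigram[sentence] for sentence in corpus]


# Toy corpus (assumption), already tokenized into lists of words.
corpus = [
    ["machine", "learning", "is", "a", "subfield"],
    ["samuel", "coined", "the", "term", "machine", "learning"],
]
merged = merge_frequent_bigrams(corpus)
print(merged[0])
print(merged[1])
```

Running the merge a second time on its own output is how longer phrases could form, since an already merged token such as machine_learning can pair with its neighbour. The merged corpus would then be passed to Word2Vec in place of the original one.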