How to do text classification using word2vec: answers

【Question title】: How to do Text classification using word2vec
【Posted】: 2018-09-13 14:24:35
【Question description】:

I want to perform text classification using word2vec. I have the vectors for the words.

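import numpy as np
from gensim.models import Word2Vec

# `lines` holds the raw text; split it into sentences, then each sentence into words
ls = []
sentences = lines.split(".")
for i in sentences:
    ls.append(i.split())
# gensim 3.x parameter name; gensim 4.x renamed `size` to `vector_size`
model = Word2Vec(ls, min_count=1, size=4)
words = list(model.wv.vocab)  # gensim 4.x: model.wv.index_to_key
print(words)
# collect one vector per vocabulary word
vectors = []
for word in words:
    vectors.append(model.wv[word].tolist())
data = np.array(vectors)
data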

Output:

array([[ 0.00933912,  0.07960335, -0.04559333,  0.10600036],
       [ 0.10576613,  0.07267512, -0.10718666, -0.00804013],
       [ 0.09459028, -0.09901826, -0.07074171, -0.12022413],
       [-0.09893986,  0.01500741, -0.04796079, -0.04447284],
       [ 0.04403428, -0.07966098, -0.06460238, -0.07369237],
       [ 0.09352681, -0.03864434, -0.01743148,  0.11251986],.....])

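How can I do the classification (product vs. non-product)?

【Question comments】:

Tags: python-3.x word2vec gensim text-classification

【Solution 1】:

Your question is rather broad, but I will try to give you a first approach to classifying text documents.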

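First, I would decide how to represent each document as a single vector. So you need a method that takes a list of vectors (one per word) and returns one vector. You want to avoid the length of the document influencing what this vector represents. You could, for example, take the mean.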

def document_vector(array_of_word_vectors):
    return array_of_word_vectors.mean(axis=0)

In your code, array_of_word_vectors would be, for example, data.
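To make the averaging step concrete, here is a minimal sketch. The word_vectors dictionary and the embed helper are hypothetical stand-ins for a trained model (they are not part of the question's code), and documents with no known words are not handled.

```python
import numpy as np

# Hypothetical 4-dimensional word vectors, standing in for a trained model's output
word_vectors = {
    "cheap":  np.array([0.1, 0.2, 0.0, 0.4]),
    "laptop": np.array([0.3, 0.0, 0.2, 0.0]),
    "sale":   np.array([0.2, 0.4, 0.4, 0.2]),
}

def document_vector(array_of_word_vectors):
    # Average over the word axis; the result has the embedding dimension
    return array_of_word_vectors.mean(axis=0)

def embed(tokens):
    # Hypothetical helper: stack the vectors of known tokens, skipping unknown words
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    return document_vector(np.array(known))

short_doc = embed(["cheap", "laptop"])
long_doc = embed(["cheap", "laptop", "cheap", "laptop", "unknown"])

print(short_doc.shape)                   # (4,)
print(np.allclose(short_doc, long_doc))  # True: same word mix, different length
```

Because the mean divides by the number of words, a document that repeats the same words twice ends up with the same vector as the shorter one, which is the length independence mentioned above.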

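Now you can either play around with distances (cosine distance would be a good first choice) and see how far certain documents are from each other, or, and this is probably the approach that gives faster results, you can use the document vectors to build a training set for a classification algorithm of your choice from scikit-learn, for example logistic regression.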
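As an illustration of the distance idea, the sketch below computes the cosine distance between a few document vectors. The vectors doc_a, doc_b and doc_c are made up for the example, and the function assumes neither vector is all zeros.

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - cosine similarity: about 0 for the same direction, 1 for orthogonal vectors
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical document vectors
doc_a = np.array([0.2, 0.1, 0.1, 0.2])
doc_b = np.array([0.4, 0.2, 0.2, 0.4])    # same direction as doc_a, twice the length
doc_c = np.array([0.1, -0.2, 0.2, -0.1])  # orthogonal to doc_a

print(cosine_distance(doc_a, doc_b))  # close to 0
print(cosine_distance(doc_a, doc_c))  # close to 1
```

Cosine distance ignores vector length and looks only at direction, which fits averaged word vectors reasonably well as a first try.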

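The document vectors would become your matrix X, and your vector y is an array of 1s and 0s, depending on the binary category you want the documents to be classified into.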
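A minimal sketch of that last step, assuming the document vectors are already computed. The rows of X and the labels in y are invented for illustration (1 = product, 0 = non-product), and the two dimensions are far fewer than a real embedding would have.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical document vectors (one row per document) with made-up labels:
# 1 = product, 0 = non-product
X = np.array([
    [0.9, 0.1], [0.8, 0.2], [0.7, 0.0], [1.0, 0.3],
    [0.1, 0.9], [0.2, 0.8], [0.0, 0.7], [0.3, 1.0],
])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# Fit the classifier on the training set
clf = LogisticRegression().fit(X, y)

# Classify two unseen document vectors
new_docs = np.array([[0.85, 0.15], [0.1, 0.8]])
predictions = clf.predict(new_docs)
print(predictions)
```

With real data you would hold out part of the documents (for example with sklearn.model_selection.train_test_split) and report the score on that held-out part rather than on the training set.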

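【Comments】:

【Solution 2】:

You already have the array of word vectors via model.wv.syn0. If you print it, you can see an array containing the corresponding vector for each word.

You can see an example here, using Python 3: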

import pandas as pd
import gensim
import nltk as nl
from sklearn.linear_model import LogisticRegression


# Read a CSV file with text data and lowercase every cell
dbFilepandas = pd.read_csv('machine learning\\Python\\dbSubset.csv').apply(lambda x: x.astype(str).str.lower())

train = []
# Use only the first 4 columns of the file; each row contributes 4 strings
for sentences in dbFilepandas[dbFilepandas.columns[0:4]].values:
    train.extend(sentences)

# Create a list of token lists using nltk (word_tokenize needs NLTK's tokenizer data to be downloaded)
tokens = [nl.word_tokenize(sentences) for sentences in train]

Now it is time to use the vector model; in this example we will fit a LogisticRegression.

# method 1 - pass the tokens to the Word2Vec constructor, which builds the vocabulary and trains in one step
# (gensim 3.x parameter names; gensim 4.x renamed `size` to `vector_size`)
model = gensim.models.Word2Vec(tokens, size=300, min_count=1, workers=4)

# method 2 - create an untrained Word2Vec object, then build the vocabulary and train explicitly
# (an alternative to method 1; running both replaces the first model)
model = gensim.models.Word2Vec(size=300, min_count=1, workers=4)
# building vocabulary for training
model.build_vocab(tokens)
print("\n Training the word2vec model...\n")
# reducing the epochs will decrease the computation time
model.train(tokens, total_examples=len(tokens), epochs=4000)
# You can save your model if you want....

# The two datasets must be the same size
# (syn0 is the gensim 3.x name; gensim 4.x exposes the same matrix as model.wv.vectors)
max_dataset_size = len(model.wv.syn0)

Y_dataset = []
# Take the last character of each line of the file; in this case it is the department number.
# This will be the 0 or 1, or another kind of class label (this approach only works for single-digit numeric labels; word labels need to be extracted differently)
with open("dbSubset.csv", "r") as f:
    for line in f:
        lastchar = line.strip()[-1]
        if lastchar.isdigit():
            result = int(lastchar)
            Y_dataset.append(result)
        else:
            # lines that do not end in a digit are not added to Y_dataset
            result = 40


clf = LogisticRegression(random_state=0, solver='lbfgs', multi_class='multinomial').fit(model.wv.syn0, Y_dataset[:max_dataset_size])

# Prediction of the first 15 samples of all features
predict = clf.predict(model.wv.syn0[:15, :])
# Calculating the score of the predictions
score = clf.score(model.wv.syn0, Y_dataset[:max_dataset_size])
print("\nPrediction word2vec : \n", predict)
print("Score word2vec : \n", score)

You can also compute the similarity of words that belong to the vocabulary of the model you created:

print("\n\nSimilarity value : ",model.wv.similarity('women','men'))

You can find more functions to use here.

【Comments】:

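In the first line you have already created the Word2Vec model. Why is it necessary to train the model on the tokens (line 4)? model.train(tokens, total_examples=len(tokens), epochs=4000)
Doesn't train.extend(sentences) create a list of characters rather than a list of tokens? Shouldn't it be train.append()?
@Joel and Krishna, are you sure the code above works? When I try to run it, it shows the error message: AttributeError: 'KeyedVectors' object has no attribute 'syn0'. I think the problem is here: model.wv.syn0
@tursunWali When I wrote the code, it was working. When you run it, there may be issues caused by library version changes. However, you have the code base; it is just a matter of updating a few parts of the code to make it run smoothly :) I wish I could help you more, but I am currently on vacation and the answer is from 2018, so I don't remember :/
Please share the library versions; I will downgrade the libraries and try again. Thank you. Well, I would be happy if I could run either your code or mine: stackoverflow.com/questions/68494094/…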
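Regarding the AttributeError discussed above: gensim 4.x removed syn0, and the same matrix is available as model.wv.vectors. One possible way to keep the code working across versions is a small lookup helper like the sketch below. The name word_vector_matrix is hypothetical, and the SimpleNamespace objects are stand-ins for model.wv so that the example runs without gensim installed.

```python
from types import SimpleNamespace

import numpy as np

def word_vector_matrix(keyed_vectors):
    # Return the (vocab_size, dim) matrix under whichever attribute name exists:
    # `vectors` (newer gensim) is tried first, then `syn0` (older gensim)
    for name in ("vectors", "syn0"):
        matrix = getattr(keyed_vectors, name, None)
        if matrix is not None:
            return np.asarray(matrix)
    raise AttributeError("no 'vectors' or 'syn0' attribute found")

# Stand-ins for model.wv under the two naming schemes (not real gensim objects)
new_style = SimpleNamespace(vectors=np.zeros((2, 3)))
old_style = SimpleNamespace(syn0=np.ones((2, 3)))

print(word_vector_matrix(new_style).shape)  # (2, 3)
print(word_vector_matrix(old_style).shape)  # (2, 3)
```

With a real model the call would be word_vector_matrix(model.wv). The size argument of Word2Vec would still need to be changed to vector_size on gensim 4.x.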