[Question title]: Using Word2Vec in a scikit-learn pipeline
[Posted]: 2021-03-17 16:56:41
[Question]:

I am trying to run word2vec (w2v) on this data sample:

Statement              Label
Says the Annies List political group supports third-trimester abortions on demand.       FALSE
When did the decline of coal start? It started when natural gas took off that started to begin in (President George W.) Bushs administration.         TRUE
Hillary Clinton agrees with John McCain "by voting to give George Bush the benefit of the doubt on Iran."     TRUE
Health care reform legislation is likely to mandate free sex change surgeries.    FALSE
The economic turnaround started at the end of my term.     TRUE
The Chicago Bears have had more starting quarterbacks in the last 10 years than the total number of tenured (UW) faculty fired during the last two decades.    TRUE
Jim Dunnam has not lived in the district he represents for years now.    FALSE

using the code provided in this GitHub folder (FeatureSelection.py):

https://github.com/nishitpatel01/Fake_News_Detection

I want to include word2vec features in my Naive Bayes model. First I defined X and y and used train_test_split:

X = df['Statement']
y = df['Label']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

dataset = pd.concat([X_train, y_train], axis=1)

This is the code I am using so far:

#Using Word2Vec
# note: in Python 3 np.array(map(float, ...)) builds a 0-d object array,
# so the values are parsed with dtype=float instead
with open("glove.6B.50d.txt", "r", encoding="utf-8") as lines:
    w2v = {line.split()[0]: np.array(line.split()[1:], dtype=float)
           for line in lines}

training_sentences = DataPrep.train_news['Statement']

model = gensim.models.Word2Vec(training_sentences, size=100) # Word2Vec expects tokenized texts (lists of tokens)
w2v = dict(zip(model.wv.index2word, model.wv.syn0))


class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # if a text is empty we should return a vector of zeros
        # with the same dimensionality as all the other vectors
        self.dim = len(word2vec.itervalues().next())

    def fit(self, X, y): # what are X and y?
        return self

    def transform(self, X): # should it be training_sentences?
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])
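For reference, `transform` expects pre-tokenized texts (iterables of token lists), not raw strings, and `itervalues().next()` is Python 2 syntax. A self-contained sketch of the same class with the Python 3 equivalent, using a toy embedding dict (the two-dimensional vectors below are made up purely for illustration):

```python
import numpy as np

class MeanEmbeddingVectorizer:
    def __init__(self, word2vec):
        self.word2vec = word2vec
        # Python 3 equivalent of word2vec.itervalues().next()
        self.dim = len(next(iter(word2vec.values())))

    def fit(self, X, y=None):  # X and y are ignored; nothing is learned
        return self

    def transform(self, X):  # X: iterable of token lists, e.g. tokenized statements
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

# Toy two-dimensional embeddings standing in for GloVe/Word2Vec vectors
w2v_demo = {"coal": np.array([1.0, 0.0]), "gas": np.array([0.0, 1.0])}
vec = MeanEmbeddingVectorizer(w2v_demo)
features = vec.fit(None).transform([["coal", "gas"], ["unrelated"]])
# features has one row per document; a document with no known words maps to zeros
```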


"""
class TfidfEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.word2weight = None
        self.dim = len(word2vec.itervalues().next())
    def fit(self, X, y):
        tfidf = TfidfVectorizer(analyzer=lambda x: x)
        tfidf.fit(X)
        # if a word was never seen - it must be at least as infrequent
        # as any of the known words - so the default idf is the max of 
        # known idf's
        max_idf = max(tfidf.idf_)
        self.word2weight = defaultdict(
            lambda: max_idf,
            [(w, tfidf.idf_[i]) for w, i in tfidf.vocabulary_.items()])
        return self
    def transform(self, X):
        return np.array([
                np.mean([self.word2vec[w] * self.word2weight[w]
                         for w in words if w in self.word2vec] or
                        [np.zeros(self.dim)], axis=0)
                for words in X
            ])
"""

In classifier.py, I am running:

nb_pipeline = Pipeline([
        ('NBCV',FeaturesSelection.w2v),
        ('nb_clf',MultinomialNB())])

But this does not work, and I get this error:

TypeError                                 Traceback (most recent call last)
<ipython-input-14-07045943a69c> in <module>
      2 nb_pipeline = Pipeline([
      3         ('NBCV',FeaturesSelection.w2v),
----> 4         ('nb_clf',MultinomialNB())])

/anaconda3/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     71                           FutureWarning)
     72         kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 73         return f(**kwargs)
     74     return inner_f
     75 

/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in __init__(self, steps, memory, verbose)
    112         self.memory = memory
    113         self.verbose = verbose
--> 114         self._validate_steps()
    115 
    116     def get_params(self, deep=True):

/anaconda3/lib/python3.7/site-packages/sklearn/pipeline.py in _validate_steps(self)
    160                                 "transformers and implement fit and transform "
    161                                 "or be the string 'passthrough' "
--> 162                                 "'%s' (type %s) doesn't" % (t, type(t)))
    163 
    164         # We allow last estimator to be None as an identity transformation

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '{' ': array([-0.17019527,  0.32363772, -0.0770281 , -0.0278154 , -0.05182227, ....

I am using all the programs from that folder, so the code is reproducible if you use them.

It would be great if you could explain how to fix it and what other changes the code needs. My goal is to compare models (Naive Bayes, Random Forest, etc.) using BoW, TF-IDF, and Word2Vec features.
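For that kind of comparison, one common approach is to build one Pipeline per feature/classifier pair and score each the same way. A sketch with a hypothetical toy corpus (in the real program this would be df['Statement'] / df['Label']); the Word2Vec branch is omitted here because it needs the custom transformer discussed below:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy corpus, for illustration only
X_train = ["coal is declining", "gas took off", "free surgeries mandated",
           "the turnaround started at the end of my term"]
y_train = ["TRUE", "TRUE", "FALSE", "TRUE"]

pipelines = {
    "BoW + NB": Pipeline([("vec", CountVectorizer()), ("clf", MultinomialNB())]),
    "TF-IDF + NB": Pipeline([("vec", TfidfVectorizer()), ("clf", MultinomialNB())]),
    "BoW + RF": Pipeline([("vec", CountVectorizer()),
                          ("clf", RandomForestClassifier(random_state=0))]),
}

scores = {}
for name, pipe in pipelines.items():
    pipe.fit(X_train, y_train)
    scores[name] = pipe.score(X_train, y_train)  # training accuracy, for illustration
```

In practice the score would of course be computed on the held-out test split, not the training data.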

Update:

Following the answer below (from Ismail), I updated the code as follows:

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec, size=100):
        self.word2vec = word2vec
        self.dim = size

#building Linear SVM classifier
svm_pipeline = Pipeline([
        ('svmCV',FeaturesSelection_W2V.MeanEmbeddingVectorizer(FeaturesSelection_W2V.w2v)),
        ('svm_clf',svm.LinearSVC())
        ])

svm_pipeline.fit(DataPrep.train_news['Statement'], DataPrep.train_news['Label'])
predicted_svm = svm_pipeline.predict(DataPrep.test_news['Statement'])
np.mean(predicted_svm == DataPrep.test_news['Label'])

However, I am still getting errors.

[Comments]:

  • Can you come up with a minimal reproducible example that runs from start to finish? As for if it is the right way to proceed within the FS program: replacing sparse tf-idf with dense word2vec is certainly possible, but if that is your end goal, it will not get you any closer to identifying fake news. For that you would need to extract facts and compare them against what you believe the facts to be.
  • When I uncomment the model and w2v from the program in the link, I get this error: all intermediate steps should be transformers and implement fit and transform, or be the string 'passthrough'. So I think a step is missing; I would appreciate it if someone could explain and show what is missing and how I can fix it.
  • For people to help you, your error should be reproducible (you might consider updating your question with the error message, BTW), and it should be minimal. That is the problem here: there was too much code. Please consider How to Ask and minimal reproducible example. By the way, your sklearn transformer should inherit from BaseEstimator and TransformerMixin in order to work in a sklearn Pipeline, but I do not know whether that alone is enough to make your program run, since I do not know how to run it.
  • Please see the update. I do not know how to improve the code further; everything is in the link (for a reproducible example). Many thanks.
  • The error is clear (but it is the least of your problems): each step of a Pipeline must implement fit and transform. Instead, you are passing a dict. It is the least of your problems because the code defining w2v makes no sense in the first place: you first load the pre-trained vectors into a dictionary, then define a Word2Vec model, then create a dictionary zipping the words with the model's untrained vectors (.-.)
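As the comments point out, each intermediate Pipeline step must be an object exposing fit and transform. A minimal skeleton inheriting from BaseEstimator and TransformerMixin (a sketch only; the class name MeanEmbeddingTransformer and its size parameter are chosen here for illustration):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MeanEmbeddingTransformer(BaseEstimator, TransformerMixin):
    """Averages word vectors per document so it can sit inside a sklearn Pipeline."""

    def __init__(self, word2vec, size=100):
        self.word2vec = word2vec
        self.size = size

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        # X is expected to be an iterable of token lists
        return np.array([
            np.mean([self.word2vec[w] for w in doc if w in self.word2vec]
                    or [np.zeros(self.size)], axis=0)
            for doc in X
        ])
```

Inheriting from TransformerMixin provides fit_transform for free, and BaseEstimator makes the constructor arguments visible to get_params/set_params (e.g. for GridSearchCV).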

Tags: python scikit-learn gensim word2vec


[Solution 1]:

Step 1. FeatureSelection.w2v is a dict, so it has no fit or fit_transform method, which every Pipeline step needs. MultinomialNB also requires non-negative values, so it would not work either way. So I decided to add a preprocessing stage that rescales the negative values:

from sklearn.preprocessing import MinMaxScaler

nb_pipeline = Pipeline([
        ('NBCV',MeanEmbeddingVectorizer(FeatureSelection.w2v)),
        ('nb_norm', MinMaxScaler()),
        ('nb_clf',MultinomialNB())
    ])

...instead of

nb_pipeline = Pipeline([
        ('NBCV',FeatureSelection.w2v),
        ('nb_clf',MultinomialNB())
    ])

Step 2. I got an error on word2vec.itervalues().next() (that is Python 2 syntax). So I decided to replace the dimension lookup with a predefined size matching the Word2Vec vector size:

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec, size=100):
        self.word2vec = word2vec
        self.dim = size

...instead of

class MeanEmbeddingVectorizer(object):
    def __init__(self, word2vec):
        self.word2vec = word2vec
        self.dim = len(word2vec.itervalues().next())
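Putting both steps together, an end-to-end sketch of the corrected pipeline (the toy embedding dict and token lists below are made up to stand in for FeatureSelection.w2v and the tokenized statements):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import MultinomialNB

class MeanEmbeddingVectorizer:
    def __init__(self, word2vec, size=100):
        self.word2vec = word2vec
        self.dim = size  # predefined dimensionality, as in Step 2

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.word2vec[w] for w in words if w in self.word2vec]
                    or [np.zeros(self.dim)], axis=0)
            for words in X
        ])

# Toy 4-dimensional embeddings; in the real program this is FeatureSelection.w2v
rng = np.random.default_rng(0)
w2v = {w: rng.normal(size=4) for w in ["coal", "gas", "tax", "health"]}

X_train = [["coal", "gas"], ["tax"], ["health", "tax"], ["gas"]]
y_train = ["TRUE", "FALSE", "FALSE", "TRUE"]

nb_pipeline = Pipeline([
    ("NBCV", MeanEmbeddingVectorizer(w2v, size=4)),
    ("nb_norm", MinMaxScaler()),   # Step 1: MultinomialNB rejects negative features
    ("nb_clf", MultinomialNB()),
])
nb_pipeline.fit(X_train, y_train)
preds = nb_pipeline.predict(X_train)
```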

[Discussion]:

  • Thanks Ismail. It does not work for me; I get this error: AttributeError: 'dict' object has no attribute 'itervalues', which comes from self.dim. I also have a question about word2vec in that code: does using X in model = gensim.models.Word2Vec(X, size=100) work for you, or do you need training_sentences instead of X?
  • @LdM I have tested it and it works. However, the accuracy is between 54% and 57%. I think you should add more stages to improve the accuracy.
  • Sorry Ismail, but I still get the error AttributeError: 'dict' object has no attribute 'itervalues'. How did you fix/handle it?
  • @LdM I have added the steps. Please see Step 2.
  • Unfortunately, the changes did not solve it. Now I get a new error: AttributeError: 'int' object has no attribute 'transform', raised by nb_pipeline.fit(DataPrep....). The same happens with other learners when I use FeatureSelection.MeanEmbeddingVectorizer(FeatureSelection.w2v).