[Question title]: TfIdf Vectorizer weights
[Posted]: 2018-06-04 13:21:56
[Question]:

Hi, I have a lemmatized text in the format shown in lemma below. I want to get the TfIdf score of each word, and this is the function I wrote:

import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import TfidfVectorizer

lemma=["'Ah", 'yes', u'say', 'softly', 'Harry', 
       'Potter', 'Our', 'new', 'celebrity', 'You', 
       'learn', 'subtle', 'science', 'exact', 'art', 
       'potion-making', u'begin', 'He', u'speak', 'barely', 
       'whisper', 'caught', 'every', 'word', 'like', 
       'Professor', 'McGonagall', 'Snape', 'gift', 
       u'keep', 'class', 'silent', 'without', 'effort', 
       'As', 'little', 'foolish', 'wand-waving', 'many', 
       'hardly', 'believe', 'magic', 'I', 'dont', 'expect', 'really', 
       'understand', 'beauty']

def Tfidf_Vectorize(lemmas_name):

    vect = TfidfVectorizer(stop_words='english',ngram_range=(1,2))
    vect_transform = vect.fit_transform(lemmas_name)    

    # First approach of creating a dataframe of weight & feature names

    vect_score = np.asarray(vect_transform.mean(axis=0)).ravel().tolist()
    vect_array = pd.DataFrame({'term': vect.get_feature_names(), 'weight': vect_score})
    vect_array.sort_values(by='weight',ascending=False,inplace=True)

    # Second approach of getting the feature names

    vect_fn = np.array(vect.get_feature_names())    
    sorted_tfidf_index = vect_transform.max(0).toarray()[0].argsort()

    print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))

    return vect_array

tf_dataframe=Tfidf_Vectorize(lemma)
print(tf_dataframe.iloc[:5,:])

The output I get:

print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))

Largest Tfidf: 
[u'yes' u'fools' u'fury' u'gale' u'ghosts' u'gift' u'glory' u'glow' u'good'
 u'granger']

The result of tf_dataframe:

           term    weight
261       snape  0.027875
238         say  0.022648
211      potter  0.013937
181        mind  0.010453
123       harry  0.010453
 60        dark  0.006969
 75  dumbledore  0.006969
311       voice  0.005226
125        head  0.005226
231         ron  0.005226

Shouldn't both approaches lead to the same result for the top features? I just want to compute the tf-idf scores and get the top 5 features/weights. What am I doing wrong?

[Comments]:

    Tags: python nlp nltk data-analysis tf-idf


    [Solution 1]:

    I'm not sure what I'm looking at here, but I have the feeling that you're using the TfidfVectorizer incorrectly. Correct me, though, in case I have the wrong idea about what you're trying to do.

    So, what you need is a list of documents that you feed to fit_transform(). From those you can construct a matrix in which, for example, each column represents a document and each row a word. A cell in that matrix is then the tf-idf score of word i in document j.

    Here's an example:

    import numpy as np
    import pandas as pd
    
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    documents = [
        "This is a document.",
        "This is another document with slightly more text.",
        "Whereas this is yet another document with even more text than the other ones.",
        "This document is awesome and also rather long.",
        "The car he drove was red."
    ]
    
    document_names = ['Doc {:d}'.format(i) for i in range(len(documents))]
    
    def get_tfidf(docs, ngram_range=(1,1), index=None):
        vect = TfidfVectorizer(stop_words='english', ngram_range=ngram_range)
        # use the docs parameter, not the global documents list
        tfidf = vect.fit_transform(docs).todense()
        return pd.DataFrame(tfidf, columns=vect.get_feature_names(), index=index).T
    
    print(get_tfidf(documents, ngram_range=(1,2), index=document_names))
    

    This gives you:

                        Doc 0     Doc 1     Doc 2     Doc 3     Doc 4
    awesome               0.0  0.000000  0.000000  0.481270  0.000000
    awesome long          0.0  0.000000  0.000000  0.481270  0.000000
    car                   0.0  0.000000  0.000000  0.000000  0.447214
    car drove             0.0  0.000000  0.000000  0.000000  0.447214
    document              1.0  0.282814  0.282814  0.271139  0.000000
    document awesome      0.0  0.000000  0.000000  0.481270  0.000000
    document slightly     0.0  0.501992  0.000000  0.000000  0.000000
    document text         0.0  0.000000  0.501992  0.000000  0.000000
    drove                 0.0  0.000000  0.000000  0.000000  0.447214
    drove red             0.0  0.000000  0.000000  0.000000  0.447214
    long                  0.0  0.000000  0.000000  0.481270  0.000000
    ones                  0.0  0.000000  0.501992  0.000000  0.000000
    red                   0.0  0.000000  0.000000  0.000000  0.447214
    slightly              0.0  0.501992  0.000000  0.000000  0.000000
    slightly text         0.0  0.501992  0.000000  0.000000  0.000000
    text                  0.0  0.405004  0.405004  0.000000  0.000000
    text ones             0.0  0.000000  0.501992  0.000000  0.000000
    

    The two approaches you've shown for getting the words with their respective scores compute the mean over all documents and take each word's maximum score, respectively.

    So let's do that and compare the two approaches:

    # pass the document labels defined above; 'index' alone would be undefined here
    df = get_tfidf(documents, ngram_range=(1,2), index=document_names)
    
    print(pd.DataFrame([df.mean(1), df.max(1)], index=['score_mean', 'score_max']).T)
    

    We can see that the scores are, of course, not the same.

                       score_mean  score_max
    awesome              0.096254   0.481270
    awesome long         0.096254   0.481270
    car                  0.089443   0.447214
    car drove            0.089443   0.447214
    document             0.367353   1.000000
    document awesome     0.096254   0.481270
    document slightly    0.100398   0.501992
    document text        0.100398   0.501992
    drove                0.089443   0.447214
    drove red            0.089443   0.447214
    long                 0.096254   0.481270
    ones                 0.100398   0.501992
    red                  0.089443   0.447214
    slightly             0.100398   0.501992
    slightly text        0.100398   0.501992
    text                 0.162002   0.405004
    text ones            0.100398   0.501992
    

    Note:

    You can convince yourself that this is the same as calling max()/mean() on the output of the TfidfVectorizer directly:

    vect = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
    tfidf = vect.fit_transform(documents)
    print(tfidf.max(0))
    print(tfidf.mean(0))
    

    [Discussion]:

    • Thanks for such a detailed response. My understanding of tf-idf was a bit flawed; I thought my strings all together were just one document.