【Posted】: 2018-06-04 13:21:56
【Question】:
Hello, I have lemmatized text in the format shown in lemma below. I want to get the TF-IDF score of each word, and this is the function I wrote:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
lemma=["'Ah", 'yes', u'say', 'softly', 'Harry',
'Potter', 'Our', 'new', 'celebrity', 'You',
'learn', 'subtle', 'science', 'exact', 'art',
'potion-making', u'begin', 'He', u'speak', 'barely',
'whisper', 'caught', 'every', 'word', 'like',
'Professor', 'McGonagall', 'Snape', 'gift',
u'keep', 'class', 'silent', 'without', 'effort',
'As', 'little', 'foolish', 'wand-waving', 'many',
'hardly', 'believe', 'magic', 'I', 'dont', 'expect', 'really',
'understand', 'beauty']
def Tfidf_Vectorize(lemmas_name):
    vect = TfidfVectorizer(stop_words='english', ngram_range=(1, 2))
    vect_transform = vect.fit_transform(lemmas_name)
    # First approach of creating a dataframe of weight & feature names
    vect_score = np.asarray(vect_transform.mean(axis=0)).ravel().tolist()
    vect_array = pd.DataFrame({'term': vect.get_feature_names(), 'weight': vect_score})
    vect_array.sort_values(by='weight', ascending=False, inplace=True)
    # Second approach of getting the feature names
    vect_fn = np.array(vect.get_feature_names())
    sorted_tfidf_index = vect_transform.max(0).toarray()[0].argsort()
    print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))
    return vect_array
tf_dataframe=Tfidf_Vectorize(lemma)
print(tf_dataframe.iloc[:5,:])
The output I get from
print('Largest Tfidf:\n{}\n'.format(vect_fn[sorted_tfidf_index[:-11:-1]]))
is
Largest Tfidf:
[u'yes' u'fools' u'fury' u'gale' u'ghosts' u'gift' u'glory' u'glow' u'good'
u'granger']
and the contents of tf_dataframe are
term weight
261 snape 0.027875
238 say 0.022648
211 potter 0.013937
181 mind 0.010453
123 harry 0.010453
60 dark 0.006969
75 dumbledore 0.006969
311 voice 0.005226
125 head 0.005226
231 ron 0.005226
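Shouldn't these two approaches give the same top features? I just want to compute the TF-IDF scores and get the top 5 features/weights. What am I doing wrong?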
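To make the comparison concrete, here is a minimal sketch of what the two approaches aggregate. The first takes the column-wise mean of the TF-IDF matrix and the second takes the column-wise max, so they need not rank terms the same way. The weights below are made up for illustration and are not produced by scikit-learn. They assume the situation that seems likely here: since fit_transform expects an iterable of documents, a flat list of words is treated as one single-word document per entry, and with the default L2 normalization such a document gives its term a weight of 1.0.

```python
# Hypothetical TF-IDF weights (not real scikit-learn output).
# Rows are documents, columns are terms. Each row mimics a one-word
# document, whose single term gets weight 1.0 after L2 normalization.
terms = ["snape", "potter", "gift"]
weights = [
    [1.0, 0.0, 0.0],  # document "Snape"
    [1.0, 0.0, 0.0],  # document "Snape" again
    [0.0, 1.0, 0.0],  # document "Potter"
    [0.0, 0.0, 1.0],  # document "gift"
]

# Transpose so that each tuple holds one term's weights across documents.
columns = list(zip(*weights))

# First approach: average weight per term (like mean(axis=0)).
col_mean = [sum(col) / len(col) for col in columns]

# Second approach: largest weight per term (like max(0)).
col_max = [max(col) for col in columns]

# Ranking by mean favours terms that occur in many documents.
top_by_mean = max(zip(terms, col_mean), key=lambda pair: pair[1])[0]

# With one-word documents every term has the same max, so a ranking
# by max carries no information and tie order decides the result.
all_max_tied = len(set(col_max)) == 1

print("mean per term:", dict(zip(terms, col_mean)))
print("max per term: ", dict(zip(terms, col_max)))
print("top by mean:", top_by_mean, "| all max values tied:", all_max_tied)
```

If this assumption holds for the real data, the mean-based ranking essentially reflects how often a word appears in the list, while the max-based ranking is dominated by ties, which would explain why the two outputs disagree.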
【Discussion】:
Tags: python nlp nltk data-analysis tf-idf