[Posted]: 2017-02-28 14:11:14
[Problem Description]:
When I run tf-idf on a set of documents, it returns a tf-idf matrix that looks like this:
(1, 12) 0.656240233446
(1, 11) 0.754552023393
(2, 6) 1.0
(3, 13) 1.0
(4, 2) 1.0
(7, 9) 1.0
(9, 4) 0.742540927053
(9, 5) 0.66980069547
(11, 19) 0.735138466738
(11, 7) 0.677916982176
(12, 18) 1.0
(13, 14) 0.697455191865
(13, 11) 0.716628394177
(14, 5) 1.0
(15, 8) 1.0
(16, 17) 1.0
(18, 1) 1.0
(19, 17) 1.0
(22, 13) 1.0
(23, 3) 1.0
(25, 6) 1.0
(26, 19) 0.476648253537
(26, 7) 0.879094103268
(28, 10) 0.532672175403
(28, 7) 0.523456282204
I want to know what this is; I can't make sense of how it is presented. When I was in debug mode I came across the attributes indices, indptr, and data, which seem related to the values shown, but I don't know what they are. The numbers are confusing: if the first element in the parentheses is the document index, as I would guess, then I don't see documents 0, 5, or 6 anywhere. Please help me figure out how this works. I do know the general idea of tf-idf from the wiki (term frequency, inverse document frequency, and so on); I just want to know what these three different numbers are and what they refer to.
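For reference, here is a minimal sketch (using a toy hand-built scipy.sparse matrix, not the real tf-idf output above) of how this "(row, col) value" printout and the indices/indptr/data attributes relate to each other:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A tiny 3x4 sparse matrix, standing in for a tf-idf output;
# rows = documents, columns = terms, values = weights.
m = csr_matrix(np.array([
    [0.0, 0.5, 0.0, 0.9],
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
]))

print(m)          # prints one "(row, col)  value" line per stored entry
print(m.data)     # the stored (non-zero) values themselves
print(m.indices)  # the column index of each stored value
print(m.indptr)   # where each row's run of entries starts in data/indices
```

A row that stores no values at all simply produces no "(row, col)" lines when the matrix is printed, which is one way a document index can appear to be missing from the output.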
The source code is:
# NOTE: helpers such as FileAccess, _stemmer, _path, _totalvocab_stemmed and
# _totalvocab_tokenized are defined elsewhere in the project.
import re

import joblib
import nltk
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# This contains the list of file names
_filenames = []
# This contains the list of contents/text in the files
_contents = []
# This is a dict of filename:content
_file_contents = {}


class KmeansClustering():
    def kmeansClusters(self):
        global _report
        self.num_clusters = 5
        km = KMeans(n_clusters=self.num_clusters)
        vocab_frame = TokenizingAndPanda().createPandaVocabFrame()
        self.tfidf_matrix, self.terms, self.dist = TfidfProcessing().getTfidFPropertyData()
        km.fit(self.tfidf_matrix)
        self.clusters = km.labels_.tolist()
        joblib.dump(km, 'doc_cluster2.pkl')
        km = joblib.load('doc_cluster2.pkl')


class TokenizingAndPanda():
    def tokenize_only(self, text):
        '''
        This function tokenizes the text.
        :param text: the text that you want to tokenize
        :return: the filtered tokens
        '''
        # first tokenize by sentence, then by word, so that punctuation is caught as its own token
        tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        return filtered_tokens

    def tokenize_and_stem(self, text):
        # first tokenize by sentence, then by word, so that punctuation is caught as its own token
        tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]
        filtered_tokens = []
        # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
        for token in tokens:
            if re.search('[a-zA-Z]', token):
                filtered_tokens.append(token)
        stems = [_stemmer.stem(t) for t in filtered_tokens]
        return stems

    def getFilnames(self):
        global _path
        global _filenames
        path = _path
        _filenames = FileAccess().read_all_file_names(path)

    def getContentsForFilenames(self):
        global _contents
        global _file_contents
        for filename in _filenames:
            content = FileAccess().read_the_contents_from_files(_path, filename)
            _contents.append(content)
            _file_contents[filename] = content

    def createPandaVocabFrame(self):
        global _totalvocab_stemmed
        global _totalvocab_tokenized
        # Enable this if you want to load the filenames and contents from a file structure.
        # self.getFilnames()
        # self.getContentsForFilenames()
        # for name, i in _file_contents.items():
        #     print(name)
        #     print(i)
        for i in _contents:
            allwords_stemmed = self.tokenize_and_stem(i)
            _totalvocab_stemmed.extend(allwords_stemmed)
            allwords_tokenized = self.tokenize_only(i)
            _totalvocab_tokenized.extend(allwords_tokenized)
        vocab_frame = pd.DataFrame({'words': _totalvocab_tokenized}, index=_totalvocab_stemmed)
        print(vocab_frame)
        return vocab_frame


class TfidfProcessing():
    def getTfidFPropertyData(self):
        tfidf_vectorizer = TfidfVectorizer(max_df=0.4, max_features=200000,
                                           min_df=0.02, stop_words='english',
                                           use_idf=True,
                                           tokenizer=TokenizingAndPanda().tokenize_and_stem,
                                           ngram_range=(1, 1))
        # print(_contents)
        tfidf_matrix = tfidf_vectorizer.fit_transform(_contents)
        terms = tfidf_vectorizer.get_feature_names()
        dist = 1 - cosine_similarity(tfidf_matrix)
        return tfidf_matrix, terms, dist
[Discussion]:
-
Are you talking about scikit-learn tf-idf? Could you post the relevant part of your code and the documents you want to extract information from?
-
Yes, it's scikit-learn tf-idf. The relevant part of my code is posted above; I hope you can help me.
Tags: python-3.x machine-learning tf-idf