【Posted at】: 2017-11-24 07:46:05
【Problem description】:
I am basically clustering some of my documents with the mini_batch_kmeans and kmeans algorithms. I am simply following the tutorial on the scikit-learn website, linked here: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html
They use several vectorization methods, one of which is HashingVectorizer. For the HashingVectorizer case, they build a pipeline with a TfidfTransformer():
# Perform an IDF normalization on the output of HashingVectorizer
hasher = HashingVectorizer(n_features=opts.n_features,
                           stop_words='english', non_negative=True,
                           norm=None, binary=False)
vectorizer = make_pipeline(hasher, TfidfTransformer())
Once I do this, the vectorizer I get back no longer has a get_feature_names() method. But since I use it for clustering, I need get_feature_names() to obtain the "terms":
terms = vectorizer.get_feature_names()
for i in range(true_k):
    print("Cluster %d:" % i, end='')
    for ind in order_centroids[i, :10]:
        print(' %s' % terms[ind], end='')
    print()
How can I fix this error?
My whole code is shown below:
X_train_vecs, vectorizer = vector_bow.count_tfidf_vectorizer(_contents)
mini_kmeans_batch = MiniBatchKmeansTechnique()
# MiniBatchKmeans without the LSA dimensionality reduction
mini_kmeans_batch.mini_kmeans_technique(number_cluster=8, X_train_vecs=X_train_vecs,
                                        vectorizer=vectorizer, filenames=_filenames,
                                        contents=_contents, is_dimension_reduced=False)
The count vectorizer is piped with tfidf:
def count_tfidf_vectorizer(self, contents):
    count_vect = CountVectorizer()
    vectorizer = make_pipeline(count_vect, TfidfTransformer())
    X_train_vecs = vectorizer.fit_transform(contents)
    print("The count of bow : ", X_train_vecs.shape)
    return X_train_vecs, vectorizer
The mini_batch_kmeans class is as follows:
class MiniBatchKmeansTechnique():
    def mini_kmeans_technique(self, number_cluster, X_train_vecs, vectorizer,
                              filenames, contents, svd=None, is_dimension_reduced=True):
        km = MiniBatchKMeans(n_clusters=number_cluster, init='k-means++', max_iter=100, n_init=10,
                             init_size=1000, batch_size=1000, verbose=True, random_state=42)
        print("Clustering sparse data with %s" % km)
        t0 = time()
        km.fit(X_train_vecs)
        print("done in %0.3fs" % (time() - t0))
        print()

        cluster_labels = km.labels_.tolist()
        print("List of the cluster names is : ", cluster_labels)
        data = {'filename': filenames, 'contents': contents, 'cluster_label': cluster_labels}
        frame = pd.DataFrame(data=data, index=[cluster_labels],
                             columns=['filename', 'contents', 'cluster_label'])
        print(frame['cluster_label'].value_counts(sort=True, ascending=False))
        print()

        grouped = frame['cluster_label'].groupby(frame['cluster_label'])
        print(grouped.mean())
        print()

        print("Top Terms Per Cluster :")
        if is_dimension_reduced and svd is not None:
            # Map the centroids back to the original term space before ranking.
            original_space_centroids = svd.inverse_transform(km.cluster_centers_)
            order_centroids = original_space_centroids.argsort()[:, ::-1]
        else:
            order_centroids = km.cluster_centers_.argsort()[:, ::-1]
        terms = vectorizer.get_feature_names()
        for i in range(number_cluster):
            print("Cluster %d:" % i, end=' ')
            for ind in order_centroids[i, :10]:
                print(' %s' % terms[ind], end=',')
            print()
            print("Cluster %d filenames:" % i, end='')
            # Select the rows whose index is this cluster label.
            for file in frame.loc[i]['filename'].values.tolist():
                print(' %s,' % file, end='')
            print()
【Problem discussion】:
-
Is the pipeline fitted? Please post the complete code.
-
Yes, I will post my entire code, please check.
-
First thing: which one are you actually using, HashingVectorizer or CountVectorizer?
-
Second, there is no need to build a pipeline out of CountVectorizer and TfidfTransformer. Use TfidfVectorizer instead.
-
Third, in the tutorial they only call get_feature_names() on the non-hashing pipeline. See the if block above where they use get_feature_names().
Tags: machine-learning scikit-learn cluster-analysis k-means