【Posted】: 2018-12-25 04:58:03
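【Question】:
I am trying to compute the silhouette score while searching for the optimal number of clusters to create, but I get the following error:
ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
I can't work out why. Below is the code I use for clustering and for computing the silhouette score.
I read a CSV containing the text to be clustered and run K-Means over a range of n_clusters values. What could be causing this error?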
#Create cluster using K-Means
#Only creates graph
import matplotlib
#matplotlib.use('Agg')
import re
import os
import nltk, math, codecs
import csv
from nltk.corpus import stopwords
from gensim.models import Doc2Vec
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import silhouette_score
model_name = checkpoint_save_path
loaded_model = Doc2Vec.load(model_name)
#Load the test csv file
data = pd.read_csv(test_filename)
overview = data['overview'].astype('str').tolist()
overview = filter(bool, overview)
vectors = []
def split_words(text):
    return ''.join([x if x.isalnum() or x.isspace() else " " for x in text]).split()

def preprocess_document(text):
    sp_words = split_words(text)
    return sp_words

for i, t in enumerate(overview):
    vectors.append(loaded_model.infer_vector(preprocess_document(t)))
sse = {}
silhouette = {}
for k in range(1,15):
    km = KMeans(n_clusters=k, max_iter=1000, verbose = 0).fit(vectors)
    sse[k] = km.inertia_
    #FOLLOWING LINE CAUSES ERROR
    silhouette[k] = silhouette_score(vectors, km.labels_, metric='euclidean')
best_cluster_size = 1
min_error = float("inf")
for cluster_size in sse:
    if sse[cluster_size] < min_error:
        min_error = sse[cluster_size]
        best_cluster_size = cluster_size
print(sse)
print("====")
print(silhouette)
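A likely cause, judging from the error text and the loop bounds: `range(1,15)` starts at k = 1, and with a single cluster every sample gets the same label, which `silhouette_score` rejects. The sketch below reproduces this on synthetic data and then starts the loop at k = 2. The `make_blobs` data is only a stand-in for the Doc2Vec vectors, and `n_init`, `random_state`, and the name `best_k` are assumptions added for illustration, not part of the original code.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Assumption: synthetic points stand in for the inferred Doc2Vec vectors.
vectors, _ = make_blobs(n_samples=200, centers=4, n_features=5, random_state=0)

# With k = 1 all samples share one label, so silhouette_score raises ValueError.
km_one = KMeans(n_clusters=1, n_init=10, random_state=0).fit(vectors)
try:
    silhouette_score(vectors, km_one.labels_, metric="euclidean")
    raised = False
except ValueError as exc:
    raised = True
    print("k=1 failed:", exc)

sse = {}
silhouette = {}
for k in range(2, 15):  # start at 2: the silhouette needs at least two clusters
    km = KMeans(n_clusters=k, n_init=10, max_iter=1000, random_state=0).fit(vectors)
    sse[k] = km.inertia_
    silhouette[k] = silhouette_score(vectors, km.labels_, metric="euclidean")

# Pick the k with the highest silhouette score (higher is better).
best_k = max(silhouette, key=silhouette.get)
print(best_k, silhouette[best_k])
```

If the SSE for k = 1 is still wanted for an elbow plot, the original range can be kept and only the silhouette call skipped when k is 1. Note also that inertia generally shrinks as k grows, so choosing the k with the smallest SSE will tend to select the largest k tried; the silhouette maximum is one possible alternative criterion.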
【Discussion】:
-
Could you add the data?
-
Which line in the code causes the error?
-
@seralouk Here is a link to the CSV/data on my Google Drive: drive.google.com/open?id=1pM0RvqyQI5IIqc_UbQL6b54p_DnnxHED
-
@R.F.Nelson Sorry, I have just marked it with a comment in the question. The following line raises the error:
silhouette_score(vectors, km.labels_, metric='euclidean')
-
Could you also upload the test_filename file?
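-
To confirm which call fails without needing the data file, one option is to check the number of distinct labels before scoring. This is a minimal sketch; `silhouette_or_none` is a hypothetical helper name, not a scikit-learn function, and it simply mirrors the "2 to n_samples - 1" condition stated in the error message.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_or_none(X, labels, metric="euclidean"):
    """Hypothetical helper: return the silhouette score, or None when the
    number of distinct labels is outside the range 2 .. n_samples - 1."""
    X = np.asarray(X)
    n_labels = len(np.unique(labels))
    if n_labels < 2 or n_labels > X.shape[0] - 1:
        return None
    return silhouette_score(X, labels, metric=metric)

# Tiny illustrative data set: two tight pairs of points far apart.
X_demo = np.array([[0.0], [0.1], [5.0], [5.1]])
print(silhouette_or_none(X_demo, [0, 0, 0, 0]))  # one label: not scorable
print(silhouette_or_none(X_demo, [0, 0, 1, 1]))  # two well-separated clusters
```

Used inside the loop from the question, a None result for a given k would indicate that K-Means produced too few (or too many) distinct labels for that k.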
Tags: python pandas machine-learning scikit-learn k-means