Answers: Elbow Method for kmeans

【Question title】: Elbow Method for kmeans
【Posted】: 2021-06-07 04:31:17
【Question description】:

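I am working on a clustering task and used the Elbow Method to find the optimal number of clusters (k), but the resulting plot is essentially a straight line, so I cannot determine k from it.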

Thanks

【Comments on the question】:

Tags: python machine-learning cluster-analysis

【Solution 1】:

There are several ways to do this. One option is to let Yellowbrick do the work: its KElbowVisualizer fits the model over a range of k values and draws the elbow plot for you.


import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the '3d' projection on older matplotlib versions

from sklearn.cluster import KMeans
from sklearn import datasets

from yellowbrick.cluster import KElbowVisualizer

mpl.rcParams["figure.figsize"] = (9,6)

# Load iris flower dataset
iris = datasets.load_iris()

# Clustering is unsupervised, so we load only the features (iris.data), not the labels (iris.target)
X = iris.data
# Convert the data to a DataFrame for inspection
feature_names = iris.feature_names
iris_dataframe = pd.DataFrame(X, columns=feature_names)
print(iris_dataframe.head(10))

# Fit a reference model with 3 clusters (we already know the Iris dataset has 3 classes)
k_means = KMeans(n_clusters=3)
k_means.fit(X)

# Plotting a 3d plot using matplotlib to visualize the data points
fig = plt.figure(figsize=(7,7))
ax = fig.add_subplot(111, projection='3d')

# Setting the colors to match cluster results
colors = ['red' if label == 0 else 'purple' if label==1 else 'green' for label in k_means.labels_]

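# Plot feature columns 3, 0 and 2 on the x, y and z axes
ax.scatter(X[:,3], X[:,0], X[:,2], c=colors)
ax.set_xlabel(feature_names[3])
ax.set_ylabel(feature_names[0])
ax.set_zlabel(feature_names[2])
plt.show()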

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(2,11))

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.show()    # Render and display the elbow plot

https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20-%20Historical%20Stock%20Prices.ipynb
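The curve that the visualizer draws can also be computed by hand, which helps when you want to inspect the raw numbers behind a plot that looks like a straight line. The sketch below is an illustration rather than part of the original answer: it assumes the same Iris data and uses KMeans.inertia_ (the within-cluster sum of squared distances) as the elbow metric. The range of k and the random_state value are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Inertia (within-cluster sum of squares) for each candidate k
k_values = list(range(2, 11))
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Print the curve; an "elbow" is the k after which the drop flattens out
for k, inertia in zip(k_values, inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

If the printed values fall by roughly the same amount at every step, there is no elbow to find, which matches the situation described in the question.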

【Comments】:

Thank you very much!

【Solution 2】:

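I suggest using the silhouette score to determine the number of clusters. It does not require you to inspect a plot and can be fully automated: try different values of k and choose the one with the highest silhouette score (the score ranges from -1 to 1, and higher means better-separated clusters). See: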

https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
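A minimal sketch of that automated selection, assuming the Iris data from Solution 1 as a stand-in for your own feature matrix. It uses sklearn.metrics.silhouette_score, for which higher values are better, so the loop keeps the k with the highest score. The candidate range and random_state are arbitrary choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data  # stand-in data; replace with your own feature matrix

# Mean silhouette score for each candidate k (needs at least 2 clusters)
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Higher silhouette is better, so pick the k with the maximum score
best_k = max(scores, key=scores.get)
print(f"best k = {best_k}, silhouette = {scores[best_k]:.3f}")
```

The k chosen this way need not match the number of known classes in a labeled dataset, so treat it as one signal rather than a final answer.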

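However, in this particular case that is unlikely to solve your problem. If the data points are spread fairly uniformly in space, meaning they do not really form any clusters, then there is no optimal value of k. Take the last row here as an example: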

https://scikit-learn.org/stable/modules/clustering.html

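k-means does technically produce distinct clusters, but they are not really separated from one another the way you would want clusters to be. In that case no value of k gives a clearly highest silhouette score, and the elbow method will not work either. This may well be your situation: there are no real clusters in the data.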
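This contrast can be reproduced on synthetic data. The sketch below is only an illustration under stated assumptions: the uniform points, the three blob centers, and the helper silhouette_by_k are all made up for this example and say nothing about the asker's actual data. It compares silhouette scores for uniformly scattered points against points drawn from three well-separated blobs.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: uniform noise vs. three well-separated blobs
rng = np.random.default_rng(0)
X_uniform = rng.uniform(size=(500, 2))
X_blobs, _ = make_blobs(
    n_samples=500,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.5,
    random_state=0,
)

def silhouette_by_k(X, k_values):
    """Return {k: mean silhouette score} for k-means fitted with each k."""
    result = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        result[k] = silhouette_score(X, labels)
    return result

k_values = range(2, 8)
uniform_scores = silhouette_by_k(X_uniform, k_values)
blob_scores = silhouette_by_k(X_blobs, k_values)

for k in k_values:
    print(f"k={k}: uniform={uniform_scores[k]:.3f}  blobs={blob_scores[k]:.3f}")
```

With real cluster structure, one k should stand out with a clearly higher score. With uniform data the scores stay low and close together, so neither the silhouette score nor the elbow plot points to a meaningful k.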

【Comments】:

So helpful! Thank you very much.