【Question title】: Elbow Method for kmeans
【Posted】: 2021-06-07 04:31:17
【Question】:

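I am working on a clustering task and used the elbow method to find the optimal number of clusters (k), but the plot I got is close to a straight line, so I cannot determine k from it.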

Thanks.


【Comments】:

    Tags: python machine-learning cluster-analysis


    【Solution 1】:
    There are several ways to do this. One option is to use Yellowbrick, which fits the model over a range of k values and draws the elbow plot for you.
    
    
    import pandas as pd
    import matplotlib as mpl 
    import matplotlib.pyplot as plt
    from mpl_toolkits.mplot3d import Axes3D
    
    from sklearn.cluster import KMeans
    from sklearn import datasets
    
    from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
    
    mpl.rcParams["figure.figsize"] = (9,6)
    
    # Load iris flower dataset
    iris = datasets.load_iris()
    
    X = iris.data  # Clustering is unsupervised, so we load only the features (iris.data), not the labels (iris.target)
    # Convert the data into a DataFrame
    feature_names = iris.feature_names
    iris_dataframe = pd.DataFrame(X, columns=feature_names)
    iris_dataframe.head(10)
    
    # Fit a preliminary model with 3 clusters (the Iris dataset is known to contain 3 classes)
    k_means = KMeans(n_clusters=3)
    k_means.fit(X)
    
    # Plotting a 3d plot using matplotlib to visualize the data points
    fig = plt.figure(figsize=(7,7))
    ax = fig.add_subplot(111, projection='3d')
    
    # Setting the colors to match cluster results
    colors = ['red' if label == 0 else 'purple' if label==1 else 'green' for label in k_means.labels_]
    
    ax.scatter(X[:,3], X[:,0], X[:,2], c=colors)
    

    # Instantiate the clustering model and visualizer
    model = KMeans()
    visualizer = KElbowVisualizer(model, k=(2,11))
    
    visualizer.fit(X)    # Fit the data to the visualizer
    visualizer.show()    # Render and display the elbow plot
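
For reference, roughly the same curve can be computed with plain scikit-learn. This is a minimal sketch, assuming the score of interest is the within-cluster sum of squared distances (`inertia_`); the `n_init` and `random_state` values are choices made here for reproducibility and are not part of the answer above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

# Within-cluster sum of squared distances (inertia) for each candidate k.
inertias = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_

# Reduction in inertia gained by going from k-1 to k clusters.
# An elbow is the k after which these reductions become small.
drops = {k: inertias[k - 1] - inertias[k] for k in range(3, 11)}

for k in range(3, 11):
    print(f"k={k}  inertia={inertias[k]:.2f}  drop={drops[k]:.2f}")
```

If the printed drops shrink gradually with no sharp change, the plot will look close to a straight line or a smooth curve, and no elbow can be read from it.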
    

    See the links below for more information.

    https://notebook.community/DistrictDataLabs/yellowbrick/examples/gokriznastic/Iris%20-%20clustering%20example

    https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20-%20Historical%20Stock%20Prices.ipynb

    【Comments】:

    • Thank you very much!
    【Solution 2】:

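    I suggest using the silhouette score to determine the number of clusters. It does not require you to look at a plot and can be fully automated: just try different values of k and pick the one with the highest silhouette score (the score ranges from -1 to 1, and higher means better-separated clusters):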

    https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html
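
The selection loop described above can be sketched as follows. The synthetic data, the range of k, and the `n_init` and `random_state` settings are assumptions chosen for illustration; replace `X` with your own feature matrix.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Illustrative data only: four well-separated groups at hand-picked centers.
centers = [[-6, -6], [-6, 6], [6, -6], [6, 6]]
X, _ = make_blobs(n_samples=500, centers=centers, cluster_std=1.0, random_state=0)

# Mean silhouette score for each candidate k (the score needs at least 2 clusters).
scores = {}
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest score.
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```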

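    However, in this particular case that probably will not solve your problem. If the data points are spread fairly evenly through the space, meaning they do not really form any clusters, then there is no optimal value of k. See the last row of the figure here for an example: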

    https://scikit-learn.org/stable/modules/clustering.html

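    k-means will technically still produce distinct clusters, but they are not really separated from each other the way you would want clusters to be. In that case the silhouette score has no clear maximum, and the elbow method will not work either. That may well be your situation: there are no real clusters in the data.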
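
One way to check for this is to compare your silhouette scores against data that has no structure by construction. The sketch below is an illustration under assumed settings (uniform random points versus three hand-placed blobs); `silhouette_by_k` is a hypothetical helper defined here, not a library function.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, ks=range(2, 9)):
    """Return {k: mean silhouette score} for k-means fitted with each k."""
    result = {}
    for k in ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        result[k] = silhouette_score(X, labels)
    return result


# Unstructured data: points spread evenly over a square.
rng = np.random.default_rng(0)
X_uniform = rng.uniform(-10, 10, size=(500, 2))

# Structured data: three well-separated blobs (centers chosen for illustration).
X_blobs, _ = make_blobs(
    n_samples=500, centers=[[-6, -6], [0, 6], [6, -6]], cluster_std=1.0, random_state=0
)

uniform_scores = silhouette_by_k(X_uniform)
blob_scores = silhouette_by_k(X_blobs)

for k in uniform_scores:
    print(f"k={k}  uniform={uniform_scores[k]:.3f}  blobs={blob_scores[k]:.3f}")
```

If the scores for your data look more like the uniform column (moderate values with no k standing out) than the blobs column, that supports the conclusion that the data has no natural clusters for k-means to find.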

    【Comments】:

    • So helpful! Thank you very much.