【Question Title】: How to get the optimal number of clusters using hierarchical cluster analysis automatically in Python?
【Posted】: 2018-11-14 15:41:00
【Question】:

I want to automatically obtain the optimal number of clusters (K) using hierarchical cluster analysis, and then apply this K to K-means clustering in Python.

I have read many articles and know that some methods tell us to plot a graph to determine K, but is there any way to automatically output a single number in Python?

【Comments】:

  • Define "optimal k" in your case.
  • I believe hierarchical clustering is itself a way of clustering values, so it makes no sense to apply one clustering algorithm, find out how many clusters it returns, and then apply another algorithm (K-means in your case). Correct me if I am wrong.
  • @Tedil It means the optimal number of clusters. I have edited my question for clarity. Thanks for the suggestion.
  • "Optimal" as measured by what? That is not objective. And the methods for choosing k in k-means are only very crude heuristics that pick a bad k as often as a good one.

Tags: python cluster-analysis hierarchical-clustering


【Solution 1】:

Hierarchical clustering methods determine the optimal number of clusters from the dendrogram. Plot the dendrogram with code similar to the following:

# General imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Special imports
from scipy.cluster.hierarchy import dendrogram, linkage

# Load data, fill in appropriately
X = []

# Labels for the dendrogram leaves, fill in appropriately
labelList = []

# How to cluster the data; 'single' is the minimal distance between clusters
linked = linkage(X, 'single')

# Plot dendrogram
plt.figure(figsize=(10, 7))
dendrogram(linked,
           orientation='top',
           labels=labelList,
           distance_sort='descending',
           show_leaf_counts=True)
plt.show()

In the dendrogram, find the largest vertical distance between merge levels and draw a horizontal line through its middle. The number of vertical lines it intersects is the optimal number of clusters (with affinity computed by the method set in the linkage).

See an example here: https://stackabuse.com/hierarchical-clustering-with-python-and-scikit-learn/

I would also like to know how to read the dendrogram automatically and extract that number.
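One common heuristic for reading the dendrogram programmatically is to cut at the largest gap between consecutive merge distances in the SciPy linkage matrix (column 2 of `Z`, which is sorted in increasing order). A minimal sketch of that idea, assuming well-separated toy data from `make_blobs`; the gap rule and the sample data are illustrative assumptions, not part of the original answer:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs

# Toy data with 4 well-separated clusters (illustrative assumption)
X, _ = make_blobs(n_samples=150,
                  centers=[[0, 0], [20, 20], [0, 20], [20, 0]],
                  cluster_std=1.0, random_state=42)

# Column 2 of the linkage matrix holds the merge distances,
# in increasing order
Z = linkage(X, method='ward')
merge_distances = Z[:, 2]

# The largest jump between consecutive merge distances is where a
# human would cut the dendrogram; cutting after merge i leaves
# len(Z) - i clusters
gaps = np.diff(merge_distances)
k = len(Z) - int(np.argmax(gaps))
print(f"Estimated number of clusters: {k}")  # prints 4 for this toy data
```

The resulting `k` can then be passed to `KMeans(n_clusters=k)`. The heuristic is only as good as the separation in the data; on noisy data it may disagree with the visual cut.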

Added in an edit: there is a way to do this with the scikit-learn package. See the following example:

#==========================================================================
# Hierarchical Clustering - Automatic determination of number of clusters
#==========================================================================

# General imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from os import path

# Special imports
from scipy.cluster.hierarchy import dendrogram, linkage
import scipy.cluster.hierarchy as shc
from sklearn.cluster import AgglomerativeClustering

# %matplotlib inline

print("============================================================")
print("       Hierarchical Clustering demo - num of clusters       ")
print("============================================================")
print(" ")


folder = path.dirname(path.realpath(__file__)) # set current folder

# Load data
customer_data = pd.read_csv( path.join(folder, "hierarchical-clustering-with-python-and-scikit-learn-shopping-data.csv"))
# print(customer_data.shape)
print("In this data there should be 5 clusters...")

# Retain only the last two columns
data = customer_data.iloc[:, 3:5].values

# # Plot dendrogram using SciPy
# plt.figure(figsize=(10, 7))
# plt.title("Customer Dendograms")
# dend = shc.dendrogram(shc.linkage(data, method='ward'))

# plt.show()


# Initialize hierarchical clustering; for the algorithm to determine the number
# of clusters itself, set n_clusters=None and compute_full_tree=True.
# The best distance threshold value for this dataset is distance_threshold=200.
# Note: the 'affinity' parameter was renamed 'metric' in scikit-learn 1.2 and
# removed in 1.4; with linkage='ward' the metric is Euclidean anyway, so it
# can simply be omitted.
cluster = AgglomerativeClustering(n_clusters=None, linkage='ward', compute_full_tree=True, distance_threshold=200)

# Cluster the data
cluster.fit_predict(data)

print(f"Number of clusters = {1+np.amax(cluster.labels_)}")

# Display the clustering, assigning a cluster label to every data point
print("Classifying the points into clusters:")
print(cluster.labels_)

# Display the clustering graphically in a plot
plt.scatter(data[:,0],data[:,1], c=cluster.labels_, cmap='rainbow')
plt.title(f"SK Learn estimated number of clusters = {1+np.amax(cluster.labels_)}")
plt.show()

print(" ")

The data is taken from here: https://stackabuse.s3.amazonaws.com/files/hierarchical-clustering-with-python-and-scikit-learn-shopping-data.csv
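To complete the pipeline the question asks about (hierarchical clustering to find K, then K-means with that K), the two steps can be chained directly: `AgglomerativeClustering` exposes the discovered count as `n_clusters_`. A minimal sketch; the toy data, cluster centers, and distance threshold here are assumptions for demonstration, not values from the original answer:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# Toy data with 3 well-separated clusters (illustrative assumption);
# in the question this would be the shopping dataset
X, _ = make_blobs(n_samples=150,
                  centers=[[0, 0], [10, 10], [0, 10]],
                  cluster_std=1.0, random_state=0)

# Step 1: let hierarchical clustering pick K via a distance threshold
# (threshold chosen for this toy data)
hier = AgglomerativeClustering(n_clusters=None, distance_threshold=25,
                               linkage='ward')
hier.fit(X)
k = hier.n_clusters_

# Step 2: run K-means with that K
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(f"K from hierarchy: {k}")  # prints 3 for this toy data
```

Whether this two-stage pipeline is worthwhile is exactly what the commenters above question; K-means mainly adds value here if you prefer its centroid-based assignment for new points.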

【Discussion】:
