How to find optimal number of clusters in hierarchical clustering using Gap statistic? Answers

【Question title】: How to find optimal number of clusters in hierarchical clustering using Gap statistic?
【Posted】: 2017-04-28 09:56:18
【Question】:

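I want to run hierarchical clustering with single linkage to cluster documents, using a data set with 300 features and 1500 observations. I want to find the optimal number of clusters for this problem.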

The link below uses the following code to find the number of clusters with the maximum gap.

http://www.sthda.com/english/wiki/determining-the-optimal-number-of-clusters-3-must-known-methods-unsupervised-machine-learning

library(cluster)     # provides clusGap()
library(factoextra)  # provides hcut() and fviz_gap_stat()

# Compute gap statistic
set.seed(123)

iris.scaled <- scale(iris[, -5])

gap_stat <- clusGap(iris.scaled, FUN = hcut, K.max = 10, B = 50)

# Plot gap statistic
fviz_gap_stat(gap_stat)

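However, hcut is not clearly defined in the link. How can I specify single-linkage hierarchical clustering for the clusGap() function?

Is there an equivalent of clusGap() in Python?

Thanks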

【Comments】:

Tags: python r cluster-analysis hierarchical-clustering unsupervised-learning

【Solution 1】:

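The hcut() function is part of the factoextra package used in the link you posted:

hcut    package:factoextra    R Documentation

Computes Hierarchical Clustering and Cut the Tree

Description: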

 Computes hierarchical clustering (hclust, agnes, diana) and cut
 the tree into k clusters. It also accepts correlation based
 distance measure methods such as "pearson", "spearman" and
 "kendall".

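R also has a built-in function, hclust(), that can be used to perform hierarchical clustering. However, it does not perform single-linkage clustering by default (its default method is "complete"), so you cannot simply replace hcut with hclust.

However, if you look at the help for clusGap(), you will see that you can supply a custom clustering function to apply:

FUNcluster: a 'function' which accepts as first argument a (data) matrix like 'x', second argument, say k, k >= 2, the number of clusters desired, and returns a 'list' with a component named (or shortened to) 'cluster' which is a vector of length 'n = nrow(x)' of integers in '1:k' determining the clustering or grouping of the 'n' observations.

The hclust() function can perform single-linkage hierarchical clustering, so you can do this: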

cluster_fun <- function(x, k) list(cluster=cutree(hclust(dist(x), method="single"), k=k))
gap_stat <- clusGap(iris.scaled, FUN=cluster_fun, K.max=10, B=50)
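
As for the Python part of the question, I am not aware of a drop-in equivalent of clusGap() in SciPy, but the computation is short enough to sketch by hand. The following is a minimal sketch, not a port of clusGap(): it assumes NumPy and SciPy are installed, draws the reference data uniformly from the bounding box of the features (clusGap() offers other reference distributions), and picks k with the rule from the original gap statistic paper (the smallest k with Gap(k) >= Gap(k+1) - s(k+1)). The function names single_linkage_labels, log_w_curve, gap_statistic and choose_k are made up for this example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def single_linkage_labels(x, k):
    """Cut a single-linkage dendrogram into (at most) k clusters, labels 1..k."""
    tree = linkage(x, method="single", metric="euclidean")
    return fcluster(tree, t=k, criterion="maxclust")


def log_within_dispersion(x, labels):
    """log(W_k): pooled within-cluster sum of squared distances to centroids."""
    w = 0.0
    for label in np.unique(labels):
        members = x[labels == label]
        w += ((members - members.mean(axis=0)) ** 2).sum()
    return np.log(w)


def log_w_curve(x, k_max):
    """log(W_k) for k = 1..k_max, building the single-linkage tree only once."""
    tree = linkage(x, method="single", metric="euclidean")
    return np.array([
        log_within_dispersion(x, fcluster(tree, t=k, criterion="maxclust"))
        for k in range(1, k_max + 1)
    ])


def gap_statistic(x, k_max=10, n_refs=50, seed=123):
    """Return (ks, gap, se) for k = 1..k_max.

    Assumption: reference sets are uniform over the bounding box of x.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(axis=0), x.max(axis=0)

    log_w = log_w_curve(x, k_max)
    ref_log_w = np.empty((n_refs, k_max))
    for b in range(n_refs):
        reference = rng.uniform(lo, hi, size=x.shape)
        ref_log_w[b] = log_w_curve(reference, k_max)

    gap = ref_log_w.mean(axis=0) - log_w
    # Standard error inflated by sqrt(1 + 1/B) to account for simulation noise.
    se = ref_log_w.std(axis=0) * np.sqrt(1.0 + 1.0 / n_refs)
    return np.arange(1, k_max + 1), gap, se


def choose_k(ks, gap, se):
    """Smallest k with Gap(k) >= Gap(k+1) - s(k+1); falls back to the largest k."""
    for i in range(len(ks) - 1):
        if gap[i] >= gap[i + 1] - se[i + 1]:
            return int(ks[i])
    return int(ks[-1])
```

With a data matrix X of shape (n_observations, n_features), ks, gap, se = gap_statistic(X, k_max=10, n_refs=50) followed by choose_k(ks, gap, se) mirrors the K.max = 10, B = 50 call above. Note that single linkage tends to split off single outlying points first, so the gap curve can look quite different from the one you get with other linkage methods.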

【Comments】: