Calculate total sum of squares between clusters in R: answers

[Question title]: Calculate total sum of squares between clusters in R
[Posted]: 2023-03-26 05:25:01
[Question]:

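My goal is to compare the two clustering methods I used, cluster_method_1 and cluster_method_2, and find out which one has the largest between-cluster sum of squares, in order to identify which one achieves better separation.

I am basically looking for an efficient way to calculate the distance between each point of cluster 1 and all points of clusters 2, 3, 4, and so on.

Sample data frame: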

df <- structure(list(x1 = c(0.01762376, -1.147739752, 1.073605848, 
2.000420899, 0.01762376, 0.944438811, 2.000420899, 0.01762376, 
-1.147739752, -1.147739752), x2 = c(0.536193126, 0.885609849, 
-0.944699546, -2.242627057, -1.809984553, 1.834120637, 0.885609849, 
0.96883563, 0.186776403, -0.678508604), x3 = c(0.64707104, -0.603759684, 
-0.603759684, -0.603759684, -0.603759684, 0.64707104, -0.603759684, 
-0.603759684, -0.603759684, 1.617857394), x4 = c(-0.72712328, 
0.72730861, 0.72730861, -0.72712328, -0.72712328, 0.72730861, 
0.72730861, -0.72712328, -0.72712328, -0.72712328), cluster_method_1 = structure(c(1L, 
3L, 3L, 3L, 2L, 2L, 3L, 2L, 1L, 4L), .Label = c("1", "2", "4", 
"6"), class = "factor"), cluster_method_2 = structure(c(5L, 3L, 
1L, 3L, 4L, 2L, 1L, 1L, 1L, 6L), .Label = c("1", "2", "3", "4", 
"5", "6"), class = "factor")), row.names = c(NA, -10L), class = c("tbl_df", 
"tbl", "data.frame"))



        x1     x2     x3     x4 cluster_method_1 cluster_method_2
     <dbl>  <dbl>  <dbl>  <dbl> <fct>            <fct>           
 1  0.0176  0.536  0.647 -0.727 1                5               
 2 -1.15    0.886 -0.604  0.727 4                3               
 3  1.07   -0.945 -0.604  0.727 4                1               
 4  2.00   -2.24  -0.604 -0.727 4                3               
 5  0.0176 -1.81  -0.604 -0.727 2                4               
 6  0.944   1.83   0.647  0.727 2                2               
 7  2.00    0.886 -0.604  0.727 4                1               
 8  0.0176  0.969 -0.604 -0.727 2                1               
 9 -1.15    0.187 -0.604 -0.727 1                1               
10 -1.15   -0.679  1.62  -0.727 6                6

[Comments]:

So you want to compute all pairwise distances between the clusters?
Correct, and efficiently!

Tags: r cluster-analysis

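[Solution 1]:

Consider the package clValid. It computes a large number of indices for validating clusterings. The Dunn index is particularly well suited to what you are trying to do. The documentation says that the Dunn index is the ratio of the smallest distance between observations not in the same cluster to the largest intra-cluster distance. The documentation can be found at https://cran.r-project.org/web/packages/clValid/clValid.pdf.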
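To make the definition above concrete, here is a minimal base-R sketch of the ratio it describes. It is not clValid's implementation: the function name `dunn_index` and the toy data are made up for illustration, and Euclidean distance is assumed.

```r
# Dunn index as described above:
# smallest distance between points in different clusters,
# divided by the largest distance between points in the same cluster.
# `dunn_index` is a hypothetical helper, not part of clValid.
dunn_index <- function(X, cluster) {
  cluster <- as.character(cluster)          # also handles factor labels
  d <- as.matrix(dist(X))                   # n x n Euclidean distance matrix
  same <- outer(cluster, cluster, "==")     # TRUE where both points share a cluster
  min_between <- min(d[!same])              # closest pair from different clusters
  max_within  <- max(d[same])               # widest pair inside any cluster
  min_between / max_within
}

# Toy data (made up): two groups on a line
X  <- matrix(c(0, 1, 5, 6), ncol = 1)
cl <- c(1, 1, 2, 2)
dunn_index(X, cl)
# [1] 4
```

Larger values mean better separation relative to cluster diameter. In practice the package's own `dunn()` function is probably the better choice; the sketch only spells out what is being measured.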

[Comments]:

[Solution 2]:

The total sum of squares, sum_x sum_y ||x-y||², is a constant.

The total sum of squares can be computed trivially from the variance.
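As a quick check of that statement, the sketch below compares the brute-force pairwise sum with the value obtained from the column variances. It assumes Euclidean distance and a sum over all ordered pairs; the toy matrix is made up for illustration.

```r
# Toy data (made up): 4 points in one dimension
X <- matrix(c(0, 1, 5, 6), ncol = 1)
n <- nrow(X)

# Brute force: squared distances summed over all ordered pairs, O(n^2)
pairwise_total <- sum(as.matrix(dist(X))^2)

# Same quantity from the column variances, O(n):
# sum_x sum_y ||x - y||^2 = 2 * n * (n - 1) * (sum of column variances)
variance_total <- 2 * n * (n - 1) * sum(apply(X, 2, var))

c(pairwise_total, variance_total)
# [1] 208 208
```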

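If you now subtract the within-cluster sum of squares, where x and y belong to the same cluster, what remains is the between-cluster sum of squares.

If you take this approach, it needs O(n) time rather than O(n²).

Corollary: the solution with the smallest WCSS has the largest BCSS.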
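The subtraction can be sketched in base R as follows. This is only an illustration under stated assumptions: `ss_decomposition` is a hypothetical helper name, distances are Euclidean, and `between` is the usual centroid-based BCSS (total minus WCSS). The extra `between_pairs` entry is the sum of squared distances over unordered pairs of points in different clusters, obtained from the identity that the pairwise sum inside a group of m points equals m times its sum of squares.

```r
# Hypothetical helper: total / within / between sums of squares without
# ever building the n x n distance matrix.
ss_decomposition <- function(X, cluster) {
  X <- as.matrix(X)
  # Sum of squared deviations from the column means of a block of rows
  ss <- function(M) sum(sweep(M, 2, colMeans(M))^2)
  groups   <- split(seq_len(nrow(X)), as.character(cluster))
  n_k      <- lengths(groups)                       # cluster sizes
  within_k <- sapply(groups, function(idx) ss(X[idx, , drop = FALSE]))
  total    <- ss(X)
  c(total   = total,
    within  = sum(within_k),
    between = total - sum(within_k),                # centroid-based BCSS
    # squared distances over unordered pairs in different clusters
    between_pairs = nrow(X) * total - sum(n_k * within_k))
}

# Toy data (made up): two groups on a line
X <- matrix(c(0, 1, 5, 6), ncol = 1)
ss_decomposition(X, c(1, 1, 2, 2))
# total 26, within 1, between 25, between_pairs 102
```

Note that `between_pairs` weights each cluster's within sum of squares by the cluster size, so it need not rank two partitions in the same order as the centroid-based `between` value does.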

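[Comments]:

[Solution 3]:

The within sum of squares of a cluster S_i can be written as the sum of all pairwise squared (Euclidean) distances, with each pair counted in both orders, divided by twice the number of points in that cluster (see e.g. the Wikipedia article on k-means clustering).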
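For readers who want to see why this holds, here is a short derivation. The symbol \mu_i for the centroid of S_i is notation assumed here, not taken from the answer, and the double sum runs over all ordered pairs.

```latex
\begin{aligned}
\sum_{x, y \in S_i} \lVert x - y \rVert^2
  &= \sum_{x, y \in S_i} \lVert (x - \mu_i) - (y - \mu_i) \rVert^2 \\
  &= 2\,|S_i| \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
     - 2 \Bigl\langle \sum_{x \in S_i} (x - \mu_i),\; \sum_{y \in S_i} (y - \mu_i) \Bigr\rangle \\
  &= 2\,|S_i| \sum_{x \in S_i} \lVert x - \mu_i \rVert^2
\end{aligned}
```

The cross term vanishes because the deviations from the centroid sum to zero. Dividing both sides by 2|S_i| gives the within sum of squares in terms of pairwise distances.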

For convenience, we define a function calc_SS that returns the within sum of squares of a (numeric) data.frame:

calc_SS <- function(df) sum(as.matrix(dist(df)^2)) / (2 * nrow(df))

Calculating the within(-cluster) sum of squares for every cluster of each method is then straightforward:

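library(tidyverse)
# Stack the two label columns, then nest the numeric columns per (method, cluster)
df %>%
    gather(method, cluster, cluster_method_1, cluster_method_2) %>%
    group_by(method, cluster) %>%
    nest() %>%
    # Within SS of each nested block of points
    transmute(
        method,
        cluster,
        within_SS = map_dbl(data, ~calc_SS(.x))) %>%
    # One column per clustering method
    spread(method, within_SS)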
## A tibble: 6 x 3
#  cluster cluster_method_1 cluster_method_2
#  <chr>              <dbl>            <dbl>
#1 1                   1.52             9.99
#2 2                  10.3              0
#3 3                  NA               10.9
#4 4                  15.2              0
#5 5                  NA                0
#6 6                   0                0

The total within sum of squares is then simply the sum of the within sums of squares of the individual clusters:

# Same per-cluster computation as above, then summed per method
df %>%
    gather(method, cluster, cluster_method_1, cluster_method_2) %>%
    group_by(method, cluster) %>%
    nest() %>%
    transmute(
        method,
        cluster,
        within_SS = map_dbl(data, ~calc_SS(.x))) %>%
    group_by(method) %>%
    summarise(total_within_SS = sum(within_SS)) %>%
    spread(method, total_within_SS)
## A tibble: 1 x 2
#  cluster_method_1 cluster_method_2
#             <dbl>            <dbl>
#1             27.0             20.9

By the way, we can confirm with the iris dataset that calc_SS does indeed return the within sum of squares:

set.seed(2018)
df2 <- iris[, 1:4]
# Store the fit as `km` so the kmeans() function itself is not masked
km <- kmeans(as.matrix(df2), 3)
df2$cluster <- km$cluster

df2 %>%
    group_by(cluster) %>%
    nest() %>%
    mutate(within_SS = map_dbl(data, ~calc_SS(.x))) %>%
    arrange(cluster)
## A tibble: 3 x 3
#  cluster data              within_SS
#    <int> <list>                <dbl>
#1       1 <tibble [38 × 4]>      23.9
#2       2 <tibble [62 × 4]>      39.8
#3       3 <tibble [50 × 4]>      15.2

# Per-cluster within SS as reported by kmeans()
km$withinss
#[1] 23.87947 39.82097 15.15100

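[Comments]:

Very nice, thank you. However, I am not looking for the within SS. I am looking for the SS between all points of clusters i and j for every i != j. So basically the SS across different clusters, because I need to measure separation.
I think you have misunderstood (or there is some confusion about the within SS): I do calculate the within SS of the different clusters. These are the values in the two columns cluster_method_1 and cluster_method_2. I also give the total within SS, which is just the sum of the individual within SS. Are you perhaps confusing the within SS with the between SS?
After reading your post more carefully, I noticed that you want to "calculate the distance between each point of cluster 1 and all points of cluster 2, 3, 4". That seems to me an odd metric for comparing the "performance" of two clustering methods. It is odd because it is neither the within SS nor the between SS. For comparison, in k-means clustering you minimise the within SS. Can you elaborate on what you are really trying to do?
Sure! Basically, I applied two different clustering methods (hierarchical k-means and SOM) to the same dataset. Now, we know that clustering generally has two goals: i) subjects in the same cluster need to be as similar as possible, and ii) subjects in different clusters need to be as different as possible. For the specific project I am working on, (ii) is more important than (i). So I want to measure how far apart my clusters are from each other, and not just the centroid distances; I want to measure the distances between the subjects that "live" in different clusters.
I am not familiar with hierarchical k-means clustering, but in standard k-means clustering the objective function is exactly the within SS. It is easy to show that minimising the within SS corresponds to maximising the between SS of the centroids. In that case I am not sure what you mean by "ii) subjects in different clusters need to be as different as possible". Different in what respect, and compared to what? [...]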