R Unsupervised Clustering by group (?): Answers

【Question title】: R Unsupervised Clustering by group (?)
【Posted】: 2019-10-22 16:00:21
【Question description】:

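My main goal is to find groups in which many points lie on the same line. My idea was to do this with the help of kmeans, but maybe you have a better idea.

I will explain it using the two plots below (each plot describes one group):

Plot 1 for Group 1:

We can see that many points lie at almost the same y value --> I am trying to figure out how to find the groups that show this kind of "point distribution".

Below is plot 2 for Group 2, which does not show this "point distribution".

Here is the data corresponding to the two plots above: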

structure(list(Group = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1), 
    x = c(100L, 150L, 250L, 287L, 312L, 387L, 475L, 550L, 837L, 
    937L, 987L, 1087L, 1175L, 1300L, 1325L, 1487L, 1662L, 1700L, 
    1725L, 1812L, 1912L, 2412L, 3012L, 3562L, 4162L, 4762L, 5362L, 
    5750L, 5712L, 6225L, 6825L, 6887L, 7237L, 7850L, 7800L, 7937L, 
    7975L, 8275L, 8362L, 8662L, 8725L, 8950L, 9100L, 9312L, 9400L, 
    9600L, 550L, 612L, 1962L, 5412L, 8425L, 9375L, 5412L), y = c(493L, 
    482L, 479L, 476L, 481L, 479L, 474L, 480L, 480L, 491L, 489L, 
    490L, 485L, 485L, 485L, 479L, 482L, 482L, 482L, 482L, 484L, 
    489L, 491L, 489L, 496L, 498L, 500L, 0L, 498L, 500L, 502L, 
    506L, 497L, 0L, 495L, 506L, 497L, 494L, 498L, 500L, 496L, 
    499L, 496L, 495L, 495L, 498L, 442L, 447L, 394L, 465L, 806L, 
    700L, 502L)), row.names = c(23L, 24L, 25L, 26L, 27L, 28L, 
29L, 30L, 31L, 32L, 33L, 34L, 35L, 36L, 37L, 38L, 39L, 40L, 41L, 
42L, 43L, 44L, 45L, 46L, 47L, 48L, 49L, 51L, 52L, 53L, 54L, 55L, 
56L, 57L, 58L, 59L, 60L, 61L, 62L, 63L, 64L, 65L, 66L, 67L, 68L, 
69L, 574L, 575L, 576L, 577L, 578L, 579L, 815L), class = "data.frame")

Short explanation:

Group   x   y
1 100 493
1 150 482
1 250 479
1 287 476
1 312 481
1 387 479

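Each row holds the group (1 or 2) and the x and y coordinates of one point.

My current approach:

I used this code to round the y values to the nearest multiple of 20: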

    round_any <- function(x, accuracy, f = round) { f(x / accuracy) * accuracy } # round x to a multiple of accuracy
    data$y_rd <- round_any(data$y, 20)

I did this because the points usually do not lie on exactly the same y line.
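As a quick illustration of what this rounding does, here is a minimal sketch on a few y values copied from the data above. It also shows one side effect of fixed bins that is worth keeping in mind: two values only 2 units apart can land in different bins when they straddle a bin boundary.

```r
# Same helper as in the question: round x to the nearest multiple of accuracy
round_any <- function(x, accuracy, f = round) f(x / accuracy) * accuracy

# A few y values taken from the question's data
y <- c(493, 482, 479, 476, 0, 442, 806)
rounded <- round_any(y, 20)
print(rounded)

# 489 and 491 are close together but fall on different sides of the 490 boundary
boundary <- round_any(c(489, 491), 20)
print(boundary)
```

If this boundary effect matters for your data, a bin-free grouping (for example, splitting on gaps between sorted y values) may be worth considering instead of fixed rounding.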

In addition, I used this code to create clusters for each group, based on the x coordinates within each y_rd (rounded y coordinate):

    data$id <- paste(data$Group, data$y_rd, sep = "_") # create id that contains Group and y_rd values
    res2 <- tapply(data$x, INDEX = data$id, function(x) kmeans(x,2)) # kmeans with fixed number of clusters    
    res3 <- lapply(names(res2), function(x) data.frame(y=x, Centers=res2[[x]]$centers, Size=res2[[x]]$size))     
    res3 <- do.call(rbind, res3)

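But this does not do what I need, because I cannot define a fixed number of clusters for every Group and y_rd...

At this point I am stuck and do not know what approach I could take to find the groups with this kind of distribution.

The result I would like to get: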

Group Cluster MaxPoints
1      1         3
1      2         20
1      3         7

I am open to any idea or hint that could help me find the groups that show this kind of pattern. Thanks!

【Question discussion】:

Tags: r algorithm k-means

【Solution 1】:

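Some points of your question are not clear to me, so here is an answer that may serve as a starting point.

Since the most important variable seems to be y, you can study it within each group and then apply k-means to the "winning" groups.

First, you can try to detect the groups that may have a "line" distribution by looking at some boxplots or histograms: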

library(ggplot2)  # dats is the question's data frame, including the y_rd column
ggplot(dats, aes(y_rd)) + geom_histogram() + facet_wrap(vars(Group)) + theme_light()

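It now looks as if one group (1) has a long line plus a smaller cluster, and the other group (2) has many small clusters. So in this case you can divide your data into groups that have a long line and some number of smaller clusters (one, for group 1) and groups that have many "small clusters" but no long line (2). The idea is to sort your 100 groups into categories such as "no long line", "long line and 1 small cluster", "long line and 2 small clusters", and so on. With these categories you can split the data set and run the clustering on each part. Here we discard the second group and use k-means with 2 centers on the first group, since it seems to have a long line and one other small cluster.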

vec <- c(1)  # vector of groups that seem to have a long line

# Loop over the selected groups. The number of clusters is fixed at two here;
# after looking at the histograms you can write one such loop per family of
# similar distributions.
listed <- list()
for (i in vec) {
  grp <- dats[dats$Group == i, ]                            # rows of the current group
  clustering <- kmeans(grp$y_rd, 2)                         # cluster on the rounded y (column 4)
  listed[[i]] <- data.frame(grp, cl = clustering$cluster)   # keep x and y for plotting
}

Now you can plot it:

library(ggplot2)
ggplot(listed[[1]], aes(x,y, color = as.factor(cl))) + geom_point() + theme_light()
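
With many groups, picking the entries of vec by eye from histograms gets tedious. The following is a minimal sketch of one way to screen groups automatically, not part of the original answer: for each group it finds the fullest y_rd bin and flags the group when that bin holds a large share of the points. The toy values are copied from the question's data, and the threshold `min_share` is a hypothetical choice you would need to tune.

```r
# Toy subset of the question's data (y values copied from the dput output)
dats <- data.frame(
  Group = c(rep(1, 7), rep(2, 6)),
  y     = c(482, 479, 476, 481, 479, 0, 498,
            442, 447, 394, 465, 806, 700)
)
dats$y_rd <- round(dats$y / 20) * 20   # same rounding as in the question

# Contingency table: rows are groups, columns are y_rd bins
tab <- table(dats$Group, dats$y_rd)

# Per group: size of the fullest bin and its share of the group's points
screen <- data.frame(
  Group     = rownames(tab),
  MaxPoints = apply(tab, 1, max),
  Share     = apply(tab, 1, max) / rowSums(tab)
)
print(screen)

# Hypothetical threshold: call it a "long line" when one bin holds at least
# half of the group's points
min_share <- 0.5
vec <- as.numeric(screen$Group[screen$Share >= min_share])
print(vec)
```

The resulting `vec` can be used in place of the hand-written one in the loop above. A share-based rule is only one possible criterion; an absolute minimum count per bin may suit groups of very different sizes better.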

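【Discussion】:

This already looks pretty good, but my original data set has around 100 groups, and I have to find out which of them have such a long "line" of points along the x axis. And because there are 100 groups, setting the number of clusters to 3 for all of them will not work.
I have tried to manage the complexity of your problem to some extent; this is how I would proceed, and I hope it helps.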
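
One possible way around fixing the number of clusters per group, sketched here as an assumption and not as the answerer's method, is to skip k-means and split each group's sorted y values wherever the gap to the next value exceeds a tolerance. The number of clusters then follows from the data. The helper `cluster_by_gap` and the tolerance `tol` are hypothetical names and values introduced for illustration; the toy y values are copied from the question's data.

```r
# Toy subset of the question's data (y values copied from the dput output)
dats <- data.frame(
  Group = c(rep(1, 7), rep(2, 6)),
  y     = c(482, 479, 476, 481, 479, 0, 498,
            442, 447, 394, 465, 806, 700)
)

tol <- 10  # hypothetical: largest y gap allowed between neighbours on one "line"

# Assign cluster ids by walking through the sorted values and starting a new
# cluster whenever the gap to the previous value is larger than tol
cluster_by_gap <- function(y, tol) {
  ord <- order(y)
  id_sorted <- cumsum(c(TRUE, diff(y[ord]) > tol))
  id <- integer(length(y))
  id[ord] <- id_sorted        # map ids back to the original row order
  id
}

# Apply the helper within each group
dats$Cluster <- ave(dats$y, dats$Group, FUN = function(y) cluster_by_gap(y, tol))

# Count the points per (Group, Cluster)
res <- aggregate(y ~ Group + Cluster, data = dats, FUN = length)
names(res)[3] <- "Points"
res <- res[order(res$Group, res$Cluster), ]
print(res)
```

A group with a long line then shows up as one cluster with a large `Points` value, which is close to the Group / Cluster / MaxPoints table asked for in the question. Note that gap-based splitting chains neighbouring points together, so a slowly drifting line is kept as one cluster, while a single outlier such as y = 0 becomes its own cluster.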