How to set a minimum number of observations per cluster in k-means clustering? Answers

【Question title】: How to set a minimum number of observations per cluster in k-means clustering?
【Posted】: 2019-09-19 16:11:18
【Question】:

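I am trying to cluster some products based on user behavior. I end up with clusters that contain very different numbers of observations.

I checked the k-means clustering parameters but could not find one that controls the minimum (or maximum) number of observations per cluster.

For example, here is how the observations are distributed across the clusters.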

cluster_id   num_observations
0   6
1   4
2   1
3   3
4   29
5   5

Any help on how to handle this? Are there other clustering algorithms that address this problem?

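【Comments】:

How did you compute the clusters? That follows from the definition of knn, but if you cap the number of observations each group can hold, your results will be biased and probably incorrect, especially if you plan to use the model on real data.
This may be a good sign that you should choose fewer clusters for KMeans!
I don't know why you would want to do this, and if you do, it is no longer k-means clustering, but here is an idea: run k-means clustering, then, for each cluster below the minimum size, find the nearest neighbor to the cluster center that is not in the cluster and move it there. Repeat. I don't know how to interpret what the result really means, though.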
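The idea in the last comment can be sketched roughly as follows. This is only an illustration under stated assumptions, not an established algorithm: `enforce_min_size` is a hypothetical helper, the cluster centers are held fixed while points are moved, and a point is only taken from a cluster that currently has more than `min_size` members, so that fixing one cluster does not break another. It uses only the standard library and expects the labels and centers from an earlier k-means run.

```python
from collections import Counter
from math import dist


def enforce_min_size(points, labels, centers, min_size):
    """Reassign points until every cluster has at least min_size members.

    Hypothetical helper illustrating the heuristic from the comment;
    it is not part of scikit-learn or any other library.
    """
    n_clusters = len(centers)
    if len(points) < n_clusters * min_size:
        raise ValueError("not enough points to give every cluster min_size members")

    labels = list(labels)   # work on a copy
    sizes = Counter(labels)  # current number of members per cluster

    for k in range(n_clusters):
        while sizes[k] < min_size:
            # Candidates: points outside cluster k whose own cluster can spare one
            # (assumption: a donor cluster may not drop below min_size itself).
            candidates = [
                i for i, lab in enumerate(labels)
                if lab != k and sizes[lab] > min_size
            ]
            # Move the candidate closest to the center of the undersized cluster.
            nearest = min(candidates, key=lambda i: dist(points[i], centers[k]))
            sizes[labels[nearest]] -= 1
            labels[nearest] = k
            sizes[k] += 1
    return labels


# Toy data: five points near the origin in cluster 0, one far point in cluster 1.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 2), (10, 10)]
labels = [0, 0, 0, 0, 0, 1]
centers = [(0.8, 0.8), (10, 10)]

new_labels = enforce_min_size(points, labels, centers, min_size=2)
print(new_labels)  # [0, 0, 0, 0, 1, 1]
```

As the commenter notes, the result is no longer a k-means solution: the moved points are not assigned to their nearest center, so the clusters should be interpreted with care.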

Tags: pandas machine-learning scikit-learn data-science k-means

【Solution 1】:

For anyone still looking for an answer: I found a good module or this module that handles this kind of problem.

Install it with pip install size-constrained-clustering or pip install git+https://github.com/jingw2/size_constrained_clustering.git, then use MinMaxKMeansMinCostFlow, which lets you choose size_min and size_max.

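import numpy as np
from size_constrained_clustering import minmax

n_samples = 2000
n_clusters = 3
X = np.random.rand(n_samples, 2)  # random 2-D points
# Each cluster must contain between 400 and 800 points.
model = minmax.MinMaxKMeansMinCostFlow(n_clusters, size_min=400, size_max=800)
model.fit(X)
centers = model.cluster_centers_
labels = model.labels_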
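To confirm that the size constraints actually hold, it can help to count the members of each cluster after fitting. The sketch below is a minimal check using only the standard library. `cluster_sizes_within` is a hypothetical helper name, and the label list is made-up example data standing in for the fitted `labels`.

```python
from collections import Counter


def cluster_sizes_within(labels, n_clusters, size_min, size_max):
    """Return per-cluster sizes and whether all of them lie in [size_min, size_max]."""
    counts = Counter(int(label) for label in labels)
    # Clusters that received no points count as size 0.
    sizes = [counts.get(k, 0) for k in range(n_clusters)]
    return sizes, all(size_min <= s <= size_max for s in sizes)


# Made-up labels for illustration; in practice pass the fitted labels.
sizes, ok = cluster_sizes_within([0, 0, 1, 2, 2, 2], n_clusters=3, size_min=1, size_max=3)
print(sizes, ok)  # [2, 1, 3] True
```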

【Comments】:

【Solution 2】:

This can be solved with the k-means-constrained library, installable with pip. Check here.

Example:

>>> from k_means_constrained import KMeansConstrained
>>> import numpy as np
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...                [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrained(
...     n_clusters=2,
...     size_min=2,
...     size_max=5,
...     random_state=0
... )
>>> clf.fit_predict(X)
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])
>>> clf.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)

【Comments】: