Estimate number of clusters through EM in scikit-learn

【Question title】: Estimate number of clusters through EM in scikit-learn
【Posted】: 2018-08-29 09:29:52
【Question】:

I am trying to implement the method that EM in Weka uses to estimate the number of clusters, more precisely the following description:

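The cross-validation performed to determine the number of clusters is done in the following steps:

1. The number of clusters is set to 1.

2. The training set is split randomly into 10 folds.

3. EM is performed 10 times using the 10 folds the usual CV way.

4. The log-likelihood is averaged over all 10 results.

5. If the log-likelihood has increased, the number of clusters is increased by 1 and the program continues at step 2.

My current implementation is as follows: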

def estimate_n_clusters(X):
   "Find the best number of clusters through maximization of the log-likelihood from EM."
   last_log_likelihood = None
   kf = KFold(n_splits=10, shuffle=True)
   components = range(50)[1:]
   for n_components in components:
       gm = GaussianMixture(n_components=n_components)

       log_likelihood_list = []
       for train, test in kf.split(X):
           gm.fit(X[train, :])
           if not gm.converged_:
               raise Warning("GM not converged")
           log_likelihood = np.log(-gm.score_samples(X[test, :]))

           log_likelihood_list += log_likelihood.tolist()

       avg_log_likelihood = np.average(log_likelihood_list)

       if last_log_likelihood is None:
           last_log_likelihood = avg_log_likelihood
       elif avg_log_likelihood+10E-6 <= last_log_likelihood:
           return n_components
       last_log_likelihood = avg_log_likelihood

I get a similar number of clusters from Weka and from my function. However, when I use the number of clusters n_clusters estimated by the function:

gm = GaussianMixture(n_components=n_clusters).fit(X)
print(np.log(-gm.score(X)))

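the result is NaN, because -gm.score(X) is negative (around -2500), whereas Weka reports Log likelihood: 347.16447.

My guess is that the likelihood mentioned in step 4 of the Weka description is not the same quantity as the one returned by the function score_samples().

Can anyone tell me where I went wrong?

Thanks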

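【Comments on the question】:

Tags: python scikit-learn artificial-intelligence cluster-analysis weka

【Solution 1】:

According to the documentation, score returns the average log-likelihood. Evidently, you do not want to take the log of a log.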
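To make the point concrete, here is a minimal sketch on hypothetical toy data (the array X_demo and the two-cluster layout are assumptions for illustration, not the asker's data). It checks that score is the mean of score_samples, and shows that np.log of a negative number is nan, which is what happens whenever the average log-likelihood is positive and its sign is flipped before taking the log.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical toy data: two well-separated clusters in 2-D.
X_demo = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.normal(8.0, 1.0, size=(200, 2)),
])

gm = GaussianMixture(n_components=2, random_state=0).fit(X_demo)

per_sample = gm.score_samples(X_demo)  # log-likelihood of each sample
average = gm.score(X_demo)             # per-sample average log-likelihood

# score is the mean of score_samples, so both are already on the log scale.
print(np.isclose(average, per_sample.mean()))

# np.log is undefined for negative inputs, so a second log can produce nan.
with np.errstate(invalid="ignore"):
    log_of_negative = np.log(-2500.0)
print(log_of_negative)
```

Note that a log-likelihood of a continuous density can be positive or negative depending on the scale of the data, so its sign alone says nothing about the quality of the fit.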

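【Comments】:

Embarrassingly, that is true. What threw me off was that during the iterations I was getting avg_log_likelihood values in the millions, so I assumed that applying the log would at least bring them into the range of the final result, but apparently not. After simply removing that log, I get the same final likelihood that Weka found. Thanks!

【Solution 2】:

For future reference, the fixed function looks like this: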

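import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold


def estimate_n_clusters(X):
    "Find the best number of clusters through maximization of the log-likelihood from EM."
    last_log_likelihood = None
    kf = KFold(n_splits=10, shuffle=True)
    components = range(1, 50)
    for n_components in components:
        gm = GaussianMixture(n_components=n_components)

        log_likelihood_list = []
        for train, test in kf.split(X):
            gm.fit(X[train, :])
            if not gm.converged_:
                raise Warning("GM not converged")
            # score_samples already returns per-sample log-likelihoods
            # (higher is better), so no extra log or sign flip is applied.
            log_likelihood = gm.score_samples(X[test, :])

            log_likelihood_list += log_likelihood.tolist()

        # Average held-out log-likelihood over all samples of all 10 folds.
        avg_log_likelihood = np.average(log_likelihood_list)
        print(avg_log_likelihood)

        if last_log_likelihood is None:
            last_log_likelihood = avg_log_likelihood
        elif avg_log_likelihood+10E-6 <= last_log_likelihood:
            # The log-likelihood dropped: keep the previous number of clusters.
            return n_components-1
        last_log_likelihood = avg_log_likelihood

    # The log-likelihood never dropped within the tested range.
    return components[-1]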
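
The same cross-validation loop can probably be written more compactly with cross_val_score, which calls the estimator's own score method on each held-out fold when no scoring argument is given. The sketch below is an illustration under assumptions, not the Weka implementation: the function name estimate_n_clusters_cv, its parameters (max_components, n_splits, min_gain, seed) and the toy data X_demo are all hypothetical. It averages the per-fold mean log-likelihoods, which matches the per-sample average only when the folds have equal size.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold, cross_val_score


def estimate_n_clusters_cv(X, max_components=10, n_splits=10, min_gain=1e-5, seed=0):
    """Add components until the held-out log-likelihood stops improving."""
    # A fixed random_state gives every candidate the same folds.
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    best_score = -np.inf
    for n_components in range(1, max_components + 1):
        gm = GaussianMixture(n_components=n_components, random_state=seed)
        # Without a scoring argument, cross_val_score uses gm.score, i.e. the
        # average log-likelihood per sample of each held-out fold.
        score = cross_val_score(gm, X, cv=kf).mean()
        if score <= best_score + min_gain:
            # No meaningful gain: keep the previous number of components.
            return n_components - 1
        best_score = score
    return max_components


# Hypothetical toy data: three well-separated clusters in 2-D.
rng = np.random.default_rng(0)
centers = np.array([[-10.0, -10.0], [0.0, 0.0], [10.0, 10.0]])
X_demo = np.vstack([c + rng.normal(size=(100, 2)) for c in centers])

n_found = estimate_n_clusters_cv(X_demo, max_components=6)
print(n_found)
```

Because the stopping rule compares consecutive candidates only, the result depends on the fold assignment and on the EM initialisation, so fixing the seed is what makes runs comparable.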

【Comments】: