Cluster assignment after estimation in a large dataset (Mclust): answers

[Question title]: Cluster assignment after estimate in a large dataset (Mclust)
[Posted]: 2016-12-06 20:01:01
[Question]:

I have been running a cluster analysis on a relatively large dataset (about 50,000 observations and 16 variables).

library(mclust)
load(file="mdper.f.Rdata")#mdper.f = My stored data

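Because my computer cannot handle the full dataset, I drew several subsets (10 samples of 5,000 rows each; the example below uses 16,000 rows, which takes about 15 minutes to compute) and used Mclust to determine the optimal number of groups.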

ind <- sample(1:nrow(mdper.f), size = 16000) # random sample of 16,000 rows (about 15 min of computing)
nfac <- mdper.f[ind, ] # subset the sampled rows
Fnac <- scale(nfac) # scale the sample
mod <- Mclust(Fnac) # determine the optimal number of clusters
summary(mod) # summary

#Results:
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust VII (spherical, varying volume) model with 9 components:

log.likelihood     n df    BIC      ICL
   128118.2 16000 80 255462 254905.3

Clustering table:
   1    2    3    4    5    6    7    8    9 
1879 2505 3452 3117 2846  464  822  590  325

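The result is always 9 components (10 out of 10 of the 5,000-row samples), so I think that part is fine. Now I want to apply the estimated cluster partition to the rest of the data, so that every observation is assigned to one of the clusters.

How can I do this?

I started working with the Mclust object, but I don't know how to handle it and apply it to the rest of the data. The ideal result would be, for example, my original data with an extra column holding the assigned cluster number (1 to 9).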

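[Comments]:

Tags: r cluster-analysis sampling large-data

[Solution 1]:

After a few minutes of work I found the answer:

First, there was a conceptual error: the dataset must be scaled before it is split into samples, and only then can predict() be applied to it.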
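To see why the order matters, here is a minimal base R sketch. The matrix `full` is a hypothetical stand-in for the real data, not the author's dataset. It compares the same sampled rows scaled in the two possible orders, and also shows one possible alternative: reusing the sample's own centre and scale for the full data.

```r
set.seed(1)
# Hypothetical stand-in for the real data: 1,000 rows, 3 numeric variables
full <- matrix(rnorm(3000, mean = 10, sd = 4), ncol = 3)
ind <- sample(nrow(full), size = 200)

# Scale first, then sample: the sample uses the full data's centre and spread
full.s <- scale(full)
sample.a <- full.s[ind, ]

# Sample first, then scale: the sample gets its own centre and spread
sample.b <- scale(full[ind, ])

# The same rows end up with different values, so a model fitted to sample.b
# is not on the same scale as full.s
max(abs(sample.a - sample.b))

# Alternative: apply the sample's centre and scale to the full data
full.b <- scale(full,
                center = attr(sample.b, "scaled:center"),
                scale  = attr(sample.b, "scaled:scale"))
```

Either approach keeps the fitted model and the data passed to predict() on one common scale. The corrected workflow follows.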

library(mclust)
load(file = "mdper.f.Rdata") # mdper.f = my stored data

mdper.f.s <- scale(mdper.f) # scale the full dataset first
ind <- sample(1:nrow(mdper.f.s), size = 16000) # random sample of 16,000 rows
nfac <- mdper.f.s[ind, ] # subset the scaled data
mod16 <- Mclust(nfac) # determine the optimal number of clusters (about 15 min of computing with 7 variables)

prediction <- predict(mod16, mdper.f.s) # classify all rows with the fitted model and the scaled data
mdper.f.pred <- cbind(mdper.f, prediction$classification) # append the assignments to the original data
colnames(mdper.f.pred)[ncol(mdper.f.pred)] <- "Cluster" # name the new (last) column
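For readers without the original file, the same workflow can be tried on synthetic data. This is only a sketch: the data frame `toy` and its two simulated groups are assumptions made for illustration, and the `Uncertainty` column is an optional extra built from the posterior probabilities in `pred$z`, not something the answer above computes.

```r
library(mclust)

set.seed(42)
# Hypothetical stand-in for mdper.f: two well-separated groups, 2 variables
toy <- data.frame(
  x = c(rnorm(300, 0, 1), rnorm(300, 8, 1)),
  y = c(rnorm(300, 0, 1), rnorm(300, 8, 1))
)

toy.s <- scale(toy)                      # scale the full data once
ind <- sample(nrow(toy.s), size = 200)   # subsample used for fitting
mod <- Mclust(toy.s[ind, ])              # BIC selects the model and number of components

pred <- predict(mod, newdata = toy.s)    # classify every row, not only the sample
toy$Cluster <- pred$classification       # hard assignment, one label per row
toy$Uncertainty <- 1 - apply(pred$z, 1, max)  # 1 minus the largest posterior probability

table(toy$Cluster)
```

Rows with a high `Uncertainty` value sit between components, which may be worth checking before treating the labels as final.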

[Comments]: