[Posted]: 2018-05-28 21:36:49
[Question]:
I tried to implement the Kernel K-Means algorithm by hand in R, but my loop takes more than 30 minutes to run. The code is below:
# Calculating kernel k-means (uses %>% and mutate() from dplyr)
library(dplyr)
rbfkmeans<-function(data,c,q=0.02,L=0.7){
#associating random classifications to each observation
iter=0
data<-data%>%
mutate(cluster=sample(1:c,nrow(data),replace=T))
mini=rep(1,nrow(data))
## Euclidean distance
# Remember:
#1.|| a || = sqrt(aDOTa),
#2. d(x,y) = || x - y || = sqrt((x-y)DOT(x-y))
#3. aDOTb = sum(a*b)
d<-function(x,y){
aux=x-y
dis=sqrt(sum(aux*aux))
return(dis)
}
##Radial Basis Function Kernel
# Remember :
# 1. K(x,x') = exp(-q||x-x'||^2), where ||x-x'|| is the
# Euclidean distance and 'q' is the gamma parameter
rbf<-function(x,y,q=0.2){
aux<-d(x,y)
rbfd<-exp(-q*(aux)^2)
return(rbfd)
}
#
#calculating the kernel matrix
kernelmatrix=matrix(0,nrow(data),nrow(data))
for(i in 1:nrow(data)){
for(j in 1:nrow(data)){
kernelmatrix[i,j]=rbf(data[i,1:(ncol(data)-1)],data[j,1:(ncol(data)-1)],q)
}
}
r=rep(0,nrow(data))
distance=matrix(0,nrow(data),c)
while( (sum(r==data[,'cluster'])!=nrow(data)) && iter <30 ){
ans=0
#Calculating the distances in the kernelized version (RBF example)
print('running')
third=rep(0,c)#here third holds the within-cluster (center) terms,
#as they do not depend on each observation.
for(g in 1:c){
ans=0
for(k in 1:nrow(data)){
for(l in 1:nrow(data)){
ans = ans + (data[k,'cluster']==g)*(data[l,'cluster']==g)*kernelmatrix[k,l]
}
}
third[g]=ans
}
for (ii in 1:nrow(data)){ #for (ii in 1:nrow(data))
for(j in 1:c) { #for(j in 1:c)
distance[ii,j]= kernelmatrix[ii,ii]-2*sum((data[,'cluster']==j)*kernelmatrix[ii,])/sum(data[,'cluster']==j)+third[j]/(sum(data[,'cluster']==j)^2)
}
}
r=data[,'cluster']
#Checking the shortest distance
for(k in 1:nrow(data)){
data[k,'cluster']=match(min(distance[k,]),distance[k,])
mini[k]=min(distance[k,])
}
plot(data[1:(ncol(data)-1)], col=data$cluster)
iter=iter+1
print(paste('Iteration number:',iter))
print(paste('Mean of min. distances:',mean(mini)))
#print(g==data$'cluster')
}
return(data)
}
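Does anyone know how I can optimize this? The main problem is the calculation of the #third term; I suspect that checking (data[k,'cluster']==g) inside the loop wastes too much time, but I have no more ideas on how to improve it...
OBS: data[k,'cluster']==g checks whether an observation belongs to the cluster.
Edit: the part of the code that takes a long time to run: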
for(g in 1:c){
ans=0
for(k in 1:nrow(data)){
for(l in 1:nrow(data)){
ans = ans + (data[k,'cluster']==g)*(data[l,'cluster']==g)*kernelmatrix[k,l]
}
}
third[g]=ans
}
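One possible way to avoid this triple loop, sketched here under assumptions rather than taken from the original post: build a 0/1 indicator matrix of cluster membership and let matrix multiplication do the summing. The helper name `third_term` and its argument `cluster` (a plain integer vector of labels, e.g. `data$cluster`) are hypothetical.

```r
# Hypothetical helper (not from the original post): sums kernelmatrix entries
# within each cluster without explicit R loops.
third_term <- function(kernelmatrix, cluster, c) {
  # Z[i, g] is 1 when observation i belongs to cluster g, else 0
  Z <- outer(cluster, seq_len(c), "==") * 1
  # colSums(Z * (K %*% Z))[g] = sum over k, l in cluster g of K[k, l]
  colSums(Z * (kernelmatrix %*% Z))
}

# Small comparison against the original triple loop on made-up data
set.seed(1)
n <- 50
c <- 3
X <- matrix(rnorm(n * 2), n, 2)
K <- exp(-0.02 * as.matrix(dist(X))^2)
cluster <- sample(1:c, n, replace = TRUE)

loop_third <- rep(0, c)
for (g in 1:c) {
  ans <- 0
  for (k in 1:n) {
    for (l in 1:n) {
      ans <- ans + (cluster[k] == g) * (cluster[l] == g) * K[k, l]
    }
  }
  loop_third[g] <- ans
}

fast_third <- third_term(K, cluster, c)
print(all.equal(loop_third, fast_third))
```

Inside the while loop, `third <- third_term(kernelmatrix, data$cluster, c)` would then stand in for the whole `for(g in 1:c)` block. Indexing a data frame cell by cell, as in `data[k,'cluster']`, is also costly in R, so pulling the labels into a vector once per iteration should help on its own.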
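[Comments]:
-
Hi Mateus, a reproducible example of your code (see here) would help us answer you.
-
The R interpreter is very slow. Rewrite the critical parts of the code in a C library. Avoid interpreter loops of any kind. Try to use vectorized operations, because the loops hidden inside them usually run in C or Fortran, which is much faster than R loops.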
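As an illustration of the vectorized approach suggested above, here is a minimal sketch of the kernel matrix and the point-to-cluster distances written with matrix operations. The function names `rbf_kernel_matrix` and `kernel_distances` are hypothetical, and `X` is assumed to be a numeric matrix holding only the feature columns (for the data frame in the question, something like `as.matrix(data[, -ncol(data)])`).

```r
# Hypothetical helpers; X is assumed to be a numeric feature matrix.
rbf_kernel_matrix <- function(X, q = 0.02) {
  # dist() computes pairwise Euclidean distances in compiled code
  exp(-q * as.matrix(dist(X))^2)
}

kernel_distances <- function(K, cluster, c) {
  Z <- outer(cluster, seq_len(c), "==") * 1  # n x c membership indicators
  sizes <- colSums(Z)                        # number of points per cluster
  KZ <- K %*% Z                              # KZ[i, j] = sum of K[i, l] over l in cluster j
  third <- colSums(Z * KZ)                   # within-cluster kernel sums
  # distance[i, j] = K[i, i] - 2 * KZ[i, j] / n_j + third[j] / n_j^2
  # (an empty cluster gives NaN here, as in the original loop)
  diag(K) - 2 * sweep(KZ, 2, sizes, "/") +
    rep(third / sizes^2, each = nrow(K))
}

# Usage on made-up data
set.seed(2)
X <- matrix(rnorm(40 * 2), 40, 2)
cluster <- sample(1:3, 40, replace = TRUE)
K <- rbf_kernel_matrix(X, q = 0.02)
D <- kernel_distances(K, cluster, 3)
# Index of the smallest distance in each row (first one on ties)
new_cluster <- max.col(-D, ties.method = "first")
```

The formula in `kernel_distances` is the same one used for `distance[ii,j]` in the question, only computed for all rows and clusters at once.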
Tags: r loops for-loop optimization cluster-analysis