How to compute the total sum of squared error in k-means clustering in MATLAB? Answers

【Question title】: How to compute the total sum of squared error in k-means clustering in MATLAB?
【Posted】: 2018-05-04 09:29:14
【Question】:

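I am implementing the k-means algorithm for a given 4-dimensional data set with k = # clusters, and I run it about 5 times with different initial points.

How can I compute the total sum of squared error (SSE) after each run?

The data has 4 dimensions (columns 1 to 4), as shown below: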
x1	1	2	3	4
x2	5	6	7	8
x3	9	10	11	12
x4	13	14	15	16
x5	17	18	19	20

I would be very glad if anyone could help me. Thanks.

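【Comments】:

Are you having trouble using the built-in kmeans function, or are you building it from scratch?
@LeanderMoesinger Thanks for your comment. Actually I can use the built-in kmeans function, but from the example in the MATLAB help I don't understand how I should compute the cluster means, the centers, the cluster sizes, and the list of data points assigned to each cluster.

Tags: algorithm matlab image-processing artificial-intelligence k-means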

【Solution 1】:

The kmeans() function already gives you everything you want directly. For 3 clusters it has the following syntax:

[idx,CentreCoordinates,SEE] = kmeans(yourData,3);

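where

idx is the cluster label of each observation (1 to 3 in this case)
CentreCoordinates holds the coordinates of the cluster centres (one centre per row)
SEE is the within-cluster sum of distances from each observation to its nearest cluster centre, returned as one value per cluster (a k-by-1 vector). With the default 'sqeuclidean' distance these are squared Euclidean distances, so sum(SEE) is the total sum of squared error.

Since you don't actually need the indices, you can ignore the first output of the function with ~ (tilde):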

[~,CentreCoordinates,SEE] = kmeans(yourData,3);
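
As a rough sketch of how this could answer the question for several runs: the loop below calls kmeans five times on the sample data from the question and records the total SSE of each run as the sum of the third output. The value k = 3, the variable names (nRuns, totalSSE, clusterSizes) and the use of accumarray for the cluster sizes are illustrative assumptions, not part of the original answer. It assumes the Statistics and Machine Learning Toolbox is installed.

```matlab
% Sketch: total SSE for several k-means runs with different random starts.
% Assumes the Statistics and Machine Learning Toolbox (kmeans) is available.
X = [ 1  2  3  4;
      5  6  7  8;
      9 10 11 12;
     13 14 15 16;
     17 18 19 20];           % sample data from the question (5 points, 4-D)
k = 3;                       % illustrative choice; the question leaves k open
nRuns = 5;                   % number of runs with different initial points
totalSSE = zeros(nRuns, 1);  % one total SSE per run

for r = 1:nRuns
    % The default distance is 'sqeuclidean', so sumd holds, per cluster,
    % the sum of squared distances of its points to the cluster centre.
    [idx, C, sumd] = kmeans(X, k);
    totalSSE(r) = sum(sumd);                    % total SSE of this run
    clusterSizes = accumarray(idx, 1, [k 1]);   % number of points per cluster
    fprintf('Run %d: SSE = %g, cluster sizes = %s\n', ...
            r, totalSSE(r), mat2str(clusterSizes'));
end
```

The points assigned to cluster j can be listed with X(idx == j, :). If only the best of several runs is needed, kmeans also accepts the 'Replicates' option, e.g. kmeans(X, k, 'Replicates', 5), which returns the solution with the lowest total sum of distances.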

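【Comments】:

If you have trouble reading the .xls file, yourData = xlsread('path/to/file/filename.xls','B:E'); should do it. Written like this, it reads only columns 2:5, since column 1 is not needed.

【Solution 2】:

This code uses the built-in MATLAB function kmeans. You need to modify it to use your own k-means algorithm. It shows how to compute the cluster centroids and the sum of squared errors (also called the distortion).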

clc; close all; clear all; 
data = readtable('data.txt'); % Importing the data-set
d1 = table2array(data(:, 2)); % Data in first dimension 
d2 = table2array(data(:, 3)); % Data in second dimension
d3 = table2array(data(:, 4)); % Data in third dimension 
d4 = table2array(data(:, 5)); % Data in fourth dimension 
X = [d1, d2, d3, d4]; % Combining the data into a matrix
k = 3; % Number of clusters
idx = kmeans(X, k); % Applying k-means using the built-in function
%% Separating the data in different dimension
d1_1 = d1(idx == 1); % d1 for the data in cluster 1 
d2_1 = d2(idx == 1); % d2 for the data in cluster 1
d3_1 = d3(idx == 1); % d3 for the data in cluster 1
d4_1 = d4(idx == 1); % d4 for the data in cluster 1
%==============================
d1_2 = d1(idx == 2); % d1 for the data in cluster 2 
d2_2 = d2(idx == 2); % d2 for the data in cluster 2
d3_2 = d3(idx == 2); % d3 for the data in cluster 2
d4_2 = d4(idx == 2); % d4 for the data in cluster 2
%==============================
d1_3 = d1(idx == 3); % d1 for the data in cluster 3
d2_3 = d2(idx == 3); % d2 for the data in cluster 3
d3_3 = d3(idx == 3); % d3 for the data in cluster 3
d4_3 = d4(idx == 3); % d4 for the data in cluster 3
%% Finding the co-ordinates of the cluster centroids
c1_d1 = mean(d1_1); % d1 value of the centroid for cluster 1
c1_d2 = mean(d2_1); % d2 value of the centroid for cluster 1
c1_d3 = mean(d3_1); % d3 value of the centroid for cluster 1
c1_d4 = mean(d4_1); % d4 value of the centroid for cluster 1
%====================================
c2_d1 = mean(d1_2); % d1 value of the centroid for cluster 2
c2_d2 = mean(d2_2); % d2 value of the centroid for cluster 2
c2_d3 = mean(d3_2); % d3 value of the centroid for cluster 2
c2_d4 = mean(d4_2); % d4 value of the centroid for cluster 2
%====================================
c3_d1 = mean(d1_3); % d1 value of the centroid for cluster 3
c3_d2 = mean(d2_3); % d2 value of the centroid for cluster 3
c3_d3 = mean(d3_3); % d3 value of the centroid for cluster 3
c3_d4 = mean(d4_3); % d4 value of the centroid for cluster 3
%% Calculating the distortion
distortion = 0; % Initialization
for n1 = 1 : length(d1_1)    
    distortion = distortion + ( ( ( c1_d1 - d1_1(n1) ).^2 ) + ( ( c1_d2 - d2_1(n1) ).^2 ) + ...
                                                    ( ( c1_d3 - d3_1(n1) ).^2 ) + ( ( c1_d4 - d4_1(n1) ).^2 ) );                                                 
end
for n2 = 1 : length(d1_2)    
    distortion = distortion + ( ( ( c2_d1 - d1_2(n2) ).^2 ) + ( ( c2_d2 - d2_2(n2) ).^2 ) + ...
                                                    ( ( c2_d3 - d3_2(n2) ).^2 ) + ( ( c2_d4 - d4_2(n2) ).^2 ) );                                                 
end
for n3 = 1 : length(d1_3)    
    distortion = distortion + ( ( ( c3_d1 - d1_3(n3) ).^2 ) + ( ( c3_d2 - d2_3(n3) ).^2 ) + ...
                                                    ( ( c3_d3 - d3_3(n3) ).^2 ) + ( ( c3_d4 - d4_3(n3) ).^2 ) );                                                 
end
fprintf('The unnormalized sum of square error is %f\n', distortion);
fprintf('The co-ordinate of the cluster 1 is \t d1 = %f, d2 = %f, d3 = %f, d4 = %f\n', c1_d1, c1_d2, c1_d3, c1_d4);
fprintf('The co-ordinate of the cluster 2 is \t d1 = %f, d2 = %f, d3 = %f, d4 = %f\n', c2_d1, c2_d2, c2_d3, c2_d4);
fprintf('The co-ordinate of the cluster 3 is \t d1 = %f, d2 = %f, d3 = %f, d4 = %f\n', c3_d1, c3_d2, c3_d3, c3_d4);
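
The same computation can be written more compactly so that it works for any number of clusters and dimensions. The following is only a sketch: the label vector idx is a hypothetical assignment chosen for illustration (in practice it would come from kmeans or from your own k-means implementation), and the names C, members and sse are not from the original answer. The SSE is the sum, over all clusters, of the squared Euclidean distances between each point and the centroid of its cluster.

```matlab
% Sketch: centroids and total SSE from a label vector, for any k and dimension.
X = [ 1  2  3  4;
      5  6  7  8;
      9 10 11 12;
     13 14 15 16;
     17 18 19 20];           % sample data from the question
idx = [1; 1; 2; 2; 3];       % hypothetical labels, for illustration only
k = max(idx);                % number of clusters

C = zeros(k, size(X, 2));    % one centroid per row
sse = 0;                     % running total of squared errors
for j = 1:k
    members = X(idx == j, :);          % points assigned to cluster j
    C(j, :) = mean(members, 1);        % column-wise mean (dim 1 matters for single-point clusters)
    diffs = bsxfun(@minus, members, C(j, :));   % point minus centroid, row by row
    sse = sse + sum(diffs(:).^2);      % add squared distances of this cluster
end

fprintf('Total SSE = %g\n', sse);      % prints "Total SSE = 64" for these labels
disp(C);                               % centroid coordinates
```

For these labels, clusters 1 and 2 each contribute 32 (two points, each 2 away from the centroid in all four dimensions) and the single-point cluster 3 contributes 0.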

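【Comments】:

You can verify the answer with the command '[idx,CentreCoordinates,SEE] = kmeans(X,3);'
Thank you for your comment and the code. I tried to run your code on my side to see what happens, but I get the error message "Variable index exceeds table dimensions." on line 6: 'd4 = table2array(data(:, 5)); % Data in fourth dimension'. Please let me know why I get this error. Thanks.
In my version, when I save the data and import it with 'data = readtable('data.txt');', the first column contains the data number, the second column the d1 values, the third the d2 values, the fourth the d3 values, and the fifth the d4 values. That is what 'd4 = table2array(data(:, 5));' refers to. I guess that in your version you did not save the data number, which is probably why you get the error.
Thanks. Now I can run it on my side.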
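
One way to sidestep the column-count mismatch discussed above is to take the last four columns of the table, regardless of whether a leading data-number column is present. This is a sketch under the assumption that the four data dimensions are always the final four columns; the tables withId and noId are built in memory here purely for illustration and stand in for what readtable would return.

```matlab
% Sketch: extract the 4 data columns whether or not an ID column is present.
% Assumption: the four data dimensions are always the last four columns.
raw = [(1:5)', reshape(1:20, 4, 5)'];   % column 1 = data number, columns 2:5 = d1..d4

withId = array2table(raw);              % stands in for a file saved with data numbers
noId   = array2table(raw(:, 2:5));      % stands in for a file saved without them

X1 = table2array(withId(:, end-3:end)); % last four columns of the 5-column table
X2 = table2array(noId(:, end-3:end));   % last four columns of the 4-column table

fprintf('withId has %d columns, noId has %d columns\n', width(withId), width(noId));
disp(isequal(X1, X2));                  % both give the same 5-by-4 data matrix
```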