Plotting kmeans clustering on data with more than 2 dimensions

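【Question title】: Plot kmeans clustering on more than 2 dimensional data
【Posted】: 2021-08-02 18:00:55
【Question】:

I have a dataset with 6 columns, and after running KMeans on it I need to visualize the clustering result in a plot. I have six clusters. How can I do this? Here is my KMeans clustering code: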

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(psnr_bitrate)
kmeans = KMeans(init="random",n_clusters=6,n_init=10,max_iter=300,random_state=42)
kmeans.fit(scaled_features)
y_kmeans = kmeans.predict(scaled_features)

I found another post, "How to visualize kmeans clustering on multidimensional data", but I could not understand the solution because I don't know what `cluster` is in that code.

I used the following code:

from sklearn.preprocessing import StandardScaler
from sklearn import cluster

scaler = StandardScaler()
scaled_features = scaler.fit_transform(psnr_bitrate)
kmeans = KMeans(init="random",n_clusters=6,n_init=10,max_iter=300,random_state=42)
kmeans.fit(scaled_features)
y_kmeans = kmeans.predict(scaled_features)
scaled_features['cluster'] = y_kmeans
pd.tools.plotting.parallel_coordinates(scaled_features, 'cluster')

It produces this error:

Traceback (most recent call last):

  File "<ipython-input-77-2e66d8a57100>", line 7, in <module>
    scaled_features['cluster'] = y_kmeans

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

The input data I use for clustering is a numpy array like this:

31.764833 35.632833 38.088500 39.877250 41.331917 42.923750
29.832750 34.567500 37.527417 39.621000 41.412583 43.023917
36.777167 41.151333 44.122500 46.237167 47.879083 49.832250
46.871500 52.006333 54.784583 57.099417 58.767833 60.674667

It has 6 columns and 1301 rows, but my columns have no names.

【Comments】:

`cluster` in that code corresponds to `from sklearn import cluster`.
No, I don't think that's right, because in the answer's code we have this: from sklearn.preprocessing import StandardScaler scaler = StandardScaler() scaled_features = scaler.fit_transform(psnr_bitrate) kmeans = KMeans(init="random",n_clusters=6,n_init=10,max_iter=300, random_state=42) kmeans.fit(scaled_features) y_kmeans = kmeans.predict(scaled_features) scaled_features['cluster'] = y_kmeans pd.tools.plotting.parallel_coordinates(scaled_features, 'cluster')
and `cluster` is used as a column, I think.
Yes, the string "cluster" is used in the solution as the column name of a pandas DataFrame. I still don't understand what it is you don't understand...
I used the code above, and using `cluster` produces an error. Please see the new code I added above.

Tags: python python-3.x pandas matplotlib plot

【Solution 1】:

scaled_features is a numpy array, and an array cannot be indexed with a string. You need to convert it to a DataFrame first:

scaled_features = pd.DataFrame(scaled_features)
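
To make the difference concrete, here is a minimal sketch using the four sample rows from the question. The cluster labels and the column names `col1` to `col6` are made up for illustration; they are not from the original data.

```python
import numpy as np
import pandas as pd

# The four sample rows shown in the question
features = np.array([
    [31.764833, 35.632833, 38.088500, 39.877250, 41.331917, 42.923750],
    [29.832750, 34.567500, 37.527417, 39.621000, 41.412583, 43.023917],
    [36.777167, 41.151333, 44.122500, 46.237167, 47.879083, 49.832250],
    [46.871500, 52.006333, 54.784583, 57.099417, 58.767833, 60.674667],
])
labels = np.array([0, 0, 1, 2])  # hypothetical cluster labels

# A plain numpy array rejects string keys, which is the IndexError above
error_message = None
try:
    features["cluster"] = labels
except IndexError as exc:
    error_message = str(exc)

# A DataFrame accepts string keys; the column names here are placeholders
columns = [f"col{i}" for i in range(1, features.shape[1] + 1)]
df = pd.DataFrame(features, columns=columns)
df["cluster"] = labels
```

Naming the columns is optional, but string names give readable axis labels in later plots.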

【Comments】:

【Solution 2】:

A few things: pd.tools.plotting.parallel_coordinates should be pd.plotting.parallel_coordinates in later versions of pandas, and it is easier if you make your predictors a DataFrame, for example:

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.decomposition import PCA

# load the iris data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target

scaler = StandardScaler()
scaled_features = pd.DataFrame(scaler.fit_transform(X))

If you can, give the columns names:

scaled_features.columns = iris.feature_names

Run KMeans and assign the clusters:

kmeans = KMeans(init="random",n_clusters=6,n_init=10,max_iter=300,random_state=42)
kmeans.fit(scaled_features)

scaled_features['cluster'] = kmeans.predict(scaled_features)

Plot:

pd.plotting.parallel_coordinates(scaled_features, 'cluster')
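
If the cluster centers should appear on the same parallel-coordinates plot, one possible approach is to put `kmeans.cluster_centers_` into a second DataFrame with the same columns and draw it on the same axes with thicker lines. This is only a sketch under assumptions: the data below is random stand-in data, the column names `f0` to `f5` are placeholders, and 3 clusters are used to keep the plot readable.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Random stand-in for a 6-column dataset (assumption, not the asker's data)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))

cols = [f"f{i}" for i in range(X.shape[1])]  # placeholder column names
scaled = pd.DataFrame(StandardScaler().fit_transform(X), columns=cols)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled)
scaled["cluster"] = kmeans.labels_

# One row per centroid, with the same feature columns as the data
centers = pd.DataFrame(kmeans.cluster_centers_, columns=cols)
centers["cluster"] = range(len(centers))

fig, ax = plt.subplots()
# Faint lines for the individual samples
pd.plotting.parallel_coordinates(
    scaled, "cluster", colormap="tab10", sort_labels=True, alpha=0.2, ax=ax
)
# Thick lines for the centroids; sort_labels=True keeps colors aligned
pd.plotting.parallel_coordinates(
    centers, "cluster", colormap="tab10", sort_labels=True,
    linewidth=3, axvlines=False, ax=ax
)
```

Call `plt.show()` to display the figure. The legend lists each cluster twice, once for the samples and once for its centroid.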

Or apply some dimensionality reduction to your features and plot the result:

from sklearn.manifold import MDS
import seaborn as sns

embedding = MDS(n_components=2)
mds = pd.DataFrame(embedding.fit_transform(scaled_features.drop('cluster',axis=1)),
             columns = ['component1','component2'])
mds['cluster'] = kmeans.predict(scaled_features.drop('cluster',axis=1))

sns.scatterplot(data=mds,x = "component1",y="component2",hue="cluster")
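
As an alternative to MDS, PCA can also be used for the 2D projection. scikit-learn's `MDS` has no `transform` method for new points, whereas a fitted `PCA` can project the cluster centers into the same 2D space as the samples. A minimal sketch, assuming random stand-in data and 3 clusters (neither is from the original post):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-in for a 6-column dataset (assumption, not the asker's data)
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(scaled)

# Fit PCA on the samples, then reuse it to project the centroids
pca = PCA(n_components=2).fit(scaled)
points_2d = pd.DataFrame(pca.transform(scaled), columns=["pc1", "pc2"])
points_2d["cluster"] = kmeans.labels_
centers_2d = pca.transform(kmeans.cluster_centers_)

fig, ax = plt.subplots()
ax.scatter(points_2d["pc1"], points_2d["pc2"],
           c=points_2d["cluster"], cmap="tab10", s=10)
# Centroids drawn as large black X markers on top of the samples
ax.scatter(centers_2d[:, 0], centers_2d[:, 1], marker="X", s=200, c="black")
ax.set_xlabel("pc1")
ax.set_ylabel("pc2")
```

Two components usually retain only part of the variance of 6-dimensional data, so clusters that are separate in the full space may overlap in this view.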

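【Comments】:

Is there a way to show the cluster centers on these plots using parallel coordinates?
Could you tell me whether there is any way to show the cluster centers on the plot you drew above? You plot with pd.plotting.parallel_coordinates(scaled_features, 'cluster'), and I don't know how to show the cluster centers on that plot.
I can try later; I'm busy with work right now.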