Clustering with different colours and labels: answers

【Question title】: Cluster using different colours and labels
【Posted】: 2020-09-09 22:40:12
【Question】:

I am working on text clustering and need to plot the data using different colours. I use k-means for the clustering and TF-IDF for the similarity.

kmeans_labels =KMeans(n_clusters=3).fit(vectorized_text).labels_

pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).todense()

pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

plt.scatter(data2D[:,0], data2D[:,1])

kmeans.fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels=np.array([kmeans.labels_])

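Currently my output is a plain scatter plot with only a few points, because this is a test. I need to add labels (they are strings) and distinguish the points by cluster: each cluster should have its own colour so that the chart is easy for the reader to analyse.

Could you tell me how to change the code to include both labels and colours? Any example would be great.

A sample of my dataset (the output above was generated from a different sample):

Sentences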

Where do we do list them? ...
Make me a list of the things we would need and I'll take you into town. ...
Do you have a list yet? ...
The first was a list for Howie. ...
You're not on my list tonight. ...
I'm gonna print this list on my computer, given you're always bellyaching about my writing.

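【Comments】:

I see a perfect use case for plotly here. Would you mind providing an mcve? At a minimum, your original df with a cluster column.
Does this help? adding colors and labels
@rpanai, please see the updated question.
@CarlosAzevedo, how can I edit my code accordingly?
@still_learning I have posted it as an answer.

Tags: python matplotlib cluster-analysis k-means tf-idf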

【Solution 1】:

We can use an example dataset:

from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Three newsgroup categories stand in for the asker's sentences
newsgroups = fetch_20newsgroups(subset='train',
                                categories=['talk.religion.misc', 'sci.space', 'misc.forsale'])
X_train = newsgroups.data
y_train = newsgroups.target

# TF-IDF features; toarray() returns a plain ndarray, which PCA accepts
pipeline = Pipeline([('tfidf', TfidfVectorizer(max_features=5000))])
X = pipeline.fit_transform(X_train).toarray()

# Project the TF-IDF vectors to 2D for plotting
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

Run KMeans as you did to get the clusters and centers, then simply add a name for each cluster:

kmeans = KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
labels = kmeans.labels_
# Name the clusters in id order so that index i matches centers2D[i]
cluster_name = ["Cluster" + str(i) for i in sorted(set(labels))]

You can add colour by passing the cluster labels to "c=" and either picking a colormap from cm or defining your own map:

plt.scatter(data2D[:,0], data2D[:,1],c=labels,cmap='Set3',alpha=0.7)
for i, txt in enumerate(cluster_name):
    plt.text(centers2D[i,0], centers2D[i,1],s=txt,ha="center",va="center")
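
For the "define your own map" option, one possible sketch uses matplotlib's ListedColormap with one colour per cluster id. The synthetic points, the colour choices, and the names points, cluster_ids and custom_cmap are assumptions made for illustration; they stand in for data2D and labels above.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap

# Stand-in data (assumption): 60 random 2-D points with cluster ids 0..2,
# playing the role of data2D and labels.
rng = np.random.default_rng(0)
points = rng.normal(size=(60, 2))
cluster_ids = np.repeat([0, 1, 2], 20)

# One colour per cluster id, in order: 0 -> red, 1 -> blue, 2 -> green.
custom_cmap = ListedColormap(["tab:red", "tab:blue", "tab:green"])

fig, ax = plt.subplots()
sc = ax.scatter(points[:, 0], points[:, 1], c=cluster_ids,
                cmap=custom_cmap, vmin=0, vmax=2, alpha=0.7)

# legend_elements() yields one handle per distinct value passed to c=
handles, _ = sc.legend_elements()
ax.legend(handles, [f"Cluster{i}" for i in range(3)], title="KMeans cluster")
# Call plt.show() or fig.savefig(...) to view the figure.
```

Fixing vmin and vmax keeps the id-to-colour mapping stable even if one cluster happens to be empty in a given run.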

You could also consider using seaborn:

sns.scatterplot(x=data2D[:, 0], y=data2D[:, 1], hue=labels, legend='full', palette="Set1")
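
If the string labels are meant to sit next to individual points rather than at the cluster centers, one possible sketch is to annotate each point with its own sentence. The example reuses the six sample sentences from the question as toy data; n_clusters=3, random_state=0, the 25-character truncation and the text offset are illustrative assumptions, not part of the original answer.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample sentences copied from the question (toy data)
sentences = [
    "Where do we do list them? ...",
    "Make me a list of the things we would need and I'll take you into town. ...",
    "Do you have a list yet? ...",
    "The first was a list for Howie. ...",
    "You're not on my list tonight. ...",
    "I'm gonna print this list on my computer, given you're always bellyaching about my writing.",
]

# TF-IDF vectors as a dense ndarray, clustered and projected to 2D
X = TfidfVectorizer().fit_transform(sentences).toarray()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
data2D = PCA(n_components=2).fit_transform(X)

fig, ax = plt.subplots()
ax.scatter(data2D[:, 0], data2D[:, 1], c=kmeans.labels_, cmap="Set1")

# Write a shortened copy of each sentence next to its point
for (x, y), text in zip(data2D, sentences):
    ax.annotate(text[:25], (x, y), xytext=(4, 4),
                textcoords="offset points", fontsize=8)
# Call plt.show() or fig.savefig(...) to view the figure.
```

With more than a few dozen points the annotations will overlap, so this approach suits small test samples like the one in the question.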

【Comments】:

【Solution 2】:

Starting from your code, try the following:

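import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# X_train is the DataFrame from the question, with the text in a 'Sentences' column
pipeline = Pipeline([('tfidf', TfidfVectorizer())])
X = pipeline.fit_transform(X_train['Sentences']).toarray()

# Project the TF-IDF vectors to 2D for plotting
pca = PCA(n_components=2).fit(X)
data2D = pca.transform(X)

# Cluster in the full TF-IDF space, then project the centers to 2D
kmeans = KMeans(n_clusters=3).fit(X)
centers2D = pca.transform(kmeans.cluster_centers_)
group = kmeans.labels_

# One colour and one legend label per cluster id
cdict = {0: 'red', 1: 'blue', 2: 'green'}
ldict = {0: 'label_1', 1: 'label_2', 2: 'label_3'}

fig, ax = plt.subplots()
for g in np.unique(group):
    ix = group == g  # boolean mask selecting the points in cluster g
    ax.scatter(data2D[ix, 0], data2D[ix, 1], c=cdict[g], label=ldict[g], s=100)
ax.legend()
plt.show()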

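I am assuming your KMeans uses n_clusters=3. cdict and ldict need one entry per cluster, so adjust them to match the number of clusters. Here, cluster 0 is drawn in red with the label label_1, cluster 1 in blue with the label label_2, and so on.

Edit: changed the keys of cdict to start at 0. Edit 2: added labels.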
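
Since cdict and ldict have to match the number of clusters, they could also be generated rather than typed by hand. The following is a minimal sketch under assumptions: build_cluster_dicts is a hypothetical helper name, the "tab10" colormap is an arbitrary choice, and the label_N naming simply mirrors the dictionaries above.

```python
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import to_hex


def build_cluster_dicts(group, cmap_name="tab10"):
    """Return (cdict, ldict) with one colour and one label per cluster id."""
    cluster_ids = np.unique(group)
    cmap = plt.get_cmap(cmap_name)
    # Cycle through the colormap entries; convert to hex strings so each
    # value can be passed to scatter as a single colour.
    cdict = {int(g): to_hex(cmap(i % cmap.N)) for i, g in enumerate(cluster_ids)}
    ldict = {int(g): f"label_{int(g) + 1}" for g in cluster_ids}
    return cdict, ldict


# Usage with a made-up label array standing in for kmeans.labels_
group = np.array([0, 2, 1, 1, 0, 2])
cdict, ldict = build_cluster_dicts(group)
```

The resulting dictionaries can be used in the plotting loop exactly like the hand-written ones. With more clusters than the colormap has entries, colours repeat, so a larger colormap would be needed in that case.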

【Comments】: