How to cluster data with user_id - k-means algorithm

【Question title】: How to cluster data with user_id - k-means algorithm
【Posted】: 2019-05-03 12:08:30
【Question】:

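I want to cluster users' data by user_id, because after clustering I need to analyze each cluster. My clustering algorithm is k-means with k=3. I am using Python.

My data:

I removed the user_id column from this data. As far as I know, user_id should be removed before running k-means clustering.

My Python code: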

# -*- coding: utf-8 -*-
"""
Spyder Editor

This is a temporary script file.
"""

from copy import deepcopy
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')

# Importing the dataset
data = pd.read_csv('C:/Users/S.M_Emamian/Desktop/xclara.csv')
print("Input Data and Shape")
print(data.shape)
data.head()

# Getting the values and plotting it
f1 = data['V1'].values
f2 = data['V2'].values
X = np.array(list(zip(f1, f2)))
plt.scatter(f1, f2, c='black', s=7)

# Euclidean Distance Calculator
def dist(a, b, ax=1):
    return np.linalg.norm(a - b, axis=ax)

# Number of clusters
k = 3
# X coordinates of random centroids
C_x = np.random.randint(0, np.max(X)-20, size=k)
# Y coordinates of random centroids
C_y = np.random.randint(0, np.max(X)-20, size=k)
C = np.array(list(zip(C_x, C_y)), dtype=np.float32)
print("Initial Centroids")
print(C)

# Plotting along with the Centroids
plt.scatter(f1, f2, c='#050505', s=7)
plt.scatter(C_x, C_y, marker='*', s=200, c='g')

# To store the value of centroids when it updates
C_old = np.zeros(C.shape)
# Cluster Labels (0, 1, 2)
clusters = np.zeros(len(X))
# Error func. - Distance between new centroids and old centroids
error = dist(C, C_old, None)
# Loop will run till the error becomes zero
while error != 0:
    # Assigning each value to its closest cluster
    for i in range(len(X)):
        distances = dist(X[i], C)
        cluster = np.argmin(distances)
        clusters[i] = cluster
    # Storing the old centroid values
    C_old = deepcopy(C)
    # Finding the new centroids by taking the average value
    for i in range(k):
        points = [X[j] for j in range(len(X)) if clusters[j] == i]
        C[i] = np.mean(points, axis=0)
    error = dist(C, C_old, None)

colors = ['r', 'g', 'b', 'y', 'c', 'm']
fig, ax = plt.subplots()
for i in range(k):
    points = np.array([X[j] for j in range(len(X)) if clusters[j] == i])
    ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
ax.scatter(C[:, 0], C[:, 1], marker='*', s=200, c='#050505')



'''
==========================================================
scikit-learn
==========================================================
'''

from sklearn.cluster import KMeans

# Number of clusters
kmeans = KMeans(n_clusters=3)
# Fitting the input data
kmeans = kmeans.fit(X)
# Getting the cluster labels
labels = kmeans.predict(X)
# Centroid values
centroids = kmeans.cluster_centers_

# Comparing with scikit-learn centroids
print("Centroid values")
print("Scratch")
print(C) # From Scratch
print("sklearn")
print(centroids) # From sci-kit learn

My code works well, and it also visualizes my data.

But I need to keep user_id.

For example, I want to know which cluster user_id=5 belongs to.

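【Comments】:

K-means clusters using Euclidean distance. Using user_id as a clustering feature is therefore not a good idea, because the Euclidean distance between user_id values is meaningless. You can cluster your data as usual and use user_id only to identify each sample.
As I understand it, you have the user_id column somewhere and simply do not feed it to the clustering algorithm (which is correct). Can you state the problem more specifically?
I want to know which cluster user_id=5 belongs to.
I ran into a similar problem and figured out how to solve it -> k_means output ranked by user_id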
export the output of k-means algorithm with the ids in the original data
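
The first comment's point about Euclidean distance can be illustrated with a small sketch. The two rows below are made-up values, not the asker's data: two hypothetical users with identical features but very different user_id values.

```python
import numpy as np

# Hypothetical rows laid out as [user_id, V1, V2].
# The two users have identical features and differ only in user_id.
a = np.array([1.0, 2.5, 3.0])
b = np.array([1000.0, 2.5, 3.0])

# Distance when user_id is (wrongly) treated as a feature.
dist_with_id = np.linalg.norm(a - b)

# Distance computed on the real features only.
dist_features_only = np.linalg.norm(a[1:] - b[1:])

print(dist_with_id, dist_features_only)
```

With user_id included, the identifier dominates the distance even though the two users behave identically, which is why it is left out of the feature matrix.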

Tags: python cluster-analysis k-means data-science

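【Solution 1】:

Just add the user_id back after clustering.

Actually, you probably want to do the opposite: add the cluster labels to your original data, which still has the user_id column.

As long as you don't change the order of the data, this is a trivial stacking operation: the i-th label belongs to the i-th row.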
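
A minimal sketch of this stacking step, assuming the original DataFrame has a `user_id` column next to the `V1` and `V2` feature columns used in the question. The synthetic data and the `cluster` column name are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical stand-in for the asker's data: user_id plus features V1 and V2,
# drawn as three well-separated groups of 10 users each.
rng = np.random.default_rng(0)
centers = np.repeat([0.0, 20.0, 40.0], 10)
data = pd.DataFrame({
    "user_id": np.arange(1, 31),
    "V1": centers + rng.normal(0, 1, 30),
    "V2": centers + rng.normal(0, 1, 30),
})

# Cluster on the feature columns only; user_id stays behind in `data`.
X = data[["V1", "V2"]].values
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# labels_ has the same row order as X, so it can be attached as a new column.
data["cluster"] = kmeans.labels_

# Which cluster does user_id == 5 belong to?
cluster_of_user_5 = data.loc[data["user_id"] == 5, "cluster"].iloc[0]
print(cluster_of_user_5)

# Per-cluster analysis, e.g. size and mean of each feature.
summary = data.groupby("cluster")[["V1", "V2"]].agg(["count", "mean"])
print(summary)
```

The cluster numbers themselves are arbitrary and may differ between runs or library versions; what matters is that each row keeps its user_id alongside its label, so any later filtering or `groupby` can use both.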

【Comments】:

Thanks for your reply. Please explain your answer. What do you mean by "ads the cluster label to your original data that still has the cluster labels" or "As long as you don't change the data order this is a trivial stacking operation"?