如何在 scikit-learn 中获得有意义的 kmeans 结果答案

【问题标题】：How to get meaningful results of kmeans in scikit-learn如何在 scikit-learn 中获得有意义的 kmeans 结果
【发布时间】：2015-05-16 04:34:29
【问题描述】：

我有一个如下所示的数据集：

{'dns_query_count'：'11'，'http_hostnames_count'：'7'，'dest_port_count'：'3'，'ip_count'：'11'，'signature_count'：'0'，'src_ip'：'10.0 .64.42', 'http_user_agent_count': '2'}

这已经从 csv 转换为 dict

然后我使用 DictVectorizer 来转换它

from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer()
d = vec.fit_transform(data).toarray()

然后我尝试在上面使用 Kmeans

from sklearn.cluster import KMeans
k = KMeans(n_clusters=2).fit(d)

我的问题是如何获得关于我的数据的哪一行属于哪个集群的信息？

我希望得到这样的东西：

{'dns_query_count'：'11'，'http_hostnames_count'：'7'，'dest_port_count'：'3'，'ip_count'：'11'，'signature_count'：'0'，'src_ip'：'10.0 .64.42'，'http_user_agent_count'：'2'，集群：'1'}

谁能给我一个分步示例，如何从我展示的原始数据到包含它们所属集群的信息的相同数据？

例如，我在这个数据集上使用了 Weka，它向我展示了我想要的东西 - 我可以单击图表上的数据点并准确读取哪些数据点属于哪个集群。如何使用 sklearn 获得相似的结果？

【问题讨论】：

标签： python machine-learning scikit-learn k-means

【解决方案1】：

这将展示如何检索每行的集群 ID 和集群中心。我还测量了每行到每个质心的距离，因此您可以看到这些行已正确分配给集群。

In [1]:

import pandas as pd
from sklearn.cluster import KMeans
from numpy.random import random
from scipy.spatial.distance import euclidean

# I'm going to generate some random data so you can just copy this and see it work

random_data = []

for i in range(0,10):
    random_data.append({'dns_query_count': random(),
 'http_hostnames_count': random(),
 'dest_port_count': random(),
 'ip_count': random(),
 'signature_count': random(),
 'src_ip': random(),
 'http_user_agent_count': random()}
)

df = pd.DataFrame(random_data)

km = KMeans(n_clusters=2).fit(df)

df['cluster_id'] = km.labels_

# get the cluster centers and compute the distance from each point to the center
# this will show that all points are assigned to the correct cluster

def distance_to_centroid(row, centroid):
    row = row[['dns_query_count',
                'http_hostnames_count',
                'dest_port_count',
                'ip_count',
                'signature_count',
                'src_ip',
                'http_user_agent_count']]
    return euclidean(row, centroid)

# to get the cluster centers use km.cluster_centers_

df['distance_to_center0'] = df.apply(lambda r: distance_to_centroid(r,
    km.cluster_centers_[0]),1)

df['distance_to_center1'] = df.apply(lambda r: distance_to_centroid(r,
    km.cluster_centers_[1]),1)

df.head()

Out [1]:
   dest_port_count  dns_query_count  http_hostnames_count  \
0         0.516920         0.135925              0.090209   
1         0.528907         0.898578              0.752862   
2         0.426108         0.604251              0.524905   
3         0.373985         0.606492              0.503487   
4         0.319943         0.970707              0.707207   

   http_user_agent_count  ip_count  signature_count    src_ip  cluster_id  \
0               0.987878  0.808556         0.860859  0.642014           0   
1               0.417033  0.130365         0.067021  0.322509           1   
2               0.528679  0.216118         0.041491  0.522445           1   
3               0.780292  0.130404         0.048353  0.911599           1   
4               0.156117  0.719902         0.484865  0.752840           1   

   distance_to_center0  distance_to_center1  
0             0.846099             1.124509  
1             1.175765             0.760310  
2             0.970046             0.615725  
3             1.054555             0.946233  
4             0.640906             1.020849

http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit_predict

【讨论】：

我已经尝试过了——我刚刚得到了一个包含 1 和 0 的矩阵。我的意思是如何将它与原始数据一起使用？
我将在我的示例中添加更多代码，这会将所有内容联系在一起
谢谢。现在更清楚了。但是我现在遇到了另一个问题——我得到了几乎 50/50 的分割，这显然是不正确的。然而，Weka 显示了正确的结果，80/20 分割具有明显的异常值。这是为什么？我在 Weka 中使用 Simple k-means。
我对 weka 不够熟悉，无法告诉你它有什么不同。您可以尝试规范化数据，由于变量的规模，您可能会得到不一致的结果。 scikit-learn.org/stable/modules/generated/…
我已经尝试过 preprocessing.scale 并为 Kmeans 设置了固定的随机值，因此结果不会每次都改变。现在它看起来更准确，使用 71/28 分割。但奇怪的是，结果大多是错误的。属于较小簇的异常数据点由于某种原因进入大簇，而正常数据点进入较小簇。为什么会这样？