【Question title】: Having trouble creating a scatter plot for my kmeans clustering data
【Posted】: 2020-11-18 16:21:19
【Question】:

I am trying to do some anomaly detection on a simple dataset using kmeans clustering.

I have some data in two variables, x and x_ax.
Here is a sample of the x data:

array([[44360.125],
       [56385.958333333336],
       [61500.5],
       [61227.375],
       [60049.333333333336],
       [51396.916666666664],
       [49225.208333333336],
       [63211.083333333336],
       [64631.916666666664],
       [62546.708333333336],
       [62825.125],

The x_ax data are timestamp values...

array([Timestamp('2018-01-01 00:00:00'), Timestamp('2018-01-02 00:00:00'),
   Timestamp('2018-01-03 00:00:00'), Timestamp('2018-01-04 00:00:00'),
   Timestamp('2018-01-05 00:00:00'), Timestamp('2018-01-06 00:00:00'),
   Timestamp('2018-01-07 00:00:00'), Timestamp('2018-01-08 00:00:00'),
   Timestamp('2018-01-09 00:00:00'), Timestamp('2018-01-10 00:00:00'),
   Timestamp('2018-01-11 00:00:00'), Timestamp('2018-01-12 00:00:00'),
   Timestamp('2018-01-13 00:00:00'), Timestamp('2018-01-14 00:00:00'),
   Timestamp('2018-01-15 00:00:00'), Timestamp('2018-01-16 00:00:00'),
   Timestamp('2018-01-17 00:00:00'), Timestamp('2018-01-18 00:00:00'),
   Timestamp('2018-01-19 00:00:00'), Timestamp('2018-01-20 00:00:00'),
   Timestamp('2018-01-21 00:00:00'), Timestamp('2018-01-22 00:00:00'),
   Timestamp('2018-01-23 00:00:00'), Timestamp('2018-01-24 00:00:00'),

The idea is that the first value in x corresponds to the first element in x_ax,

i.e. 2018-01-01 --> 44360.125

I instantiate a KMeans clustering instance:

 kmeans = KMeans(n_clusters=1, random_state=0).fit(x)
 center = kmeans.cluster_centers_

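I then calculate the distance from each point in x to the center, sort the distances, and extract the 5 largest distances from the center (i.e. my potential anomalies).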

distance = sqrt((x - center)**2)
order_index = argsort(distance, axis = 0)
indexes = order_index[-5:]
values = x[indexes]
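
For one-dimensional data, sqrt((x - center)**2) is simply the absolute difference, so the steps above can be sketched as follows. This is a minimal sketch rather than the asker's exact code: it assumes only the eleven sample values shown earlier, takes the mean as the center (which is what a single-cluster KMeans would return), and the helper name top_k_farthest is hypothetical.

```python
import numpy as np

def top_k_farthest(values, k):
    """Return positions of the k values farthest from the mean (hypothetical helper)."""
    values = np.asarray(values, dtype=float).ravel()  # work on a flat 1-D array
    center = values.mean()              # assumption: single-cluster center == mean
    distance = np.abs(values - center)  # same as sqrt((x - center)**2) in 1-D
    order = np.argsort(distance)        # positions sorted by ascending distance
    return order[-k:]                   # the last k positions are the farthest

# The eleven sample values from the question, as a column vector.
x = np.array([[44360.125], [56385.958333333336], [61500.5], [61227.375],
              [60049.333333333336], [51396.916666666664], [49225.208333333336],
              [63211.083333333336], [64631.916666666664], [62546.708333333336],
              [62825.125]])

idx = top_k_farthest(x, 5)
print(idx)        # positions into x (and therefore into x_ax)
print(x[idx, 0])  # the corresponding values
```

Because idx is a flat array of positions, the same positions can be used to look up the matching timestamps in x_ax.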

I then try to plot this data as a scatter plot, with the potential anomalies marked in red.

plt.plot(x_ax, x)
plt.scatter(indexes, values, color='r')
plt.show()

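Unfortunately, I get a plot that looks like this:

The y-axis scale seems right, but why does the x-axis scale range from 0 to 4000, with the first values in increments of 2000 and then a further increase of 400?

Also, why is my plot a straight line with all of the values in the top right-hand corner, apart from one red dot in the bottom left?

Any help is appreciated.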

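【Comments】:

  • plt.plot() creates a line plot, which is the line in the top right. What do you want on the x-axis?
  • Also, if you use KMeans with a single cluster, you are just getting the mean of the data, i.e. center = np.mean(x)

Tags: python pandas matplotlib scikit-learn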


【Solution 1】:

Use the following code:

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from datetime import datetime
import pandas as pd
x = np.array([[44360.125],
       [56385.958333333336],
       [61500.5],
       [61227.375],
       [60049.333333333336],
       [51396.916666666664],
       [49225.208333333336],
       [63211.083333333336],
       [64631.916666666664],
       [62546.708333333336],
       [62825.125]])


x_ax =np.array(['2018-01-01 00:00:00',  '2018-01-02 00:00:00',
    '2018-01-03 00:00:00',  '2018-01-04 00:00:00',
    '2018-01-05 00:00:00',  '2018-01-06 00:00:00',
    '2018-01-07 00:00:00',  '2018-01-08 00:00:00',
    '2018-01-09 00:00:00',  '2018-01-10 00:00:00',
    '2018-01-11 00:00:00'])

x_ax = np.array([p.replace(" 00:00:00","") for p in x_ax]) #Remove the 00:00:00
x_ax = [datetime.strptime(p,"%Y-%m-%d") for p in x_ax] #convert to datetime
x_ax = np.array([pd.Timestamp(p) for p in x_ax]) #Convert to Timestamp

x_ax
#array([Timestamp('2018-01-01 00:00:00'), Timestamp('2018-01-02 00:00:00'),
#       Timestamp('2018-01-03 00:00:00'), Timestamp('2018-01-04 00:00:00'),
#       Timestamp('2018-01-05 00:00:00'), Timestamp('2018-01-06 00:00:00'),
#       Timestamp('2018-01-07 00:00:00'), Timestamp('2018-01-08 00:00:00'),
#       Timestamp('2018-01-09 00:00:00'), Timestamp('2018-01-10 00:00:00'),
#       Timestamp('2018-01-11 00:00:00')], dtype=object)

#Fit kmeans
kmeans = KMeans(n_clusters=1, random_state=0).fit(x)
center = kmeans.cluster_centers_

#Get outliers
distance = np.sqrt((x - center)**2)
order_index = np.argsort(distance, axis = 0)
indexes = order_index[-5:].flatten() #added .flatten()
values = x[indexes].flatten() #added .flatten()

#Plot it 
fig,ax = plt.subplots(1,1,figsize=(15,10))
plt.plot(x_ax,x)
plt.scatter(x_ax[indexes], values, color='r')
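
The .flatten() calls matter because np.argsort(..., axis=0) on a column vector returns a column of positions, and indexing with a 2-D array adds an extra dimension. The following sketch only illustrates the shapes involved; the seven values in it are made up for the example and are not the question's data.

```python
import numpy as np

# Hypothetical column vector, shaped like x in the question: (n, 1).
x = np.array([[1.0], [5.0], [2.0], [9.0], [4.0], [7.0], [3.0]])

distance = np.abs(x - x.mean())             # shape (7, 1)
order_index = np.argsort(distance, axis=0)  # shape (7, 1): a column of positions

nested = order_index[-3:]                   # shape (3, 1)
flat = order_index[-3:].flatten()           # shape (3,)

print(x[nested].shape)          # (3, 1, 1): indexing with a 2-D array nests the result
print(x[flat].flatten().shape)  # (3,): flat positions and flat values
```

With flat positions, x_ax[indexes] and values have matching 1-D shapes, which is the least surprising input for plt.scatter.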

I get a plot with the correct x-axis.
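
One likely reason the original plot looked wrong: Matplotlib places dates on an axis as floating-point day numbers, so a scatter drawn at integer positions such as 0 to 10 lands far away from a line drawn at timestamps. The sketch below makes that gap visible with matplotlib.dates.date2num. It assumes daily timestamps starting at 2018-01-01 as in the question, and the positions in indexes are illustrative. The exact day numbers depend on the Matplotlib version's date epoch, so none are shown here.

```python
import numpy as np
import pandas as pd
import matplotlib.dates as mdates

# Assumption: eleven daily timestamps, matching the sample data above.
x_ax = pd.date_range("2018-01-01", periods=11, freq="D")

# Matplotlib's internal numeric representation of these dates (days since an epoch).
date_numbers = mdates.date2num(x_ax.to_pydatetime())

# Illustrative integer positions, like the output of argsort.
indexes = np.array([7, 5, 8, 6, 0])

print(date_numbers.min(), date_numbers.max())  # large day numbers
print(indexes.min(), indexes.max())            # small integers: 0 and 8

# Indexing the timestamps with the positions yields x values on the date scale.
print(x_ax[indexes])
```

Passing x_ax[indexes] to plt.scatter keeps the red points on the same scale as the line.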

Note

K-means works by repeatedly updating the cluster means. When you have only one cluster, center is equivalent to the mean of the data:

print(np.mean(x)) # 57941.84090909091
print(center) # [[57941.84090909]]
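
To see why a single cluster collapses to the mean, here is a rough sketch of one assignment-and-update step of the standard Lloyd iteration for 1-D data. It is not scikit-learn's implementation; the function name lloyd_step and the starting center of 0.0 are assumptions made for illustration, and only the first five sample values are used.

```python
import numpy as np

def lloyd_step(points, centers):
    """One k-means iteration on 1-D data (hypothetical helper, not sklearn's code)."""
    points = np.asarray(points, dtype=float).ravel()
    centers = np.asarray(centers, dtype=float).ravel()
    # Assignment step: each point is labelled with its nearest center.
    labels = np.abs(points[:, None] - centers[None, :]).argmin(axis=1)
    # Update step: each center moves to the mean of the points assigned to it.
    new_centers = centers.copy()
    for j in range(len(centers)):
        members = points[labels == j]
        if members.size:
            new_centers[j] = members.mean()
    return new_centers, labels

# First five sample values from the question.
points = np.array([44360.125, 56385.958333333336, 61500.5,
                   61227.375, 60049.333333333336])

# With one center, every point is assigned to it, whatever the starting value.
new_centers, labels = lloyd_step(points, centers=[0.0])
print(new_centers, points.mean())
```

With a single center every point gets label 0, so the first update already moves the center to the overall mean and later iterations leave it unchanged.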

【Discussion】:

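    【Solution 2】:

    You just need to add fig, ax = plt.subplots(1,1) and everything works (note that the scatter call also uses x_ax[indexes] for the x positions instead of indexes):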

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt
    
    x = np.array([[44360.125],
                  [56385.958333333336],
                  [61500.5],
                  [61227.375],
                  [60049.333333333336],
                  [51396.916666666664],
                  [49225.208333333336],
                  [63211.083333333336],
                  [64631.916666666664],
                  [62546.708333333336],
                  [62825.125]])
    
    x_ax = np.array([pd.Timestamp('2018-01-01 00:00:00'),
                     pd.Timestamp('2018-01-02 00:00:00'),
                     pd.Timestamp('2018-01-03 00:00:00'),
                     pd.Timestamp('2018-01-04 00:00:00'),
                     pd.Timestamp('2018-01-05 00:00:00'),
                     pd.Timestamp('2018-01-06 00:00:00'),
                     pd.Timestamp('2018-01-07 00:00:00'),
                     pd.Timestamp('2018-01-08 00:00:00'),
                     pd.Timestamp('2018-01-09 00:00:00'),
                     pd.Timestamp('2018-01-10 00:00:00'),
                     pd.Timestamp('2018-01-11 00:00:00'),
                     pd.Timestamp('2018-01-12 00:00:00'),
                     pd.Timestamp('2018-01-13 00:00:00'),
                     pd.Timestamp('2018-01-14 00:00:00'),
                     pd.Timestamp('2018-01-15 00:00:00'),
                     pd.Timestamp('2018-01-16 00:00:00'),
                     pd.Timestamp('2018-01-17 00:00:00'),
                     pd.Timestamp('2018-01-18 00:00:00'),
                     pd.Timestamp('2018-01-19 00:00:00'),
                     pd.Timestamp('2018-01-20 00:00:00'), 
                     pd.Timestamp('2018-01-21 00:00:00'),
                     pd.Timestamp('2018-01-22 00:00:00'),
                     pd.Timestamp('2018-01-23 00:00:00'),
                     pd.Timestamp('2018-01-24 00:00:00')])
    
    x_ax=x_ax[:11]
    x_ax
    
    kmeans = KMeans(n_clusters=1, random_state=0).fit(x)
    center = kmeans.cluster_centers_
    
    distance = ((x - center)**2)**.5
    order_index = np.argsort(distance, axis = 0)
    indexes = order_index[-5:]
    values = x[indexes]
    
    fig, ax = plt.subplots(1,1) # Added
    plt.plot(x_ax,x)
    plt.scatter(x_ax[indexes], values, color='r')
    plt.show()
    

    Result:

    【Discussion】:
