KMeans clustering - ValueError: n_samples=1 should be >= n_clusters

【Question title】: KMeans clustering - ValueError: n_samples=1 should be >= n_clusters
【Posted】: 2023-03-09 03:38:01
【Question description】:

I am running an experiment with three time-series datasets that have different characteristics. Their format is as follows.

    0.086206438,10
    0.086425551,12
    0.089227066,20
    0.089262508,24
    0.089744425,30
    0.090036815,40
    0.090054172,28
    0.090377569,28
    0.090514071,28
    0.090762872,28
    0.090912691,27

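The first column is a timestamp. For reproducibility, I am sharing the data here. Starting from the second column, I want to read the current row and compare it with the value in the previous row. If it is greater, I keep comparing. If the current value is smaller than the previous row's value, I want to divide the current (smaller) value by the previous (larger) value. Here is the code: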

import numpy as np
import matplotlib.pyplot as plt

protocols = {}

types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T
    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    protocols[protname] = {
        "col_time": col_time,
        "col_window": col_window,
        "quotient_times": quotient_times,
        "quotient": quotient,
    }

    plt.figure(); plt.clf()
    plt.plot(quotient_times,quotient, ".", label=protname, color="blue")
    plt.ylim(0, 1.0001)
    plt.title(protname)
    plt.xlabel("time")
    plt.ylabel("quotient")
    plt.legend()
    plt.show()

This produces the following three plots, one for each dataset I shared.
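As a small sanity check of the slicing logic, the same computation can be run on the eleven sample rows listed above. This is only a sketch: the arrays are typed in by hand rather than loaded from the CSV files.

```python
import numpy as np

# The eleven sample rows shown at the top of the question, typed in by hand.
col_time = np.array([0.086206438, 0.086425551, 0.089227066, 0.089262508,
                     0.089744425, 0.090036815, 0.090054172, 0.090377569,
                     0.090514071, 0.090762872, 0.090912691])
col_window = np.array([10, 12, 20, 24, 30, 40, 28, 28, 28, 28, 27], dtype=float)

trailing_window = col_window[:-1]  # previous value for each comparison
leading_window = col_window[1:]    # current value for each comparison

# Positions (in the trailing array) where the next value is smaller.
decreasing_inds = np.where(leading_window < trailing_window)[0]
quotient = leading_window[decreasing_inds] / trailing_window[decreasing_inds]
quotient_times = col_time[decreasing_inds]

print(decreasing_inds)  # [5 9]
print(quotient)         # 28/40 = 0.7 and 27/28, roughly 0.964
```

Note that `col_time[decreasing_inds]` picks the timestamp of the larger (previous) value in each pair; `col_time[decreasing_inds + 1]` would pick the timestamp of the smaller (current) one instead.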

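As the points in the plots produced by the code above show, data1 is quite consistent, with values around 1; data2 has two quotients (concentrated around either 0.5 or 0.8); and the values of data3 are concentrated around two values (roughly 0.5 or 0.7). Given a new data point (with quotient and quotient_times), I want to know which cluster it belongs to, by building each dataset from the two stacked, transformed features quotient and quotient_times. I am trying this with KMeans clustering as follows: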

from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(quotient)

But this gives me an error: ValueError: n_samples=1 should be >= n_clusters=3. How can we fix this error?

Update: sample quotient data = array([ 0.7 , 0.7 , 0.4973262 , 0.7008547 , 0.71287129, 0.704 , 0.49723757, 0.49723757, 0.70676692, 0.5 , 0.5 , 0.70754717, 0.5 , 0.49723757, 0.70322581, 0.5 , 0.49723757, 0.49723757, 0.5 , 0.49723757])

【Comments】:

Tags: python-3.x machine-learning scikit-learn cluster-analysis k-means

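【Answer 1】:

As it stands, your quotient variable is treated as one single sample. Here I get a different error message, probably because of a different Python/scikit-learn version, but the essence is the same: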

import numpy as np
quotient = np.array([ 0.7 , 0.7 , 0.4973262 , 0.7008547 , 0.71287129, 0.704 , 0.49723757, 0.49723757, 0.70676692, 0.5 , 0.5 , 0.70754717, 0.5 , 0.49723757, 0.70322581, 0.5 , 0.49723757, 0.49723757, 0.5 , 0.49723757])
quotient.shape
# (20,)

from sklearn.cluster import KMeans
k_means = KMeans(n_clusters=3, random_state=0)
k_means.fit(quotient)

This raises the following error:

ValueError: Expected 2D array, got 1D array instead:
array=[0.7        0.7        0.4973262  0.7008547  0.71287129 0.704
 0.49723757 0.49723757 0.70676692 0.5        0.5        0.70754717
 0.5        0.49723757 0.70322581 0.5        0.49723757 0.49723757
 0.5        0.49723757].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Despite the different wording, this is no different from yours: essentially it says that your data look like a single sample.
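The two reshapes named in the error message interpret the same numbers in opposite ways, since scikit-learn expects input of shape (n_samples, n_features). A minimal sketch on a shortened, hand-picked subset of the quotient values shows the resulting shapes, and that the single-row layout triggers the complaint about too few samples (`n_init=10` is set explicitly only to avoid version-dependent defaults):

```python
import numpy as np
from sklearn.cluster import KMeans

quotient = np.array([0.7, 0.4973262, 0.7008547, 0.5])  # shortened sample

as_column = quotient.reshape(-1, 1)  # 4 samples, 1 feature each
as_row = quotient.reshape(1, -1)     # 1 sample with 4 features

print(quotient.shape)   # (4,)
print(as_column.shape)  # (4, 1)
print(as_row.shape)     # (1, 4)

# With the row layout there is only one sample, fewer than n_clusters.
message = None
try:
    KMeans(n_clusters=3, n_init=10, random_state=0).fit(as_row)
except ValueError as err:
    message = str(err)
    print(message)  # complains that n_samples=1 is below n_clusters=3
```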

Following the first suggestion (i.e. treating quotient as a single feature, or column) resolves the issue:

k_means.fit(quotient.reshape(-1,1))
# result
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=0, tol=0.0001, verbose=0)
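Once the model is fitted, a new quotient can be assigned to a cluster with `predict`, which is what the question ultimately asks for. The sketch below makes a few assumptions that are not in the original post: it uses `n_clusters=2` because the sample quotients visibly fall into two groups, it sets `n_init=10` explicitly, and the new values 0.69 and 0.51 are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

quotient = np.array([0.7, 0.7, 0.4973262, 0.7008547, 0.71287129, 0.704,
                     0.49723757, 0.49723757, 0.70676692, 0.5, 0.5,
                     0.70754717, 0.5, 0.49723757, 0.70322581, 0.5,
                     0.49723757, 0.49723757, 0.5, 0.49723757])

X = quotient.reshape(-1, 1)  # one feature per sample

k_means = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(k_means.cluster_centers_.ravel())  # one centre near 0.5, one near 0.7

new_points = np.array([[0.69], [0.51]])  # hypothetical new quotients
new_labels = k_means.predict(new_points)
print(new_labels)  # cluster index for each new point
```

The cluster indices themselves are arbitrary; what matters is that a new point receives the same index as the training points it is closest to.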

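【Comments】:

【Answer 2】:

Please try the code below. A brief explanation of what I did:

First I built the dataset with sample = np.vstack((quotient_times, quotient)).T and standardized it, which makes clustering easier. Next, I applied DBSCAN with several hyperparameter settings (eps and min_samples) until I found the one that separated the points best. Finally, I plotted the data with their respective labels; since you are working with two-dimensional data, it is easy to visualize how well the clustering works.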

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

types = {"data1": "data1.csv", "data2": "data2.csv", "data3": "data3.csv"}

dataset = np.empty((0, 2))

for protname, fname in types.items():
    col_time,col_window = np.loadtxt(fname,delimiter=',').T

    trailing_window = col_window[:-1] # "past" values at a given index
    leading_window  = col_window[1:]  # "current" values at a given index
    decreasing_inds = np.where(leading_window < trailing_window)[0]
    quotient = leading_window[decreasing_inds]/trailing_window[decreasing_inds]
    quotient_times = col_time[decreasing_inds]

    sample = np.vstack((quotient_times, quotient)).T
    dataset = np.append(dataset, sample, axis=0)

scaler = StandardScaler()
dataset = scaler.fit_transform(dataset)

dbscan = DBSCAN(eps=0.6, min_samples=1)
dbscan.fit(dataset)

# Use the cluster label of each point as its colour.
colors = dbscan.labels_

plt.figure()
plt.title('Dataset 1,2,3')
plt.xlabel("time")
plt.ylabel("quotient")
plt.scatter(dataset[:, 0], dataset[:, 1], c=colors)
plt.show()
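The hyperparameter search mentioned above was done by hand; one possible way to make it systematic is to loop over candidate values and report how many clusters and noise points each setting yields. This is only a sketch: the data are a synthetic stand-in generated with NumPy (not the real CSV files), and the candidate grids for eps and min_samples are arbitrary choices.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the (quotient_times, quotient) pairs; NOT the real data.
rng = np.random.default_rng(0)
times = rng.uniform(0.0, 1.0, size=60)
quotients = np.concatenate([rng.normal(0.5, 0.005, 30),
                            rng.normal(0.7, 0.005, 30)])
dataset = StandardScaler().fit_transform(np.column_stack([times, quotients]))

results = []
for eps in (0.1, 0.3, 0.6, 1.0):
    for min_samples in (1, 3, 5):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(dataset)
        n_clusters = len(set(labels) - {-1})  # label -1 marks noise
        n_noise = int(np.sum(labels == -1))
        results.append((eps, min_samples, n_clusters, n_noise))
        print(f"eps={eps:<4} min_samples={min_samples}  "
              f"clusters={n_clusters}  noise={n_noise}")
```

With min_samples=1 every point counts as a core point, so DBSCAN never labels anything as noise and isolated points simply become their own clusters. Larger min_samples values let the algorithm mark outliers with the label -1.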

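【Comments】:

Thank you, you are awesome. But why do we get negative quotients? It should be a number between 0 and 1. Also, is it possible to plot all three of them in one figure, so we can see what the clusters look like?
Because I applied the scaler.fit_transform(dataset) snippet, which standardizes each column to zero mean and unit variance, so below-average values become negative. If you want to know more, please refer to Feature scaling. And yes, it is definitely possible: you only need to merge all the datasets before applying DBSCAN.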
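To see where the negative values come from, and how to get back to the original 0 to 1 range for plotting, here is a minimal sketch. The four (time, quotient) pairs are hypothetical values chosen for illustration, not taken from the shared datasets.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (time, quotient) pairs; the quotients all lie between 0 and 1.
dataset = np.array([[0.1, 0.70],
                    [0.2, 0.50],
                    [0.3, 0.71],
                    [0.4, 0.49]])

scaler = StandardScaler()
scaled = scaler.fit_transform(dataset)  # each column: mean 0, unit variance

print(scaled[:, 1])  # below-average quotients come out negative

# Undo the scaling to plot in the original units.
restored = scaler.inverse_transform(scaled)
print(np.allclose(restored, dataset))  # True
```

One option is therefore to cluster on the scaled data but pass the restored (or original) coordinates to the scatter plot, so the axes show real times and quotients.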