Value error: expected input data X have 1 features, but got 2 features during Gaussian Mixture Models

【Question title】: Value error: expected input data X have 1 features, but got 2 features during Gaussian Mixture Models
【Posted】: 2021-08-13 18:00:10
【Question】:

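I am currently teaching myself how to use Gaussian Mixture Models to detect anomalies, but I have run into a problem. I have read many blog posts, but they don't seem to explain what each line means. I am trying to use two variables (r_max, b_max) to find out which points are outliers.

Here is my dataset (180 rows, 4 columns):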

r_max | b_max | SPAD | model
255.0 | 46.0  | 35.1   | Redmi 5A
198.0 | 36.0  | 32.5   | Vivo 1820
237.0 | 145.0 | 27.1   | CPH1920

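Note: my r_max and b_max values range from 0 to 255.

First, I filtered out model because it is a string, and we can't fit() strings, right? I also filtered out SPAD because it isn't needed. Then I converted the data to a multidimensional array for fit(), following the post "What is the correct way to fit a gaussian mixture model to single feature data?". Now I fit it with GaussianMixture(). I'm not sure how people know how many n_components to specify.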

dataf = df[['r_max', 'b_max']] 
dataf = np.array(dataf).reshape(-1,1) 

gmm = mixture.GaussianMixture()
gmm.fit(dataf)

When I print gmm.fit(dataf), I get this:

GaussianMixture()

Now I draw a contour plot to see which points are outliers. I used np.linspace(0, 255) because r_max and b_max range from 0 to 255.

X, Y = np.meshgrid(np.linspace(0, 255), np.linspace(0, 255))  
XX = np.array([X.ravel(), Y.ravel()]).T

Z = gmm.score_samples(XX)
Z = Z.reshape(X.shape)

CS = plt.contour(X, Y, Z, norm=LogNorm(vmin=1.0, vmax=100.0))
CB = plt.colorbar(CS, shrink=0.8, extend='both')
plt.scatter(dataf['r_max'].values,dataf['b_max'].values)
plt.title('log-likelihood trained using GMM')
plt.axis('tight')
plt.show()

However, I get this error on this line:

--> Z = gmm.score_samples(XX)
ValueError: Expected the input data X have 1 features, but got 2 features

When I print XX, I get this:

[[  0.           0.        ]
 [  5.20408163   0.        ]
 [ 10.40816327   0.        ]
 ...
 [244.59183673 255.        ]
 [249.79591837 255.        ]
 [255.         255.        ]]

When I print X, I get this:

[[  0.           5.20408163  10.40816327 ... 244.59183673 249.79591837
  255.        ]
 [  0.           5.20408163  10.40816327 ... 244.59183673 249.79591837
  255.        ]
 [  0.           5.20408163  10.40816327 ... 244.59183673 249.79591837
  255.        ]
 ...
 [  0.           5.20408163  10.40816327 ... 244.59183673 249.79591837
  255.        ]
 [  0.           5.20408163  10.40816327 ... 244.59183673 249.79591837
  255.        ]
 [  0.           5.20408163  10.40816327 ... 244.59183673 249.79591837
  255.        ]]

When I print Y, I get this:

[[  0.           0.           0.         ...   0.           0.
    0.        ]
 [  5.20408163   5.20408163   5.20408163 ...   5.20408163   5.20408163
    5.20408163]
 [ 10.40816327  10.40816327  10.40816327 ...  10.40816327  10.40816327
   10.40816327]
 ...
 [244.59183673 244.59183673 244.59183673 ... 244.59183673 244.59183673
  244.59183673]
 [249.79591837 249.79591837 249.79591837 ... 249.79591837 249.79591837
  249.79591837]
 [255.         255.         255.         ... 255.         255.
  255.        ]]

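I don't know why I get this error. I would also like to know whether it is possible to draw the contour plot using only one variable, r_max.

Edit: I updated my code following @ronkov's suggestion: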

dataf = final_df[['r_max',  'b_max', 'SPAD', 'model']]

dataf = dataf[['r_max', 'b_max']].values

gmm = mixture.GaussianMixture()
gmm.fit(dataf)

X, Y = np.meshgrid(np.linspace(0, 300), np.linspace(0, 300))
XX = np.array([X.ravel(), Y.ravel()]).T
Z = gmm.score_samples(XX)
Z = Z.reshape(X.shape)
# LogNorm only accepts positive values, so plot -Z
CS = plt.contour(X, Y, -Z, norm=LogNorm(vmax = 300.0), levels=np.logspace(0, 3, 10))
CB = plt.colorbar(CS, shrink=1.0, extend='both')
plt.scatter(dataf[:,0], dataf[:,1], marker = "x", cmap='viridis')
plt.title('log-likelihood trained using GMM')
plt.xlabel('r_max')
plt.ylabel('b_max')
plt.show()

This is what I got:

【Question comments】:

Tags: python gmm

【Answer 1】:

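I think the problem is that you reshape dataf into a single column. reshape(-1, 1) turns the 180 x 2 array into a 360 x 1 array, so the model is fitted on one feature, while the grid XX passed to score_samples() has two columns. To train the model, I would do this: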

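from sklearn.mixture import GaussianMixture

# Keep r_max and b_max as two columns: shape (n_samples, 2)
dataf = df[['r_max', 'b_max']].values
gmm = GaussianMixture()
gmm.fit(dataf)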
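
To make the shape difference concrete, here is a minimal sketch. The DataFrame below is synthetic stand-in data (180 random rows, not the asker's measurements), and the variable names `stacked`, `paired`, and `grid` are only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in for the asker's data: 180 rows, values in 0-255.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "r_max": rng.uniform(0, 255, 180),
    "b_max": rng.uniform(0, 255, 180),
})

stacked = np.array(df[["r_max", "b_max"]]).reshape(-1, 1)  # what the question did
paired = df[["r_max", "b_max"]].values                     # what this answer does

print(stacked.shape)  # (360, 1): every value becomes its own one-feature sample
print(paired.shape)   # (180, 2): one sample per row, two features

# A two-column grid point set, like XX in the question.
grid = np.array([[0.0, 0.0], [255.0, 255.0]])

# Model fitted on two features: scoring a two-column grid works.
gmm = GaussianMixture(random_state=0).fit(paired)
scores = gmm.score_samples(grid)
print(gmm.means_.shape)  # (1, 2): one component, two features

# Model fitted on the stacked column: scoring the same grid fails.
one_feature = GaussianMixture(random_state=0).fit(stacked)
error_message = None
try:
    one_feature.score_samples(grid)
except ValueError as err:
    # The exact wording of the message varies across scikit-learn versions.
    error_message = str(err)
    print("ValueError:", error_message)
```

The number of columns passed to fit() fixes the number of features the model expects afterwards, so the grid passed to score_samples() has to have the same number of columns.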

Another problem I ran into is that LogNorm seems to accept only positive values (ValueError: math domain error), so I would plot -Z:

CS = plt.contour(X, Y, -Z, norm=LogNorm(vmax=255))
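
The question also asks how people decide on n_components, which this answer does not cover. One common heuristic (an assumption on my part, not something from the thread) is to fit several candidate models and keep the one with the lowest BIC, which GaussianMixture exposes through its bic() method. A rough sketch on synthetic two-cluster data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical data: two synthetic blobs in the 0-255 range (not the asker's data).
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=[60, 60], scale=10, size=(90, 2)),
    rng.normal(loc=[200, 120], scale=15, size=(90, 2)),
])

# Fit one model per candidate component count and record its BIC (lower is better).
candidates = list(range(1, 6))
bics = [
    GaussianMixture(n_components=k, random_state=0).fit(data).bic(data)
    for k in candidates
]

best_k = candidates[int(np.argmin(bics))]
print(dict(zip(candidates, bics)))
print("lowest BIC at n_components =", best_k)
```

BIC penalises extra parameters, so it tends to avoid needlessly many components, but it is only a guide; checking the fitted contours against the scatter plot is still worthwhile.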

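【Comments】:

Thank you for your help! I have updated the code and it works. If I'm not mistaken, points outside the red cluster are outliers, right? How can I identify them and eventually exclude them from my dataset?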
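
One possible way to do what this comment asks, sketched under assumptions: score each row with score_samples() (the per-sample log-likelihood) and flag the rows whose score falls below a cutoff. The 5th-percentile cutoff and the synthetic DataFrame below are arbitrary illustrative choices, not values from the thread.

```python
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

# Hypothetical stand-in data: one synthetic cluster clipped to the 0-255 range.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    np.clip(rng.normal(loc=[200, 60], scale=[25, 20], size=(180, 2)), 0, 255),
    columns=["r_max", "b_max"],
)

features = df[["r_max", "b_max"]].values
gmm = GaussianMixture(random_state=0).fit(features)

# Log-likelihood of each row under the fitted model; low values are unlikely points.
log_density = gmm.score_samples(features)

# Assumed cutoff: treat the lowest 5% of scores as outliers.
threshold = np.percentile(log_density, 5)
is_outlier = log_density < threshold

outliers = df[is_outlier]   # rows flagged as outliers
cleaned = df[~is_outlier]   # dataset with those rows excluded
print(len(outliers), "flagged,", len(cleaned), "kept")
```

The cutoff is a judgment call: a percentile always flags a fixed share of the rows, whether or not they are truly anomalous, so it is worth checking the flagged points on the scatter plot before dropping them.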