Can GridSearchCV be used for unsupervised learning? Answers

【Question title】: Can GridSearchCV be used for unsupervised learning?
【Posted】: 2022-10-25 20:53:09
【Question】:

I am trying to build an outlier detector to find outliers in test data. The data vary slightly from test to test (more test channels, longer tests).

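First I apply a train-test split, because I want to run a grid search on the training data to get the best results. This is time series data from multiple sensors, and I removed the time column beforehand.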

X shape : (25433, 17)
y shape : (25433, 1)

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    random_state=(0))

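After that I standardize the data and then convert it to int arrays, because GridSearch doesn't seem to like continuous data. This can certainly be done better, but I want to get it working before I optimize the code.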

'X'
mean = StandardScaler().fit(X_train)
X_train = mean.transform(X_train)
X_test = mean.transform(X_test)

X_train = np.round(X_train,2)*100
X_train = X_train.astype(int)
X_test = np.round(X_test,2)*100
X_test = X_test.astype(int)

'y'
yeah = StandardScaler().fit(y_train)
y_train = yeah.transform(y_train)
y_test = yeah.transform(y_test)
y_train = np.round(y_train,2)*100
y_train = y_train.astype(int)
y_test = np.round(y_test,2)*100
y_test = y_test.astype(int)

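I chose Isolation Forest because it is fast, gives very good results, and can handle large datasets (I am currently using only a subset of the data for testing). SVM might also be an option I want to check. Then I set up GridSearchCV: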

clf = IForest(random_state=47, behaviour='new',
              n_jobs=-1)

param_grid = {'n_estimators': [20,40,70,100], 
              'max_samples': [10,20,40,60], 
              'contamination': [0.1, 0.01, 0.001], 
              'max_features': [5,15,30], 
              'bootstrap': [True, False]}

fbeta = make_scorer(fbeta_score,
                    average = 'micro',
                    needs_proba=True,
                    beta=1)

grid_estimator = model_selection.GridSearchCV(clf, 
                                              param_grid,
                                              scoring=fbeta,
                                              cv=5,
                                              n_jobs=-1,
                                              return_train_score=True,
                                              error_score='raise',
                                              verbose=3)

grid_estimator.fit(X_train, y_train)

The problem:

GridSearchCV needs a y argument, so I assume it only works for supervised learning? If I run it, I get the following error, which I don't understand:

ValueError: Classification metrics can't handle a mix of multiclass and continuous-multioutput targets

【Comments】:

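What is the type of y_train, and what type does clf.predict return? Are they compatible with each other?
y_train is a 2-D int32 array, and clf.predict is a method of IForest. These should definitely work together, because I have already used IForest without GridSearchCV.
OK. You should provide a reproducible example. At the moment the code is incomplete: X and y are not given, and the import lines are missing.
We need more information. You say you are doing unsupervised learning, but you have targets y, and they are continuous. You are trying to use F-beta, which is a (hard) classification metric, and you are trying to pass it probability scores. What are you actually trying to accomplish, and how do you measure success?
I am not allowed to publish the data... I will try to provide as much information as I can. The data are floats, multimodal, ranging between -0,8 and 40.000. I used a y target because GridSearch otherwise gives me a missing y_true label error. That is why I asked whether GridSearch can only be used for supervised learning.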

Tags: python machine-learning outliers grid-search isolation-forest

【Solution 1】:

You can use GridSearchCV for unsupervised learning, but it is often hard to define a scoring metric that is meaningful for the problem.

Here's an example in the docs that runs a grid search over KernelDensity, an unsupervised estimator. It works without any extra setup because this estimator has a score method (docs).
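As a minimal sketch of that pattern (not the docs example itself), the following fits a grid search over KernelDensity without passing any y. The synthetic data and the bandwidth grid are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

# Synthetic, unlabeled data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))

# No `scoring` argument: GridSearchCV falls back to KernelDensity.score,
# which returns the total log-likelihood of the held-out fold.
bandwidths = [0.1, 0.3, 1.0, 3.0]  # illustrative grid
search = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    {"bandwidth": bandwidths},
    cv=5,
)
search.fit(X)  # y is omitted entirely

best_bandwidth = search.best_params_["bandwidth"]
print(best_bandwidth)
```

The point is only that fit accepts X alone when the estimator (or a custom scorer) can evaluate a fold without labels.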

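In your case, since IsolationForest does not have a score method, you need to define a custom scorer and pass it as the search's scoring argument. There are answers to this question and this question, but I don't think the metrics given there are necessarily meaningful. Unfortunately, I don't have a useful outlier-detection metric in mind. That is a question better suited to the Data Science or Statistics Stack Exchange sites.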
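To show only the mechanics, here is a sketch that passes a callable with the signature (estimator, X, y=None) as scoring, using scikit-learn's IsolationForest rather than the IForest wrapper from the question. The function name mean_score_samples, the synthetic data, and the parameter grid are all hypothetical, and the metric (mean of score_samples on the held-out fold) is a placeholder, not a recommended outlier-detection metric. Note that contamination does not change score_samples, so this placeholder could not be used to tune it.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV


def mean_score_samples(estimator, X, y=None):
    """Hypothetical label-free scorer: mean score_samples on the held-out fold.

    Higher values mean the fold looks more "normal" to the fitted model.
    This is a placeholder metric, not a validated one.
    """
    return float(np.mean(estimator.score_samples(X)))


# Synthetic, unlabeled data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))

param_grid = {"n_estimators": [20, 50], "max_samples": [32, 64]}  # illustrative
search = GridSearchCV(
    IsolationForest(random_state=0),
    param_grid,
    scoring=mean_score_samples,  # callable scorer; y defaults to None
    cv=3,
)
search.fit(X)  # no y needed because the scorer ignores labels

print(search.best_params_, search.best_score_)
```

Because the scorer ignores y, the search no longer asks for labels, and the classification-metric error from the question does not arise. Whether the selected parameters are actually good still depends on finding a metric that reflects the goal.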
