Nested cross-validation in grid search for precomputed kernels in scikit-learn: answers

【Question title】: Nested cross-validation in grid search for precomputed kernels in scikit-learn
【Posted】: 2014-08-27 01:36:05
【Question】:

I have a precomputed kernel of size NxN. I am using GridSearchCV to tune the C parameter of an SVM with kernel='precomputed', as follows:

C_range = 10. ** np.arange(-2, 9)
param_grid = dict(C=C_range)
grid = GridSearchCV(SVC(kernel='precomputed'), param_grid=param_grid, cv=StratifiedKFold(n_splits=10))
grid.fit(kernel, data_label)
print(grid.best_score_)

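This works fine, but since I use the full data for prediction (with grid.predict(kernel)), it overfits (I get precision/recall = 1.0 most of the time).

So I would like to first split my data into 10 chunks with cross-validation (9 for training, 1 for testing), and in each fold run GridSearch to tune C on the training set and then test on the test set.

To do this, I cut the kernel matrix into 100x100 and 50x50 submatrices, ran grid.fit() on one of them and grid.predict() on the other.

But I get the following error: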

ValueError: X.shape[1] = 50 should be equal to 100, the number of features at training time

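I guess the training kernel has to have the same dimension as the test kernel, but I don't understand why, since I am simply computing np.dot(X, X.T) for the 100x100 and the 50x50 blocks, so the resulting kernels have different dimensions.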

【Comments】:

Tags: python machine-learning scikit-learn

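【Answer 1】:

The scikit-learn documentation says:

Set kernel='precomputed' and pass the Gram matrix instead of X in the fit method. At the moment, the kernel values between all training vectors and the test vectors must be provided.

So I guess it is not possible to do (simple) cross-validation with a precomputed kernel.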
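For a single manual split, the requirement quoted from the documentation can be read as a shape rule: the matrix passed to predict needs one row per test sample and one column per training sample, which is why a 50x50 block is rejected after fitting on a 100x100 block. The following is only a minimal sketch under stated assumptions: the data is synthetic, the kernel is a plain linear Gram matrix, and the 100/50 split simply mirrors the sizes mentioned in the question.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic data (an assumption for illustration, not the asker's data).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Full linear Gram matrix over all 150 samples.
K = X @ X.T

train_idx = np.arange(100)
test_idx = np.arange(100, 150)

# Training block: kernel values between training samples, shape (100, 100).
K_train = K[np.ix_(train_idx, train_idx)]
# Test block: kernel values between test rows and training columns, shape (50, 100).
K_test = K[np.ix_(test_idx, train_idx)]

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y[train_idx])
y_pred = clf.predict(K_test)
test_accuracy = clf.score(K_test, y[test_idx])
```

The test block is taken from the full matrix rather than computed from the test samples alone, because each prediction needs kernel values against the training samples.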

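【Comments】:

【Answer 2】:

Writing a custom grid search is fairly straightforward, but as far as I know, six years later there is still no built-in way to do this in sklearn. Here is a simple snippet that helped me tune the C parameter: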

import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.svm import SVC

def precomputed_kernel_GridSearchCV(K, y, Cs, n_splits=5, test_size=0.2, random_state=42):
    """Grid search over C for an SVM with a precomputed kernel.

    K (np.ndarray) : precomputed square kernel matrix, shape (n, n)
    y (np.ndarray) : labels, length n
    Cs (iterable) : values of C to try
    return: the value of C with the best mean validation accuracy
    """
    y = np.asarray(y)  # allow index-array slicing even if y is a list
    n = K.shape[0]
    assert len(K.shape) == 2
    assert K.shape[1] == n
    assert len(y) == n
    
    best_score = float('-inf')
    best_C = None
 
    indices = np.arange(n)
    
    for C in Cs:
        # For each value of C, evaluate on repeated random train/test splits
        # (ShuffleSplit) and use the mean score over the splits.
        scores = []
        ss = ShuffleSplit(n_splits=n_splits, test_size=test_size, random_state=random_state)
        for train_index, test_index in ss.split(indices):
            K_train = K[np.ix_(train_index,train_index)]
            K_test = K[np.ix_(test_index, train_index)]
            y_train = y[train_index]
            y_test = y[test_index]
            svc = SVC(kernel='precomputed', C=C)
            svc.fit(K_train, y_train)
            scores.append(svc.score(K_test, y_test))
        if np.mean(scores) > best_score:
            best_score = np.mean(scores)
            best_C = C
    return best_C
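The nested scheme the question asks for can be sketched with the same slicing idea: an outer loop holds out a test block, C is tuned only on the training block, and the refit model is scored on the held-out rows against the training columns. This is a hedged sketch, not a definitive recipe. The data is synthetic, the fold counts and the C range are arbitrary choices, and the inner search uses GridSearchCV on the square training block (the setup the question reports as working) rather than the function above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic data and a linear Gram matrix (assumptions for illustration).
X, y = make_classification(n_samples=150, n_features=10, random_state=0)
K = X @ X.T

param_grid = {"C": 10.0 ** np.arange(-2, 3)}
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

outer_scores = []
for train_idx, test_idx in outer_cv.split(K, y):
    # Square training block and rectangular (test x train) block.
    K_train = K[np.ix_(train_idx, train_idx)]
    K_test = K[np.ix_(test_idx, train_idx)]

    # Inner search: tune C using the training block only.
    inner = GridSearchCV(
        SVC(kernel="precomputed"),
        param_grid=param_grid,
        cv=StratifiedKFold(n_splits=3),
    )
    inner.fit(K_train, y[train_idx])

    # Score the refit best model on the held-out block.
    outer_scores.append(inner.score(K_test, y[test_idx]))

mean_score = float(np.mean(outer_scores))
```

Because the test rows never enter the inner search, the mean of outer_scores estimates out-of-sample accuracy rather than the fit on the full data.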

【Comments】: