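User defined SVM kernel with scikit-learn: answers

【Question title】: User defined SVM kernel with scikit-learn
【Posted】: 2015-10-14 11:34:50
【Question】:

I have a problem defining my own kernel in scikit-learn. I defined the Gaussian kernel myself and I am able to fit the SVM, but I cannot use it to make predictions.

More precisely, I have the following code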

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.utils import shuffle
import scipy.sparse as sparse
import numpy as np


digits = load_digits(2)
X, y = shuffle(digits.data, digits.target)

gamma = 1.0


X_train, X_test = X[:100, :], X[100:, :]
y_train, y_test = y[:100], y[100:]

m1 = SVC(kernel='rbf',gamma=1)
m1.fit(X_train, y_train)
m1.predict(X_test)

def my_kernel(x,y):
    d = x - y
    c = np.dot(d,d.T)
    return np.exp(-gamma*c)

m2 = SVC(kernel=my_kernel)
m2.fit(X_train, y_train)
m2.predict(X_test)

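m1 and m2 should be the same, but m2.predict(X_test) raises the error:

operands could not be broadcast together with shapes (260,64) (100,64)

I don't understand the problem.

Also, if x is a single data point, m1.predict(x) gives a +1/-1 result as expected, but m2.predict(x) gives an array of +1/-1... I have no idea why.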

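【Comments】:

Your kernel function is wrong.
Can you be more precise? The kernel function I want to use is K(x,y) = exp(- gamma * || x-y||^2), what is wrong with it?
I mean that your implementation of my_kernel is wrong, as the answer shows.

Tags: python-2.7 machine-learning scikit-learn svm libsvm

【Solution 1】: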

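The error is at the x - y line. You cannot subtract the two like that: SVC passes whole data matrices (one sample per row) to the kernel, and their first dimensions may not be equal. Here is how the rbf kernel is implemented in scikit-learn, taken from the scikit-learn source (only the essentials are kept):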

def row_norms(X, squared=False):

    if issparse(X):
        norms = csr_row_norms(X)
    else:
        norms = np.einsum('ij,ij->i', X, X)

    if not squared:
        np.sqrt(norms, norms)
    return norms

def euclidean_distances(X, Y=None, Y_norm_squared=None, squared=False):
    """
    Considering the rows of X (and Y=X) as vectors, compute the
    distance matrix between each pair of vectors.

    [...]


    Returns
    -------
    distances : {array, sparse matrix}, shape (n_samples_1, n_samples_2)
    """
    X, Y = check_pairwise_arrays(X, Y)

    if Y_norm_squared is not None:
        YY = check_array(Y_norm_squared)
        if YY.shape != (1, Y.shape[0]):
            raise ValueError(
                "Incompatible dimensions for Y and Y_norm_squared")
    else:
        YY = row_norms(Y, squared=True)[np.newaxis, :]

    if X is Y:  # shortcut in the common case euclidean_distances(X, X)
        XX = YY.T
    else:
        XX = row_norms(X, squared=True)[:, np.newaxis]

    distances = safe_sparse_dot(X, Y.T, dense_output=True)
    distances *= -2
    distances += XX
    distances += YY
    np.maximum(distances, 0, out=distances)

    if X is Y:
        # Ensure that distances between vectors and themselves are set to 0.0.
        # This may not be the case due to floating point rounding errors.
        distances.flat[::distances.shape[0] + 1] = 0.0

    return distances if squared else np.sqrt(distances, out=distances)

def rbf_kernel(X, Y=None, gamma=None):

    X, Y = check_pairwise_arrays(X, Y)
    if gamma is None:
        gamma = 1.0 / X.shape[1]

    K = euclidean_distances(X, Y, squared=True)
    K *= -gamma
    np.exp(K, K)    # exponentiate K in-place
    return K
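
The vectorised code above avoids forming x - y explicitly by expanding the squared distance. As a reading aid (x_i and y_j are notation introduced here for the rows of X and Y, they do not appear in the source), the lines that build XX, YY and the -2 * X Y^T term correspond to:

```latex
\|x_i - y_j\|^2 = \|x_i\|^2 - 2\, x_i^\top y_j + \|y_j\|^2,
\qquad
K_{ij} = \exp\!\left(-\gamma \, \|x_i - y_j\|^2\right)
```

The first term depends only on the row of X and the last only on the row of Y, which is why they can be added by broadcasting a column vector and a row vector onto the (n_samples_1, n_samples_2) matrix of dot products.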

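You may want to dig deeper into the code, but do look at the comments of the euclidean_distances function. A naive implementation of what you are trying to achieve looks like this: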

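def my_kernel(x, y):
    # x has shape (n_x, n_features), y has shape (n_y, n_features)
    d = np.zeros((x.shape[0], y.shape[0]))
    for i, row_x in enumerate(x):
        for j, row_y in enumerate(y):
            # squared Euclidean distance, as in K(x, y) = exp(-gamma * ||x - y||^2)
            d[i, j] = np.exp(-gamma * np.linalg.norm(row_x - row_y) ** 2)

    return d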
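
The double loop is slow for anything but small inputs. Below is a minimal vectorised sketch of the same kernel, checked against sklearn.metrics.pairwise.rbf_kernel. The names GAMMA and my_rbf_kernel and the synthetic data are assumptions made for illustration; they are not part of the original post, which used the digits dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

GAMMA = 1.0  # assumed value, mirroring gamma = 1.0 in the question


def my_rbf_kernel(X, Y):
    """Return the (n_X, n_Y) matrix of exp(-GAMMA * ||x - y||^2)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x_sq = np.einsum("ij,ij->i", X, X)[:, np.newaxis]  # shape (n_X, 1)
    y_sq = np.einsum("ij,ij->i", Y, Y)[np.newaxis, :]  # shape (1, n_Y)
    sq_dists = x_sq + y_sq - 2.0 * (X @ Y.T)           # shape (n_X, n_Y)
    np.maximum(sq_dists, 0.0, out=sq_dists)  # clip tiny negatives from rounding
    return np.exp(-GAMMA * sq_dists)


# Synthetic stand-in data (hypothetical): 360 samples, 5 features, 2 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(360, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_train, X_test = X[:100], X[100:]
y_train, y_test = y[:100], y[100:]

# The custom kernel should agree with scikit-learn's own RBF kernel.
K = my_rbf_kernel(X_test, X_train)
print(np.allclose(K, rbf_kernel(X_test, X_train, gamma=GAMMA)))

m2 = SVC(kernel=my_rbf_kernel)
m2.fit(X_train, y_train)
pred = m2.predict(X_test)
print(pred.shape)
```

Because the function returns one row per sample of its first argument, predict now yields one label per test sample instead of failing on the broadcast.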

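【Comments】:

I have been following the same line of thought as you and trying to make it actually work on a dataset. After importing euclidean_distances from sklearn's pairwise module I can't seem to get it to work... euclidean_distances(X, y) returns ValueError: Incompatible dimension for X and Y matrices.
@JulienMarrec What did you do? euclidean_distances works for me.
@JulienMarrec - what are you passing to it? The last dimension of both should be equal (same number of columns).
Thanks for your comments, I'll carry on. I just want to point out that it is really unclear whether a user-defined kernel function should take two data matrices. In particular, the examples on the web can be read as a kernel function that takes two data points as arguments and returns a real number.
@IVlad, I passed the X, y from the original post above, i.e. digits = load_digits(2); X, y = shuffle(digits.data, digits.target)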
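
On the point raised in the comments about what a callable kernel actually receives: one way to check is to record the shapes of the arguments SVC passes in. The sketch below does that with a plain linear kernel. The names seen_shapes and logging_linear_kernel and the random data are hypothetical and chosen only to mirror the 100/260 split and 64 features from the question.

```python
import numpy as np
from sklearn.svm import SVC

seen_shapes = []  # (shape of first argument, shape of second argument) per call


def logging_linear_kernel(A, B):
    """Linear kernel that records the shapes it is called with."""
    seen_shapes.append((A.shape, B.shape))
    return A @ B.T  # Gram matrix of shape (A.shape[0], B.shape[0])


# Random stand-in data (hypothetical), sized like the question's split.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 64))
y_train = np.arange(100) % 2
X_test = rng.normal(size=(260, 64))

clf = SVC(kernel=logging_linear_kernel)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

for a_shape, b_shape in seen_shapes:
    print(a_shape, b_shape)
```

With the scikit-learn versions I am aware of, fit calls the kernel on (X_train, X_train) and predict calls it on (X_test, X_train), which matches the (260,64) (100,64) shapes in the broadcast error from the question. So the callable works on matrices of samples and has to return a matrix of shape (n_samples_1, n_samples_2), not a single real number.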