SVR hyperparameter selection and visualisation: answers

【Question title】: SVR hyperparameter selection and visualisation
【Posted】: 2020-10-29 18:35:09
【Question】:

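I am just a beginner in data analysis. I want to use the cross-validated grid-search method to determine the parameters gamma and C of an SVM with a radial basis function (RBF) kernel. I don't know where in this code I should put my data, or which data I should use (training or target data).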

For SVR:

import numpy as np
import pandas as pd
from math import sqrt
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
from sklearn.ensemble import AdaBoostRegressor
from sklearn.metrics import mean_squared_error,explained_variance_score
from TwoStageTrAdaBoostR2 import TwoStageTrAdaBoostR2 # import the two-stage algorithm
from sklearn import preprocessing
from sklearn import svm
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from matplotlib.colors import Normalize
from sklearn.svm import SVC

# Data import (source)
source= pd.read_csv(sourcedata)

# Data import (target)
data= pd.read_csv(targetdata)

# Sample Size
datatrain = data.sample(n=60, random_state=1)
datatest = data[~data.index.isin(datatrain.index)]  # target rows not sampled for training

# Merge training set data (source and target)
train = pd.concat([source, datatrain], sort=False)
train.reset_index(inplace=True, drop=True)
datatest.reset_index(inplace=True, drop=True)

# Variable input
X_train, y_train = train[['x1', 'x2']].values, train['y'].values
X_test, y_test = datatest[['x1', 'x2']].values, datatest['y'].values

# Parameter setting
#sample_size = [n_source1+n_source2+n_source3+n_source4+n_source5, n_target_train]
n_estimators = 100
steps = 8
fold = 5
random_state = np.random.RandomState(1)
sample_size = [350, 60]

#1  twostage tradaboost.r2
regr_1 = TwoStageTrAdaBoostR2(SVR(C=50, gamma='auto'),
                      n_estimators = n_estimators, sample_size = sample_size,
                      steps = steps, fold = fold,
                      random_state = random_state)
regr_1.fit(X_train, y_train)
y_pred1 = regr_1.predict(X_test)
print("RMSE of regular two stage trAdaboostR2--model1:",sqrt(mean_squared_error(y_test, y_pred1)))


#Plot the results
plt.figure()
plt.scatter(y_test, y_test-y_pred1, c="black", label="TwoStageTrAdaBoostR2_model1", s=10)
plt.xlabel("CAR")
plt.ylabel("Err")
plt.title("Two-stage Transfer Learning Boosted Decision Tree Regression", loc='left', fontsize=12, fontweight=0, color="orange")
plt.legend()
plt.show()

For the cross-validation grid search (best parameters):

# Cross validation grid search (best parameters) 
parameter_candidates = [
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['linear']},
  {'C': [1, 10, 100, 1000], 'gamma': [0.001, 0.0001], 'kernel': ['rbf']},
]
svr = svm.SVR()  # regression, so SVR rather than SVC
clf = GridSearchCV(svr, parameter_candidates, cv=5, n_jobs=-1)
clf.fit(X_train, y_train)
print('Best score for data:', clf.best_score_)
print('Best C:',clf.best_estimator_.C) 
print('Best Kernel:',clf.best_estimator_.kernel)
print('Best Gamma:',clf.best_estimator_.gamma)

For visualising the effect of the parameters:

c_range = np.logspace(-2, 2, 4)
gamma_range = np.logspace(-2, 2, 5)
tuned_parameters = [{'kernel': ['rbf'],'C': c_range,'gamma':gamma_range},
                    {'kernel': ['linear'], 'C': c_range,'gamma':gamma_range}]

svr = svm.SVR()
clf = GridSearchCV(svr,param_grid=tuned_parameters,verbose=2,n_jobs=-1,
                   scoring='explained_variance')
clf.fit(X_train, y_train)

print('Best score for data:', clf.best_score_)
print('Best C:',clf.best_estimator_.C) 
print('Best Kernel:',clf.best_estimator_.kernel)
print('Best Gamma:',clf.best_estimator_.gamma)

# scores for rbf kernel
n = len(gamma_range)*len(c_range)
scores_rbf = clf.cv_results_['mean_test_score'][:n].reshape(len(gamma_range),
                                                            len(c_range))

# scores for linear kernel
scores_linear = clf.cv_results_['mean_test_score'][n:].reshape(len(gamma_range),
                                                               len(c_range))

class MidpointNormalize(Normalize):

    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))



plt.figure(figsize=(8, 6))
plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
plt.imshow(scores_rbf, interpolation='nearest', cmap=plt.cm.hot,
           norm=MidpointNormalize(vmin=0.2, midpoint=0.92))
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()
plt.xticks(np.arange(len(gamma_range)), gamma_range, rotation=45)
plt.yticks(np.arange(len(c_range)), c_range)
plt.title('Validation accuracy')
plt.show()

When I run this code, I get the heatmap below!

But I am trying to get a heatmap like this one.

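【Comments】:

Hi, welcome to SO. Could you add more context to your question, along with what you have tried so far? For example, is this for regression or classification? Are you using a standard dataset such as iris, or your own data?
Thanks. I am using my own data for regression (support vector regression). I split my data into (x_train, y_train) and (x_test, y_test), and now I am having trouble determining the parameters to start the analysis with. In the end, I want to plot a heatmap of C and gamma, as shown in the attached link.
Could you update your question with the SVR fit and the corresponding results? You should fit on the training set and use some typical SVR parameter values, e.g. svr = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=.1), followed by svr.fit(X_train,y_train). That will help us work out where the problem lies, since you are asking where in the code your data should go. Also, if you have started on the grid search, could you post that as well?
I have no problem with SVR itself, but if I can determine the best parameters, why would I need to use the suggested ones, svr = SVR(kernel='rbf', C=100, gamma=0.1, epsilon=.1)?
Using tuned_parameters = [{'kernel': ['rbf'],'C': [10, 100]},{'kernel': ['linear'], 'C': [10, 100],'epsilon': [1e-3, 1e-4]}] and svr = svm.SVR(), clf = GridSearchCV(svr,param_grid=tuned_parameters,verbose=2,n_jobs=-1,cv=5,scoring='explained_variance'), clf.fit(X_train, y_train), I got some results from GridSearchCV. You can of course increase the granularity of the model parameters. Could you try these and, if it doesn't work, update the question with the error message? Once you have the best fit, you can run the model on the test set to see how it behaves on unseen data.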
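The workflow described in the last comment can be sketched as follows. This is a minimal sketch, not the asker's actual setup: the data are a synthetic stand-in generated with make_regression (an assumption, chosen only so the example runs), while the parameter grid, cv=5 and the scoring are taken from the comment. It shows where the data go: the training split is passed to fit, and the test split is used only once, at the end.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Synthetic stand-in for the asker's data (assumption: two features, one target)
X, y = make_regression(n_samples=200, n_features=2, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grid taken from the comment above
tuned_parameters = [
    {'kernel': ['rbf'], 'C': [10, 100]},
    {'kernel': ['linear'], 'C': [10, 100], 'epsilon': [1e-3, 1e-4]},
]

clf = GridSearchCV(SVR(), param_grid=tuned_parameters, cv=5,
                   scoring='explained_variance')
clf.fit(X_train, y_train)  # the search only ever sees the training split

print('Best parameters:', clf.best_params_)
print('Best cross-validated score:', clf.best_score_)

# The refitted best model is scored once on the held-out test split,
# using the same explained-variance scorer as the search
test_score = clf.score(X_test, y_test)
print('Test score:', test_score)
```

The cross-validation folds are carved out of X_train internally, so no separate validation set has to be supplied.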

Tags: scikit-learn data-visualization svm data-analysis grid-search

【Answer 1】:

The following code, which uses some typical regression data, should run all the way through:

import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV,train_test_split
from matplotlib.colors import Normalize

class MidpointNormalize(Normalize):

    def __init__(self, vmin=None, vmax=None, midpoint=None, clip=False):
        self.midpoint = midpoint
        Normalize.__init__(self, vmin, vmax, clip)

    def __call__(self, value, clip=None):
        x, y = [self.vmin, self.midpoint, self.vmax], [0, 0.5, 1]
        return np.ma.masked_array(np.interp(value, x, y))
    

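# Note: load_boston was removed in scikit-learn 1.2; on newer versions substitute
# another regression dataset, e.g. datasets.load_diabetes(return_X_y=True)
X, y = datasets.load_boston(return_X_y=True)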
X_train, X_test, y_train, y_test = train_test_split(X,y)

# Cross validation grid search (best parameters) 
c_range = np.logspace(-0, 4, 8)
gamma_range = np.logspace(-4, 0, 8)
tuned_parameters = [{'kernel': ['rbf'],'C': c_range,'gamma':gamma_range},
                    {'kernel': ['linear'], 'C': c_range,'gamma':gamma_range}]

svr = svm.SVR()
clf = GridSearchCV(svr,param_grid=tuned_parameters,verbose=20,n_jobs=-4,cv=4,
                   scoring='explained_variance')
clf.fit(X_train, y_train)

print('Best score for data:', clf.best_score_)
print('Best C:',clf.best_estimator_.C) 
print('Best Kernel:',clf.best_estimator_.kernel)
print('Best Gamma:',clf.best_estimator_.gamma)

# scores for rbf kernel
# (results are listed with C varying slowest, so rows = C and columns = gamma)
n = len(gamma_range)*len(c_range)
scores_rbf = clf.cv_results_['mean_test_score'][:n].reshape(len(c_range),
                                                            len(gamma_range))

# scores for linear kernel
scores_linear = clf.cv_results_['mean_test_score'][n:].reshape(len(c_range),
                                                               len(gamma_range))


plt.figure(figsize=(8, 6))
plt.subplots_adjust(left=.2, right=0.95, bottom=0.15, top=0.95)
plt.imshow(scores_rbf, interpolation='nearest', cmap=plt.cm.hot,
           norm=MidpointNormalize(vmin=-.2, midpoint=0.5))
plt.xlabel('gamma')
plt.ylabel('C')
plt.colorbar()
plt.xticks(np.arange(len(gamma_range)),
           [np.format_float_scientific(i,1) for i in gamma_range],rotation=45)
plt.yticks(np.arange(len(c_range)), 
           [np.format_float_scientific(i,) for i in c_range])
plt.title('Validation accuracy')
plt.show()

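The grid is deliberately coarse; otherwise the run would take quite a while. The limits of the grid also likely need to extend beyond the ones I chose.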

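I'm not sure why you were getting errors, but I kept things simple and instantiated the SVR once in my snippet so you can see how it works. I also used different lengths for the C and gamma arrays, just to show how these parameters are passed. Sometimes I find that when everything has the same length, it's hard to see which parameter is responsible for what.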
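To see which parameter is responsible for which axis, it can help to inspect the order in which the grid is enumerated. The sketch below uses ParameterGrid from scikit-learn with deliberately unequal lengths (3 values of C, 4 of gamma; both ranges are made up for illustration). To my understanding the keys are iterated in sorted order, so C changes slowest and gamma faster; the code checks this rather than taking it for granted.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Illustrative ranges with different lengths (assumption, not the asker's values)
c_range = np.logspace(0, 2, 3)       # 3 values of C
gamma_range = np.logspace(-3, 0, 4)  # 4 values of gamma

grid = list(ParameterGrid({'kernel': ['rbf'], 'C': c_range, 'gamma': gamma_range}))

# Print the first few combinations to see which parameter varies fastest
for params in grid[:5]:
    print('C =', params['C'], ' gamma =', params['gamma'])

# Reshape the flat list with C as rows and gamma as columns
shape = (len(c_range), len(gamma_range))
C_grid = np.array([p['C'] for p in grid]).reshape(shape)
gamma_grid = np.array([p['gamma'] for p in grid]).reshape(shape)

# Each row holds a single C value, each column a single gamma value
print(C_grid)
print(gamma_grid)
```

If this ordering holds, a flat mean_test_score array for the rbf block needs reshape(len(c_range), len(gamma_range)), with C on the y-axis of imshow and gamma on the x-axis.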

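The final plot looks like this, but it depends heavily on the range of the grid, its granularity, and the dataset you are using. Also note that I changed the arguments of the MidpointNormalize class you provided.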
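As an aside, recent matplotlib versions ship TwoSlopeNorm, which maps vmin, a centre value and vmax to 0, 0.5 and 1 in much the same way as the hand-written MidpointNormalize. The sketch below is a swapped-in alternative, not what the answer uses; the score matrix is random placeholder data and the vmax of 1.0 is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import TwoSlopeNorm

# Placeholder score grid (made up): rows = C, columns = gamma
rng = np.random.default_rng(0)
scores = rng.uniform(-0.2, 0.9, size=(4, 5))

# Same idea as MidpointNormalize(vmin=-.2, midpoint=0.5), with an explicit vmax
norm = TwoSlopeNorm(vmin=-0.2, vcenter=0.5, vmax=1.0)

fig, ax = plt.subplots(figsize=(8, 6))
im = ax.imshow(scores, interpolation='nearest', cmap='hot', norm=norm)
ax.set_xlabel('gamma')
ax.set_ylabel('C')
fig.colorbar(im, ax=ax)
# In interactive use, remove the Agg line above and call plt.show() here
```

Changing vcenter shifts which score sits at the middle of the colour map, which is what makes small differences near the best scores visible.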

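【Comments】:

Thank you very much for your help and for your suggestions. After using the code you recommended, I get this error on this line: ValueError: cannot reshape array of size 32 into shape (13,13) scores = clf.cv_results_['mean_test_score'].reshape(len(gamma_range), len(c_range)) I used C_range = np.logspace(-2, 10, 13) gamma_range = np.logspace(-9, 3, 13)
That probably means the fit was not updated with your input ranges; clf.cv_results_['mean_test_score'] should have size 13x13x2. Are you still using the tuned_parameters above? Note that your C_range has a capital C, whereas in my code it is c_range.
You are right, I changed c_range because I was getting the heatmap shown in the question above :( Your code runs fine, but I think there is something wrong with the final visualisation part, the heatmap plot!? plt.imshow(scores_rbf,origin='lower',cmap=plt.cm.hot) plt.xlabel('C') plt.ylabel('gamma') plt.yticks(np.arange(len(gamma_range)), gamma_range) plt.xticks(np.arange(len(c_range)), np.round(c_range,2), rotation=45) plt.colorbar()
I have edited my answer to try to cover your query about the plot.
Got it, I am still working on finding the best parameters. Thank you very much for your help :D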
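The reshape error discussed in these comments comes from the flat score array not matching the assumed grid shape. One way to sidestep it, sketched here under assumptions (synthetic data from make_regression, made-up ranges, cv=3 to keep it quick), is to fill the score matrix from cv_results_['params'] instead of reshaping, so each score is placed by the C and gamma values it was actually computed for.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in data (assumption)
X, y = make_regression(n_samples=120, n_features=2, noise=5.0, random_state=0)

c_range = np.logspace(0, 2, 3)
gamma_range = np.logspace(-3, 0, 4)
param_grid = [{'kernel': ['rbf'], 'C': c_range, 'gamma': gamma_range},
              {'kernel': ['linear'], 'C': c_range}]  # gamma is not used by the linear kernel

clf = GridSearchCV(SVR(), param_grid=param_grid, cv=3,
                   scoring='explained_variance')
clf.fit(X, y)

# Total number of candidates: 3*4 for rbf plus 3 for linear
print(len(clf.cv_results_['mean_test_score']))

# Place every rbf score by its own parameter values: rows = C, columns = gamma
c_index = {c: i for i, c in enumerate(c_range)}
gamma_index = {g: j for j, g in enumerate(gamma_range)}
scores_rbf = np.full((len(c_range), len(gamma_range)), np.nan)
for params, score in zip(clf.cv_results_['params'],
                         clf.cv_results_['mean_test_score']):
    if params['kernel'] == 'rbf':
        scores_rbf[c_index[params['C']], gamma_index[params['gamma']]] = score

print(scores_rbf)
```

Any cell left as NaN would signal that the fitted grid and the ranges used for plotting are out of sync, which is the situation behind the ValueError above.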