[Question Title]: Python 3 and Sklearn: Difficulty to use a NOT-sklearn model as a sklearn model
[Posted]: 2020-05-31 08:52:12
[Question]:

The code below works. I just have a routine that runs a cross-validation scheme on a linear model defined beforehand in sklearn. I have no problem with that part. My problem is: if I replace model=linear_model.LinearRegression() with model=RBF('multiquadric') (see lines 14 and 15 in __main__), it no longer works. So my problem is really in the class RBF, where I try to mimic a sklearn model.

If I make the replacement above, I get the following error:

  FitFailedWarning)
/home/daniel/anaconda3/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:536: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
ValueError: All arrays must be equal length.

  FitFailedWarning)

1) Should I define a scoring function in the RBF class?

2) How? I am lost. Since I inherit from BaseEstimator and RegressorMixin, I expected this to be handled internally.

3) Is anything else missing?

from sklearn import datasets
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn import model_selection
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from scipy.interpolate import Rbf
np.random.seed(0)
from sklearn.base import BaseEstimator, RegressorMixin



class RBF(BaseEstimator, RegressorMixin):
    def __init__(self,function):
        self.function=function
    def fit(self,x,y):
        self.rbf = Rbf(x, y,function=self.function)
    def predict(self,x):   
        return self.rbf(x)    


if __name__ == "__main__":
    # Load Data
    targetName='HousePrice'
    data=datasets.load_boston()
    featuresNames=list(data.feature_names)
    featuresData=data.data
    targetData = data.target
    df=pd.DataFrame(featuresData,columns=featuresNames)
    df[targetName]=targetData
    independent_variable_list=featuresNames
    dependent_variable=targetName
    X=df[independent_variable_list].values
    y=np.squeeze(df[[dependent_variable]].values)    
    # Model Definition    
    model=linear_model.LinearRegression()
    #model=RBF('multiquadric')    
    # Cross validation routine
    number_splits=5
    score_list=['neg_mean_squared_error','neg_mean_absolute_error','r2']
    kfold = model_selection.KFold(n_splits=number_splits,shuffle=True, random_state=0)
    scalar = StandardScaler()
    pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
    results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list,return_train_score=True)
    for score in score_list:
        print(score+':')
        print('Train: '+'Mean',np.mean(results['train_'+score]),'Standard Error',np.std(results['train_'+score]))
        print('Test: '+'Mean',np.mean(results['test_'+score]),'Standard Error',np.std(results['test_'+score]))

[Discussion]:

    Tags: python-3.x scikit-learn cross-validation


    [Solution 1]:

    Let's take a look at the documentation here:

    *args : arrays

    x, y, z, ..., d, where x, y, z, ... are the coordinates of the nodes and d is the array of values at the nodes

    So it takes a variable number of arguments, the last of which is, in your case, the values y. Each of the preceding arguments is one coordinate axis across all data points: the kth positional argument holds the kth coordinate of every data point (and likewise for the other arguments x, y, z, …).

    Following the documentation, your code should be:

    from sklearn import datasets
    import numpy as np
    import pandas as pd
    from sklearn import linear_model
    from sklearn import model_selection
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from scipy.interpolate import Rbf
    np.random.seed(0)
    from sklearn.base import BaseEstimator, RegressorMixin
    
    class RBF(BaseEstimator, RegressorMixin):
        def __init__(self, function):
            self.function = function

        def fit(self, X, y):
            # Unpack each feature column as a separate coordinate argument
            self.rbf = Rbf(*X.T, y, function=self.function)
            return self  # sklearn convention: fit returns the estimator

        def predict(self, X):
            return self.rbf(*X.T)
    
    
    # Load Data
    data=datasets.load_boston()
    
    X = data.data
    y = data.target
    
    
    number_splits=5
    score_list=['neg_mean_squared_error','neg_mean_absolute_error','r2']
    
    kfold = model_selection.KFold(n_splits=number_splits,shuffle=True, random_state=0)
    scalar = StandardScaler()
    
    model = RBF(function='multiquadric')
    
    pipeline = Pipeline([('transformer', scalar), ('estimator', model)])
    
    results = model_selection.cross_validate(pipeline, X, y, cv=kfold, scoring=score_list,return_train_score=True)
    
    for score in score_list:
            print(score+':')
            print('Train: '+'Mean',np.mean(results['train_'+score]),'Standard Error',np.std(results['train_'+score]))
            print('Test: '+'Mean',np.mean(results['test_'+score]),'Standard Error',np.std(results['test_'+score]))
    

    Output:

    neg_mean_squared_error:
    Train: Mean -1.552450953914355e-20 Standard Error 7.932530906290208e-21
    Test: Mean -23.007377210596463 Standard Error 4.254629143836107
    neg_mean_absolute_error:
    Train: Mean -9.398502208736061e-11 Standard Error 2.4673749061941226e-11
    Test: Mean -3.1319779583728673 Standard Error 0.2162343985534446
    r2:
    Train: Mean 1.0 Standard Error 0.0
    Test: Mean 0.7144217179633185 Standard Error 0.08526294242760363
    

    Why *X.T: as we saw, each argument corresponds to one coordinate axis across all data points, so we transpose X and then use the * operator to unpack each sub-array and pass it as a separate argument to the variadic function.
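    A minimal sketch of that unpacking on toy data (the array shapes and values here are made up purely for illustration):

    ```python
    import numpy as np
    from scipy.interpolate import Rbf

    # Toy data: 5 samples with 3 features each (arbitrary values).
    rng = np.random.default_rng(0)
    X = rng.random((5, 3))
    y = rng.random(5)

    # X.T has shape (3, 5): one row per coordinate axis. The * operator
    # unpacks those rows, so this call is Rbf(x0, x1, x2, y, ...).
    rbf = Rbf(*X.T, y, function='multiquadric')

    # With the default smooth=0, the RBF interpolant reproduces the
    # training values exactly at the training points (hence the train
    # r2 of 1.0 in the output above).
    pred = rbf(*X.T)
    print(np.allclose(pred, y))  # → True
    ```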

    It looks like the latest implementation has a mode parameter with which we can pass the N-D array directly.
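    Relatedly, newer scipy versions (1.7+) ship RBFInterpolator, which accepts the (n_samples, n_features) point array directly, with no transposing or unpacking. A minimal sketch (the kernel and epsilon values here are illustrative, not tuned):

    ```python
    import numpy as np
    from scipy.interpolate import RBFInterpolator

    rng = np.random.default_rng(0)
    X = rng.random((20, 3))   # (n_samples, n_features), passed as-is
    y = rng.random(20)

    # The 'multiquadric' kernel needs an explicit shape parameter epsilon.
    interp = RBFInterpolator(X, y, kernel='multiquadric', epsilon=1.0)

    # Evaluation also takes an (n_queries, n_features) array.
    pred = interp(X)
    print(np.allclose(pred, y, atol=1e-6))
    ```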

    [Discussion]:
