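【Question Title】: Save intermediate results of the sklearn pipeline
【Posted】: 2020-02-06 16:11:13
【Question Description】:

I have a code example: an sklearn pipeline with two components (PCA and random forest), and I would like to use the pipeline's intermediate results to gain some explainability. I know that .get_params() can be used to inspect the intermediate steps, but is it possible to save or extract the intermediate results for further operations? I want to apply additional PCA functionality (sections 1.1 and 1.2 in the code).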

from sklearn.datasets import load_breast_cancer
import numpy as np
import pandas as pd
from sklearn.decomposition import FastICA, PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

#Convert the dataset to data frame
cancer = load_breast_cancer()     
data = np.c_[cancer.data, cancer.target]
columns = np.append(cancer.feature_names, ["target"])
df = pd.DataFrame(data, columns=columns)


#Split data into train and test 
X = df.iloc[:, 0:30].values
Y = df.iloc[:, 30].values
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)


#Create a pipeline 
n_comp = 12
clf = Pipeline([('pca', PCA(n_comp)), ('RandomForest', RandomForestClassifier(n_estimators=100))])
clf.fit(X_train, Y_train)


#Evaluate the pipeline 
Y_pred = clf.predict(X_test)  # predictions were never computed in the original snippet
cr = classification_report(Y_test, Y_pred)
print(cr)


#see the intermediate steps of the pipeline
print(clf.get_params()['pca'])


##1.1 if I create PCA outside of the pipeline 
pca = PCA(n_components=10)
principalComponents = pca.fit_transform(X)

##1.2 some explainability on pca outside of the pipeline 
pca.explained_variance_ratio_

【Question Comments】:

    Tags: python scikit-learn pipeline


    【Solution 1】:

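    We can assign the 'pca' entry returned by get_params() to a variable, which should be an object of type sklearn.decomposition.pca.PCA. That gives us access to all of the decomposition's methods and attributes.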
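
As a quick check of that claim, the sketch below compares the object returned by get_params() with the one exposed through the pipeline's named_steps attribute. It assumes the default memory=None (so steps are fitted in place rather than cloned); the variable names and the random_state value are illustrative and not part of the original answer.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# Same two-step pipeline as in the question (random_state added for repeatability).
pipe = Pipeline([
    ("pca", PCA(n_components=12)),
    ("RandomForest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
pipe.fit(X, y)

# Two ways to reach the fitted PCA step.
pca_from_params = pipe.get_params()["pca"]
pca_from_steps = pipe.named_steps["pca"]

# Both names refer to the very same fitted estimator object.
same_object = pca_from_params is pca_from_steps
print(same_object)

# Fitted attributes are therefore available through either reference.
print(pca_from_steps.explained_variance_ratio_.shape)
```

Either access path works; named_steps may read more clearly because it only exposes the steps, whereas get_params() also returns every hyperparameter of every step.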

    from sklearn.datasets import load_breast_cancer
    import numpy as np
    import pandas as pd
    from sklearn.decomposition import FastICA, PCA
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report
    from sklearn.metrics import confusion_matrix
    
    #Convert the dataset to data frame
    cancer = load_breast_cancer()     
    data = np.c_[cancer.data, cancer.target]
    columns = np.append(cancer.feature_names, ["target"])
    df = pd.DataFrame(data, columns=columns)
    
    
    #Split data into train and test 
    X = df.iloc[:, 0:30].values
    Y = df.iloc[:, 30].values
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.25, random_state = 0)
    
    
    #Create a pipeline 
    n_comp = 12
    clf = Pipeline([('pca', PCA(n_comp)), ('RandomForest', RandomForestClassifier(n_estimators=100))])
    clf.fit(X_train, Y_train)
    
    
    ### --- ###
    pca = clf.get_params()['pca']
    
    type(pca)
    #sklearn.decomposition.pca.PCA
    
    pca.explained_variance_ratio_
    #array([9.81327198e-01, 1.67333696e-02, 1.73934848e-03, 1.05758996e-04,
    #       8.29268494e-05, 6.34081771e-06, 3.75309113e-06, 7.08990845e-07,
    #       3.16742542e-07, 1.75055859e-07, 7.11274270e-08, 1.43003803e-08])
    
    pca.components_.shape
    #(12, 30)
    

    Hope this helps.
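
If the goal is the intermediate data itself (the PCA-transformed features that the random forest receives) rather than only the fitted PCA object, one possible sketch follows. It assumes scikit-learn 0.21 or later, where a Pipeline can be sliced, and it uses joblib to persist the fitted step. The file name pca_step.joblib, the temporary directory, and the random_state value are hypothetical choices made for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

clf = Pipeline([
    ("pca", PCA(n_components=12)),
    ("RandomForest", RandomForestClassifier(n_estimators=100, random_state=0)),
])
clf.fit(X_train, y_train)

# clf[:-1] is a sub-pipeline holding every step except the final estimator,
# so its transform() returns exactly what the random forest sees as input.
X_test_pca = clf[:-1].transform(X_test)
print(X_test_pca.shape)

# Persist the fitted PCA step so it can be reused outside the pipeline.
out_dir = tempfile.mkdtemp()
pca_path = os.path.join(out_dir, "pca_step.joblib")  # hypothetical file name
joblib.dump(clf.named_steps["pca"], pca_path)

# Reload it later and confirm it reproduces the same intermediate result.
restored_pca = joblib.load(pca_path)
matches = np.allclose(restored_pca.transform(X_test), X_test_pca)
print(matches)
```

With more than one preprocessing step, the same slice would apply all of them in order, which is the main reason to prefer it over calling each step's transform() by hand.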

    【Comments】:
