【Posted】: 2021-03-10 09:47:58
【Question】:
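I am using sklearn's Pipeline to chain one-hot encoding and a model, almost exactly as in this post.
After switching to Pipeline, I can no longer get the tree contributions. I get this error:
AttributeError: 'Pipeline' object has no attribute 'n_outputs_'
I tried playing with treeinterpreter's arguments, but I am stuck.
So my question is: when we use sklearn's Pipeline, is there any way to get the contributions from the trees?
Edit 2 - real data, as requested by Venkatachalam: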
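The AttributeError suggests that treeinterpreter reads the fitted attribute `n_outputs_` from whatever object it is handed. A minimal sketch, assuming made-up toy data (`toy_X`, `toy_y` and `toy_pipe` are hypothetical names, not part of my real code), shows that the Pipeline wrapper does not expose that attribute while its final step, the fitted forest, does:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, for illustration only.
toy_X = pd.DataFrame({"B": ["a", "b", "c"], "C": ["d", "SOC", "AA"]})
toy_y = [1, 0, 8]

toy_pipe = Pipeline([
    ("preprocess", ColumnTransformer(
        [("ohe", OneHotEncoder(handle_unknown="ignore"), [1])])),
    ("rf", RandomForestRegressor(n_estimators=5, random_state=0)),
])
toy_pipe.fit(toy_X, toy_y)

# The Pipeline object itself does not carry the forest's fitted attributes...
pipeline_has_attr = hasattr(toy_pipe, "n_outputs_")
# ...but the last step (the fitted RandomForestRegressor) does.
forest_has_attr = hasattr(toy_pipe[-1], "n_outputs_")

print(pipeline_has_attr, forest_has_attr)
# Indexing gives the estimator; slicing gives a sub-Pipeline of the earlier steps.
print(type(toy_pipe[-1]).__name__, type(toy_pipe[:-1]).__name__)
```

So it looks like the forest, not the whole Pipeline, has to be passed to treeinterpreter, together with data that has already gone through the preprocessing steps.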
# Imports
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Data DF to train model
df = pd.DataFrame(
    [['SGOHC', 'd', 'onetwothree', 'BAN', 488.0580347, 960, 841, 82, 0.902497027, 841, 0.548155625, 0.001078211, 0.123958333, 1],
     ['ABCDEFGHIJK', 'SOC', 'CON', 'CAN', 680.84, 1638, 0, 0, 0, 0, 3.011140743, 0.007244358, 1, 0],
     ['Hello', 'AA', 'onetwothree', 'SPEAKER', 5823.230967, 2633, 1494, 338, 0.773761714, 1494, 12.70144386, 0.005743015, 0.432586403, 8]],
    columns=['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'target'])

# Define X and y
X = df.drop('target', axis=1)
y = df['target']

# Create train and test sets (useless, but for the example...)
X_train, X_validation, Y_train, Y_validation = train_test_split(X, y, test_size=0.20, random_state=1)

# Make the pipeline and model: one-hot encode column 1 ('C'), then fit a random forest
rfr = Pipeline([
    ('preprocess', ColumnTransformer([
        ('ohe', OneHotEncoder(handle_unknown='ignore'), [1])])),
    ('rf', RandomForestRegressor())])
rfr.fit(X_train, Y_train)
# The new, real data that we need to predict & explain!
new_data = pd.DataFrame(
    [['DEBTYIPL', 'de', 'onetwothreefour', 'BANAAN', 4848.0580347, 923460, 823441, 5, 0.902497027, 43, 0.548155625, 0.001078211, 0.123958333],
     ['ABCDEFGHIJK', 'SOC', 'CON', 'CAN23', 680.84, 1638, 0, 0, 0, 0, 1.011140743, 4.007244358, 1],
     ['Hello_NO', 'AAAAa', 'onetwothree', 'SPEAKER', 5823.230967, 123, 32, 22, 0.773761714, 1678, 12.70144386, 0.005743015, 0.432586403]],
    columns=['B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N'])
new_data.head()

# Predicting the values
rfr.predict(new_data)

# Now the error... the contributions:
from treeinterpreter import treeinterpreter as ti
prediction, bias, contributions = ti.predict(rfr[-1], rfr[:-1].fit_transform(new_data))
# ValueError: Number of features of the model must match the input. Model n_features is 2 and input n_features is 3
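The ValueError is consistent with `fit_transform` re-fitting the one-hot encoder on the new rows: the forest was trained on two encoded columns, while the new data contains three distinct categories. A minimal sketch on made-up toy data compares the column counts from `transform` (reusing the fitted encoder) and from a freshly fitted copy. The names `toy_train`, `toy_new` and `explain_with_pipeline` are hypothetical, the helper assumes treeinterpreter is installed, and the conversion to a dense array is only a precaution I have not confirmed to be necessary:

```python
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: two training categories, three categories at predict time.
toy_train = pd.DataFrame({"B": ["x", "y"], "C": ["d", "SOC"], "F": [1.0, 2.0]})
toy_target = [1, 0]
toy_new = pd.DataFrame({"B": ["p", "q", "r"],
                        "C": ["de", "SOC", "AAAAa"],
                        "F": [3.0, 4.0, 5.0]})

pipe = Pipeline([
    ("preprocess", ColumnTransformer(
        [("ohe", OneHotEncoder(handle_unknown="ignore"), [1])])),
    ("rf", RandomForestRegressor(n_estimators=10, random_state=0)),
])
pipe.fit(toy_train, toy_target)

preprocess = pipe[:-1]   # fitted preprocessing steps, as a sub-Pipeline
forest = pipe[-1]        # fitted RandomForestRegressor

# transform() reuses the categories learned during fit; unknown ones are ignored.
n_reused = preprocess.transform(toy_new).shape[1]
# fit_transform() on an unfitted copy learns the categories from the new rows instead.
n_refit = clone(preprocess).fit_transform(toy_new).shape[1]
print(forest.n_features_in_, n_reused, n_refit)


def explain_with_pipeline(pipeline, X):
    """Hypothetical helper: encode X with the fitted steps, then call treeinterpreter."""
    from treeinterpreter import treeinterpreter as ti  # assumed to be installed

    X_encoded = pipeline[:-1].transform(X)   # transform only, no re-fitting
    if hasattr(X_encoded, "toarray"):        # densify sparse output as a precaution
        X_encoded = X_encoded.toarray()
    return ti.predict(pipeline[-1], X_encoded)
```

With treeinterpreter available, `prediction, bias, contributions = explain_with_pipeline(pipe, toy_new)` would be the corresponding call. The contributions then refer to the encoded columns, not to the original DataFrame columns.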
【Discussion】:
Tags: python numpy scikit-learn random-forest