Name of variables in sklearn pipeline (answers)

【Question title】: Name of variables in sklearn pipeline
【Posted】: 2021-09-21 22:46:06
【Question description】:

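I need to use the DecisionTreeClassifier from the sklearn library. My dataset has several columns that I have to dummy-encode. My problem is that the resulting model shows the non-descriptive variable names feature_1, feature_2, ..., feature_n. How can I give them their real names? My dataset has about 400 columns, so renaming them by hand is not a practical approach. Thank you.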

import pandas as pd

from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split, cross_val_score
from yellowbrick.model_selection import RFECV


raw_data = {'sum': [2345, 256,  43, 643, 34 , 23, 95], 
        'department': ['a1', 'a1', 'a3', 'a3', 'a1', 'a2', 'a2'],
        'sex': ['m', 'neudane', 'f', '', 'f', 'f', 'f']}
df = pd.DataFrame(raw_data, columns = ['sum', 'department', 'sex'])

y = {'y': ['cat_a', 'cat_a', 'cat_b', 'cat_c', 'cat_b', 'cat_a', 'cat_a']}

y = pd.DataFrame(y, columns = ['y'])


categorical = ['department', 'sex']

numerical = ['sum']


X = df[categorical + numerical]


categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))  # sparse output is the default; the `sparse` keyword no longer exists in recent sklearn
])

numerical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])



basic_preprocessor = ColumnTransformer([
    #("nominal_preprocessor", nominal_pipeline, nominal),
    ("categorical_preprocessor", categorical_pipeline, categorical),
    ("numerical_preprocessor", numerical_pipeline, numerical)
])


preprocessed = basic_preprocessor.fit_transform(X)


X = preprocessed


from sklearn.model_selection import train_test_split
train, test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

from sklearn import tree
from sklearn.tree import export_text
clf = tree.DecisionTreeClassifier()
clf = clf.fit(train, y_train)


r = export_text(clf)
print(r)



>>>r = export_text(clf)
>>>print(r)
|--- feature_1 <= 0.50
|   |--- feature_7 <= -0.19
|   |   |--- class: cat_b
|   |--- feature_7 >  -0.19
|   |   |--- class: cat_c
|--- feature_1 >  0.50
|   |--- class: cat_a

【Question comments】:

Tags: python scikit-learn sklearn-pandas

【Solution 1】:

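Two key pieces make this work. The first is getting the encoded names from the OneHotEncoder: OneHotEncoder.get_feature_names_out. Specifically, you call it on the fitted encoder as encoder.get_feature_names_out(). The second is that sklearn.tree.export_text accepts a feature_names argument, so you can pass the extracted names straight to the display function. The other sklearn tree displays (plot_tree, export_graphviz) take that argument too.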

For a related SO question, see here:

The sklearn documentation is here:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
https://scikit-learn.org/stable/modules/classes.html#module-sklearn.tree (follow the links there to the tree export/plotting functions).

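The following should work for you (edit: I forgot the pipeline part in my example. You can extract the OneHotEncoder with my_pipe.named_steps[step_name]. You may have to nest that lookup, since you have nested pipelines. I added an example below.):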

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

import sklearn
print(sklearn.__version__)  # ---> 1.0.2 for me

ftrs = pd.DataFrame({'Sex'     : ['male', 'female']*3, 
                     'AgeGroup': ['0-20', '0-20', 
                                  '20-60', '20-60',
                                  '80+', '80+']})
tgt  = np.array([1, 1, 1, 1, 0, 1])
encoder = OneHotEncoder()
enc_ftrs = encoder.fit_transform(ftrs)
dtc = DecisionTreeClassifier().fit(enc_ftrs, tgt)

encoder_names = encoder.get_feature_names_out()
print(export_text(dtc, feature_names = list(encoder_names)))

For me, this gives the following output:

|--- AgeGroup_80+ <= 0.50
|   |--- class: 1
|--- AgeGroup_80+ >  0.50
|   |--- Sex_female <= 0.50
|   |   |--- class: 0
|   |--- Sex_female >  0.50
|   |   |--- class: 1

With a pipeline included, it looks like this:

from sklearn.pipeline import Pipeline
pipe = Pipeline([('enc', OneHotEncoder()),
                 ('dtc', DecisionTreeClassifier())])
pipe.fit(ftrs, tgt)
feature_names = list(pipe.named_steps['enc'].get_feature_names_out())
print(export_text(pipe.named_steps['dtc'],
                  feature_names = feature_names))

The output is the same.

【Comments】: