scikit decision tree and splits for categorical variables: answers

【Question title】: scikit decision tree and splits for categorical variables
【Posted】: 2017-01-27 07:45:40
【Question】:

Here is my code:

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn import preprocessing
import os
import subprocess

def categorical_split():
    colors = ['blue', 'green', 'yellow', 'green', 'red']
    sizes = ['small', 'large', 'medium', 'large', 'small']

    size_encoder = preprocessing.LabelEncoder()
    sizes = size_encoder.fit_transform(sizes).reshape(-1, 1)

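    # Encode colors with their own encoder so each mapping can be inverted later
    color_encoder = preprocessing.LabelEncoder()
    colors = color_encoder.fit_transform(colors).reshape(-1, 1)

    dt = DecisionTreeClassifier(random_state=99)
    dt.fit(colors, sizes)

    with open("dt.dot", 'w') as f:
        # feature_names must be a list; older releases read a bare string
        # such as 'colors' one character per feature, giving the label 'c'
        export_graphviz(dt, out_file=f,
                        feature_names=['colors'])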

    command = ["dot", "-Tpng", "dt.dot", "-o", "dt.png"]
    subprocess.check_call(command)

categorical_split()

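It produces the following decision tree:

Since decision trees in scikit-learn cannot handle categorical variables directly, I had to use LabelEncoder. In the chart we see splits like c <= 1.5. A split like that shows the categorical variable is being treated as ordinal, and that the splits preserve the order of the encoded values. If my data has no natural order, this approach is harmful. Is there a way around it? If you plan to suggest one-hot encoding, please provide an example (code) of how it would help.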

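【Question comments】:

Tags: authentication scikit-learn decision-tree ordinal

【Answer 1】:

This is actually a perfectly valid approach and should not hurt your model's performance. It does make the model a bit harder to read, though. A nice alternative is pd.get_dummies, because it takes care of the feature names for you: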

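import subprocess
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Start from the raw string labels, not the LabelEncoder output
colors = ['blue', 'green', 'yellow', 'green', 'red']
sizes = ['small', 'large', 'medium', 'large', 'small']

df = pd.DataFrame({'colors': colors})
df_encoded = pd.get_dummies(df)  # one indicator column per color, e.g. colors_blue

dt = DecisionTreeClassifier(random_state=99)
dt.fit(df_encoded, sizes)

with open("dt.dot", 'w') as f:
    export_graphviz(dt, out_file=f,
                    feature_names=list(df_encoded.columns))

command = ["dot", "-Tpng", "dt.dot", "-o", "dt.png"]
subprocess.check_call(command)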

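【Comments】:

Thanks for the reply! Do you agree with my guess that scikit-learn treats the categorical variable as ordinal and preserves the order when building the tree?
I mean that the order is preserved when choosing splits for the tree.
Maybe. I guess my point is that any single categorical value can be isolated by a pair of inequalities. You won't get groupings of your categorical values, so your tree will have more leaves than it would need with grouping, but you won't get that with one-hot encoding either... it's just an unfortunate weakness that scikit-learn does not handle categorical variables properly.
I want to construct an example with an expression like 1.5 <= c <= 2.5. Let me work on it.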
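
A minimal sketch of such an example, under made-up assumptions: four colors and a hypothetical target that is positive only for 'red', which LabelEncoder maps to the middle code 2. On the label-encoded column the tree needs two inequalities to fence off that one value, while on one-hot columns a single split is enough.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data: four colors repeated, target is 1 only for 'red'
colors = ['blue', 'green', 'red', 'yellow'] * 5
y = [1 if c == 'red' else 0 for c in colors]

# LabelEncoder sorts alphabetically: blue=0, green=1, red=2, yellow=3
codes = LabelEncoder().fit_transform(colors)
X_label = codes.reshape(-1, 1)   # single integer-coded column
X_onehot = np.eye(4)[codes]      # one indicator column per color

tree_label = DecisionTreeClassifier(random_state=0).fit(X_label, y)
tree_onehot = DecisionTreeClassifier(random_state=0).fit(X_onehot, y)

# Internal nodes have feature >= 0; leaves are marked with a negative index
label_thresholds = sorted(
    float(t) for t in tree_label.tree_.threshold[tree_label.tree_.feature >= 0]
)
print("label-encoded thresholds:", label_thresholds)
print("depth (label-encoded):", tree_label.get_depth())
print("depth (one-hot):", tree_onehot.get_depth())
```

Both trees fit this toy data perfectly, so the integer coding costs extra depth here rather than accuracy. Whether that matters on real data depends on how many categories there are and how they group.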