Decision tree - find how constant prediction changes as tree is traversed (answers)

【Question title】: Decision tree - find how constant prediction changes as tree is traversed
【Posted】: 2019-04-28 12:25:09
【Question】:

Suppose I have the following DecisionTreeClassifier model:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer()

X, y = bunch.data, bunch.target

model = DecisionTreeClassifier(random_state=100)
model.fit(X, y)

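I want to walk every node in this tree (leaf nodes and decision nodes) and determine how the predicted value changes as the tree is traversed. Basically, for a given sample, I want to be able to tell how the final prediction (what .predict returns) was arrived at. So perhaps the sample ends up predicted as 1, but it passes through four nodes, and at each node its "constant" prediction (the term used in the scikit-learn docs) goes from 1 to 0 to 0 to 1.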

It is not clear to me how to get this information from model.tree_.value, which is documented as:

 |  value : array of double, shape [node_count, n_outputs, max_n_classes]
 |      Contains the constant prediction value of each node.

For this model it looks like:

>>> model.tree_.value.shape
(43, 1, 2)
>>> model.tree_.value
array([[[212., 357.]],

       [[ 33., 346.]],

       [[  5., 328.]],

       [[  4., 328.]],

       [[  2., 317.]],

       [[  1.,   6.]],

       [[  1.,   0.]],

       [[  0.,   6.]],

       [[  1., 311.]],

       [[  0., 292.]],

       [[  1.,  19.]],

       [[  1.,   0.]],

       [[  0.,  19.]],

Does anyone know how I can do this? Is the class prediction for each of the 43 nodes above simply the argmax of each list? So 1, 1, 1, 1, 1, 1, 0, 1, ..., from top to bottom?

【Comments】:

Tags: python machine-learning scikit-learn

【Solution 1】:

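One solution is to walk the decision path in the tree directly. You can adapt this solution, which prints the whole decision tree as if-clauses. Here is a quick adaptation that explains a single instance: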

import numpy as np
from sklearn.tree import _tree

def tree_path(instance, values, left, right, threshold, features, node, depth):
    """Recursively describe the path one instance takes from `node` to a leaf."""
    spacer = '    ' * depth
    if threshold[node] != _tree.TREE_UNDEFINED:
        # Internal node: report the test and follow the matching branch.
        name = features[node]
        if instance[name] <= threshold[node]:
            path = f'{spacer}{name} ({round(instance[name], 2)}) <= {round(threshold[node], 2)}'
            next_node = left[node]
        else:
            path = f'{spacer}{name} ({round(instance[name], 2)}) > {round(threshold[node], 2)}'
            next_node = right[node]
        return path + '\n' + tree_path(instance, values, left, right, threshold, features, next_node, depth + 1)
    # Leaf: values[node] has shape (n_outputs, n_classes); use the single output.
    target = values[node][0]
    predicted = int(np.argmax(target))  # index of the dominant class at this leaf
    return f'{spacer}==> class index {predicted} (value = {np.round(target, 2).tolist()})'

def get_path_code(tree, feature_names, instance):
    left      = tree.tree_.children_left
    right     = tree.tree_.children_right
    threshold = tree.tree_.threshold
    # Leaves have no split feature (TREE_UNDEFINED), so map them to None.
    features  = [feature_names[i] if i != _tree.TREE_UNDEFINED else None
                 for i in tree.tree_.feature]
    values = tree.tree_.value
    return tree_path(instance, values, left, right, threshold, features, 0, 0)

# Print the decision path of the first instance of a pandas DataFrame df
# (here `tree` is the fitted classifier and `df` holds the feature columns).
print(get_path_code(tree, df.columns, df.iloc[0]))
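
As an alternative to hand-written recursion, the estimator's own `decision_path` method can list the nodes a sample visits, and the argmax of `tree_.value` at each of those nodes gives the dominant class there. The following is a minimal sketch, assuming the model from the question; `classes_along_path` is a hypothetical helper name, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=100).fit(X, y)

def classes_along_path(model, sample):
    """Hypothetical helper: (node ids, dominant class per node) for one sample."""
    sample = np.asarray(sample).reshape(1, -1)
    # decision_path returns a sparse indicator matrix (n_samples, n_nodes);
    # for a single row, .indices holds the ids of the visited nodes.
    node_ids = model.decision_path(sample).indices
    # tree_.value has shape (node_count, n_outputs, max_n_classes);
    # there is one output here, and argmax picks the dominant class.
    class_idx = model.tree_.value[node_ids, 0, :].argmax(axis=1)
    return node_ids, model.classes_[class_idx]

node_ids, path_classes = classes_along_path(model, X[0])
for node, cls in zip(node_ids, path_classes):
    print(f"node {node}: dominant class {cls}")
```

The last entry corresponds to the leaf, so its class should match what `.predict` returns for the same sample; the earlier entries show how the dominant class shifts on the way down.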

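【Comments】:

I have already written a function that does this. I am really just looking for advice on whether taking the dominant class at each node is a good strategy for reporting how the prediction changes as a particular path through the tree is traversed.
OK. Perhaps you could sum, for each class, the target counts of all the leaves beneath a node. For example, at one node you might have: class A (18 targets), class B (10 targets). That could be a clue?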
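
On the suggestion of per-class totals: for a classifier, each row of `tree_.value` already summarises the training samples that reach that node, so no manual summing over leaves should be needed. Reporting the class shares instead of only the argmax shows how confident each node on the path is. This is a rough sketch under the same assumptions as the question's model; `class_shares_along_path` is a hypothetical name. Rows are normalised because, depending on the scikit-learn version, `tree_.value` may hold counts or fractions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=100).fit(X, y)

def class_shares_along_path(model, sample):
    """Hypothetical helper: per-class share at each node visited by one sample."""
    sample = np.asarray(sample).reshape(1, -1)
    node_ids = model.decision_path(sample).indices
    values = model.tree_.value[node_ids, 0, :]
    # Normalise each row to sum to 1, so the result is the same whether
    # the installed version stores counts or fractions.
    shares = values / values.sum(axis=1, keepdims=True)
    return node_ids, shares

node_ids, shares = class_shares_along_path(model, X[0])
for node, row in zip(node_ids, shares):
    print(f"node {node}: class shares {np.round(row, 3).tolist()}")
```

A path whose shares move from mixed to nearly pure indicates that the splits along it are separating the classes, which is more informative than a sequence of argmax labels alone.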