Error while testing DecisionTreeClassifier in Scikit Learn with Python

[Question title]: Error while testing DecisionTreeClassifier in Scikit Learn with Python
[Posted]: 2015-04-28 22:25:57
[Question description]:

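I am reading data from a csv file; the first row contains strings and the rest are all decimals. I had to convert the data in this file from strings to decimals, and now I am trying to run a decision tree classifier on that data. I can train on the data just fine, but when I call DecisionTreeClassifier.score() I get the error message: "unknown is not supported"

Here is my code: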

import numpy as np
from sklearn import tree
from sklearn.cross_validation import KFold

# 10-fold cross-validation over the data rows (the header row is excluded)
cVal = KFold(len(file)-1, n_folds=10, shuffle=True)
for train_index, test_index in cVal:
    obfA_train, obfA_test = np.array(obfA)[train_index], np.array(obfA)[test_index]
    tTime_train, tTime_test = np.array(tTime)[train_index], np.array(tTime)[test_index]
    model = tree.DecisionTreeClassifier()
    model = model.fit(obfA_train.tolist(), tTime_train.tolist())
    print model.score(obfA_test.tolist(), tTime_test.tolist())

Earlier, I populated obfA and tTime with these lines:

tTime.append(Decimal(file[i][11].strip('"')))
obfA[i-1][j-1] = Decimal(file[i][j].strip('"'))

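So obfA is a 2-D array and tTime is 1-D. I previously tried removing the "tolist()" calls from the code above, but that did not change the error. Here is the error report it prints: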

in <module>()
---> print model.score(obfA_test.tolist(), tTime_test.tolist())

in score(self, X, y, sample_weight)
    """
    from .metrics import accuracy_score
 -->return accuracy_score(y, self.predict(X), sample_weight=sample_weight)

in accuracy_score(y_true, y_pred, normalize, sample_weight)
    # Compute accuracy for each possible representation
  ->y_type, y_true, y_pred = _check_clf_targets(y_true, y_pred)
    if y_type == 'multilabel-indicator':
        score = (y_pred != y_true).sum(axis=1) == 0

in _check_clf_targets(y_true, y_pred)
    if (y_type not in ["binary", "multiclass", "multilabel-indicator", "multilabel-sequences"]):
        -->raise ValueError("{0} is not supported".format(y_type))
    if y_type in ["binary", "multiclass"]:

ValueError: unknown is not supported

I added print statements to check the dimensions of the input arguments. This is what they printed:

obfA_test.shape: (48L, 12L)
tTime_test.shape: (48L,)

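I am confused about why the error report shows 3 required arguments for score() when the documentation lists only 2. What is the "self" argument? Can anyone help me resolve this error?

[Question discussion]:

Tags: python types tree crash scikit-learn

[Solution 1]: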

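This seems reminiscent of the error discussed here. The problem seems to stem from the data type you are using to fit and score your model. When populating your input data arrays, try using float instead of Decimal. And just so I don't leave an inaccurate answer up: you cannot use float/continuous values as the target values of a DecisionTreeClassifier. If you want to predict floats, use DecisionTreeRegressor. Otherwise, try using integers or strings as the targets (but that will probably be a diversion from the task you are trying to accomplish).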
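
To make the data-type point concrete, here is a minimal sketch that asks scikit-learn how it categorizes three kinds of target values. It uses `sklearn.utils.multiclass.type_of_target`, which I believe is the helper that produces the `y_type` seen in the traceback, though that link is an assumption on my part. The sample cell values are made up for illustration and are not from the asker's file.

```python
from decimal import Decimal

from sklearn.utils.multiclass import type_of_target

# Made-up quoted cells, standing in for values read from the csv file.
raw_cells = ['"0.5"', '"1.25"', '"2.75"']

# Same cleaning step as in the question, once with Decimal and once with float.
decimal_targets = [Decimal(cell.strip('"')) for cell in raw_cells]
float_targets = [float(cell.strip('"')) for cell in raw_cells]

# Discrete class labels, which is what a classifier expects as targets.
integer_targets = [0, 1, 2]

decimal_kind = type_of_target(decimal_targets)
float_kind = type_of_target(float_targets)
integer_kind = type_of_target(integer_targets)

print(decimal_kind, float_kind, integer_kind)
```

Decimal objects end up in an object array that scikit-learn cannot categorize, non-integer floats are treated as a regression target, and only the integer labels look like classes. That matches the two error messages reported in this thread.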

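As for the self question at the end, that is a feature of Python's syntax. When you call model.score(...), Python treats it as score(model, ...). I'm afraid I can't say much more about it right now, and it isn't needed to answer your original question. Here's an answer that better addresses that particular question.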
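
A small sketch of the `self` behaviour, using a hypothetical `TinyModel` class (not part of scikit-learn) whose `score` method has the same three-parameter shape as the one in the traceback:

```python
class TinyModel(object):
    """Hypothetical stand-in for an estimator; predicts x + offset."""

    def __init__(self, offset):
        self.offset = offset

    def score(self, X, y):
        # Fraction of predictions that exactly match the expected values.
        hits = sum(1 for x, target in zip(X, y) if x + self.offset == target)
        return hits / float(len(y))


model = TinyModel(offset=1)

# Usual call: only two arguments are written, Python supplies `model` as self.
bound_result = model.score([1, 2, 3], [2, 3, 5])

# Equivalent explicit call: all three parameters are passed by hand.
explicit_result = TinyModel.score(model, [1, 2, 3], [2, 3, 5])

print(bound_result == explicit_result)
```

So the signature in the traceback lists `self`, `X` and `y`, while the caller only ever writes `X` and `y`.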

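[Discussion]:

I changed everything from Decimal to float, but that only changed the error from "unknown is not supported" to "continuous is not supported".
It would help to understand better what you are trying to do. Could you briefly describe obfA and tTime in your main question? I just realized that you probably cannot use floats with a DecisionTreeClassifier, but depending on your task you may find DecisionTreeRegressor more useful.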

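[Solution 2]:

I realized that the problem I was having was because I was trying to use a DecisionTreeClassifier to predict continuous values, whereas classifiers can only be used to predict discrete values. I will have to switch to a regression model.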
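
For readers who land here with the same problem, a rough sketch of what the regression version of the cross-validation loop could look like. It assumes a recent scikit-learn, where `KFold` lives in `sklearn.model_selection` and takes `n_splits`, rather than the older API used in the question. The data is synthetic and only mimics the shapes in the question (12 feature columns, a continuous target); it is not the asker's data.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-ins for obfA (features) and tTime (continuous target).
rng = np.random.RandomState(0)
features = rng.rand(60, 12)
targets = 3.0 * features[:, 0] + features[:, 1]

folds = KFold(n_splits=10, shuffle=True, random_state=0)
scores = []
for train_index, test_index in folds.split(features):
    model = DecisionTreeRegressor(random_state=0)
    model.fit(features[train_index], targets[train_index])
    # For regressors, score() returns R^2 on the held-out fold, not accuracy.
    scores.append(model.score(features[test_index], targets[test_index]))

print(scores)
```

Note that the numbers printed here are R^2 values, so they are not directly comparable to the accuracy a classifier's score() would report.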

[Discussion]: