Answers: parameter "stratify" from method "train_test_split" (scikit Learn)

【Question title】: Parameter "stratify" from method "train_test_split" (scikit Learn)
【Posted】: 2016-04-22 21:31:13
【Question】:

I'm trying to use train_test_split from the scikit Learn package, but I'm having trouble with the parameter stratify. Here is the code:

from sklearn import cross_validation, datasets

# Load the iris dataset
iris = datasets.load_iris()

X = iris.data[:,:2]
y = iris.target

cross_validation.train_test_split(X,y,stratify=y)

However, I keep getting the following error:

raise TypeError("Invalid parameters passed: %s" % str(options))
TypeError: Invalid parameters passed: {'stratify': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])}

Does anyone know what is going on? Below is the function documentation.

[...]

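stratify : array-like or None (default is None)

If not None, data is split in a stratified fashion, using this as the labels array.

New in version 0.17: stratify splitting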

[...]

【Question comments】:

No, it's all solved.

Tags: split scikit-learn training-data test-data

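【Solution 1】:

Stratification in this context means that the train_test_split method returns training and test subsets that have the same proportions of class labels as the input dataset.

【Comments】:

【Solution 2】:

For my future self who comes here via Google:

train_test_split is now in model_selection, hence: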

from sklearn.model_selection import train_test_split

# given:
# features: xs
# ground truth: ys

x_train, x_test, y_train, y_test = train_test_split(xs, ys,
                                                    test_size=0.33,
                                                    random_state=0,
                                                    stratify=ys)

is the way to use it. Setting random_state is desirable for reproducibility.
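
To illustrate the reproducibility point, here is a minimal sketch that splits the same data twice with the same random_state and checks that the results match. The toy arrays xs and ys and the helper split are made up for this example; they are not from the original answer.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 12 samples, two balanced classes.
xs = np.arange(24).reshape(12, 2)
ys = np.array([0, 1] * 6)

def split(seed):
    # Stratified split with a fixed seed; returns x_train, x_test, y_train, y_test.
    return train_test_split(xs, ys, test_size=0.33, random_state=seed, stratify=ys)

first = split(0)
second = split(0)

# Same seed and same data: every returned array should match element for element.
same_seed_identical = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same_seed_identical)
```

Leaving random_state unset would instead draw a fresh shuffle on each call, so repeated runs would generally produce different subsets.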

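【Comments】:

This should be the answer :) Thanks

【Solution 3】:

Scikit-Learn is just telling you that it doesn't recognise the argument "stratify", not that you're using it incorrectly. This is because the parameter was added in version 0.17, as indicated in the documentation you quoted.

So you just need to update Scikit-Learn.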
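
One way to confirm that an outdated install is the cause is to look at the installed version. The sketch below does that; supports_stratify is a hypothetical helper written for this example (it is not part of scikit-learn), and it assumes the version string begins with numeric major and minor components.

```python
import re

import sklearn

def supports_stratify(version):
    """Return True if a version string such as '0.17.1' is at least 0.17.

    Hypothetical helper for illustration; only the leading 'MAJOR.MINOR'
    part of the string is inspected.
    """
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (0, 17)

# Report the installed version and whether it is new enough for stratify.
print(sklearn.__version__, supports_stratify(sklearn.__version__))
```

If the check prints False, upgrading the package (for example with pip install --upgrade scikit-learn) should make the parameter available.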

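【Comments】:

I'm getting the same error, even though I have version 0.21.2 of scikit-learn. scikit-learn 0.21.2 py37h2a6a0b8_0 conda-forge

【Solution 4】:

This stratify parameter makes a split so that the proportion of values in the resulting samples is the same as the proportion of values provided to the stratify parameter.

For example, if the variable y is a binary categorical variable with values 0 and 1, and there are 25% zeros and 75% ones, stratify=y will make sure that your random split has 25% 0's and 75% 1's.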
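
The 25%/75% example can be checked directly. The following is a minimal sketch with a made-up label array (25 zeros, 75 ones), a placeholder feature matrix X, and an arbitrarily chosen test_size of 0.2.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up binary labels: 25% zeros and 75% ones.
y = np.array([0] * 25 + [1] * 75)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Count how many samples of each class landed in each subset.
train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)

# Print the class proportions of the training and test subsets.
print(train_counts / train_counts.sum())
print(test_counts / test_counts.sum())
```

With these sizes both subsets can hold exactly 25% zeros. When the class counts do not divide evenly, the proportions are matched as closely as whole-sample counts allow.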

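【Comments】:

This doesn't really answer the question, but it is very useful for understanding how it works. Thanks a lot.
I'm still struggling to understand why this stratification is necessary: if there is class imbalance in the data, won't it be preserved on average when doing a random split of the data?
@HolgerBrandl It will be preserved on average; with stratify, it will be preserved for sure.
@HolgerBrandl With very small or very imbalanced datasets, it is quite possible that a random split will completely eliminate a class from one of the splits.
@HolgerBrandl Good question! Maybe we could add that, first, you must split into training and test sets using stratify. Second, to correct the imbalance you eventually need to run oversampling or undersampling on the training set. Many Sklearn classifiers have a parameter called class_weight which you can set to "balanced". Finally, for an imbalanced dataset you could also use a more appropriate metric than accuracy. Try F1 or the area under the ROC curve.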
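
The point about small, imbalanced datasets can be illustrated with a quick experiment. This sketch uses a made-up dataset of 40 samples with only 4 in the minority class, and counts over 100 seeds how often the minority class is missing from the test set, with and without stratify. The sizes, the seed range, and the helper name count_missing are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 36 majority samples and 4 minority samples.
y = np.array([0] * 36 + [1] * 4)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

def count_missing(stratify):
    """Count the seeds for which the test set contains no minority sample."""
    missing = 0
    for seed in range(100):
        _, _, _, y_test = train_test_split(
            X, y, test_size=0.25, random_state=seed, stratify=stratify
        )
        if 1 not in y_test:
            missing += 1
    return missing

missing_plain = count_missing(None)    # plain random split
missing_stratified = count_missing(y)  # stratified on the labels
print(missing_plain, missing_stratified)
```

In this setup the stratified count should be zero, since each 10-sample test set receives one of the four minority samples, while the plain random split is expected to lose the class in a noticeable share of the runs.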

【Solution 5】:

Try running this code, it "just works":

from sklearn import cross_validation, datasets 

iris = datasets.load_iris()

X = iris.data[:,:2]
y = iris.target

x_train, x_test, y_train, y_test = cross_validation.train_test_split(X,y,train_size=.8, stratify=y)

y_test

array([0, 0, 0, 0, 2, 2, 1, 0, 1, 2, 2, 0, 0, 1, 0, 1, 1, 2, 1, 2, 0, 2, 2,
       1, 2, 1, 1, 0, 2, 1])

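【Comments】:

@user5767535 As you can see, it runs on my Ubuntu machine with sklearn version '0.17', Anaconda distribution for Python 3.5. I can only suggest checking once more that you entered the code correctly, and updating your software.
@user5767535 By the way, "New in version 0.17: stratify splitting" makes me almost sure you have to update your sklearn...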