How can I train test split in scikit learn [duplicate]: answers

【Question title】: how can I train test split in scikit learn [duplicate]
【Posted】: 2021-06-20 07:02:33
【Question】:

Does anyone know what the problem is here?

import numpy as np
from sklearn.model_selection import train_test_split

x = np.linspace(-3, 3, 100)
rng = np.random.RandomState(42)
y = np.sin(4 * x) + x + rng.uniform(size=len(x))
X = x[:, np.newaxis]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

I get this error:

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

【Comments】:

Tags: python scikit-learn train-test-split

【Solution 1】:

From the documentation:

3.1.2.2. Cross-validation iterators with stratification based on class labels.

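Some classification problems can exhibit a large imbalance in the distribution of the target classes: for instance there could be several times more negative samples than positive samples. In such cases it is recommended to use stratified sampling as implemented in StratifiedKFold and StratifiedShuffleSplit to ensure that relative class frequencies are approximately preserved in each train and validation fold.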
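
As a rough illustration of the quoted passage, the sketch below runs StratifiedKFold on a made-up imbalanced label array. The 80/20 labels, the placeholder features, and the variable names are assumptions for demonstration only and do not come from the question.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 80 negatives and 20 positives.
y = np.array([0] * 80 + [1] * 20)
# Placeholder features; only the labels matter for this demonstration.
X = np.zeros((len(y), 1))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Record the share of positive samples in each validation fold.
fold_positive_rates = []
for train_idx, val_idx in skf.split(X, y):
    fold_positive_rates.append(y[val_idx].mean())

print(fold_positive_rates)
```

Each validation fold should contain positives at close to the overall 20% rate, which is the "relative class frequencies" property the documentation refers to.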

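【Comments】:

【Solution 2】:

The stratify=y argument to train_test_split is what raises the error. Stratification is meant for labels that contain repeated values. For example, suppose your label column takes the values 0 and 1. Passing stratify=y then preserves the original proportion of the labels in the training sample: if you have 60% ones and 40% zeros, your training sample will have the same proportions. In your case y is continuous, so practically every value is unique and each one is treated as a class with a single member, which is exactly what the error message complains about.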
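
The proportion-preserving behavior described in this answer can be checked with a small sketch. The label array (60 ones, 40 zeros) and the index-valued features are invented for illustration and are not part of the original question.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical binary labels: 60% ones and 40% zeros (illustrative data only).
y = np.array([1] * 60 + [0] * 40)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# The mean of a 0/1 array is the fraction of ones.
train_ratio = y_train.mean()
test_ratio = y_test.mean()
print(train_ratio, test_ratio)
```

Both ratios should come out at about 0.6, matching the full label array.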

【Comments】:

【Solution 3】:

Try removing stratify=y and you should be good to go. Also, take a look here.
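
A minimal sketch of that suggestion, reusing the data from the question and simply dropping stratify=y. The uniqueness check (n_unique) is an addition for illustration, to show why stratifying on this target cannot work.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.linspace(-3, 3, 100)
rng = np.random.RandomState(42)
y = np.sin(4 * x) + x + rng.uniform(size=len(x))
X = x[:, np.newaxis]

# y is a continuous regression target: count how many distinct values it has.
n_unique = np.unique(y).size
print(n_unique, "distinct target values out of", len(y))

# Without stratify, the split is a plain shuffled split and works for regression.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```

With test_size=0.25 on 100 samples, the split yields 75 training rows and 25 test rows.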

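【Comments】:

Note that in cases like this, we flag the question as a duplicate rather than answering it.
I couldn't find the option to do that. I don't know whether it is still locked because of my low reputation or whether I just can't find it...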