【Posted】: 2015-07-13 11:20:36
【Question】:
I am trying to split a sample dataset using Scikit-learn's Stratified Shuffle Split. I followed the example shown in the Scikit-learn documentation:
import pandas as pd
import numpy as np
# UCI's wine dataset
wine = pd.read_csv("https://s3.amazonaws.com/demo-datasets/wine.csv")
# separate target variable from dataset
target = wine['quality']
data = wine.drop('quality',axis = 1)
# Stratified Split of train and test data
from sklearn.cross_validation import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(target, n_iter=3, test_size=0.2)
for train_index, test_index in sss:
    xtrain, xtest = data[train_index], data[test_index]
    ytrain, ytest = target[train_index], target[test_index]
# Check target series for distribution of classes
ytrain.value_counts()
ytest.value_counts()
However, when I run this script, I get the following error:
IndexError: indices are out-of-bounds
Can someone point out what I am doing wrong here? Thanks!
【Comments】:
-
It looks like your index error occurs on this line:
xtrain, xtest = data[train_index], data[test_index]. If so, you could edit your question to say that, which would help others find the problem.
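To expand on that line: `data` is a pandas DataFrame, and `data[some_integer_array]` looks the integers up as column labels rather than as row positions, which is the likely source of the error. Selecting rows by position with `.iloc` avoids this. Below is a minimal sketch under a few assumptions: it uses a small synthetic table (the column names `alcohol` and `pH` and all values are made up) instead of the downloaded wine CSV, and it uses the newer `sklearn.model_selection` module, where `n_iter` is called `n_splits` and the splitter is iterated through `split(X, y)`. With the older `sklearn.cross_validation` import from the question, only the two `.iloc` lines should need to change.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the wine data (hypothetical columns and values).
rng = np.random.RandomState(0)
wine = pd.DataFrame({
    "alcohol": rng.rand(100),
    "pH": rng.rand(100),
    "quality": np.repeat([5, 6], [60, 40]),  # 60 rows of class 5, 40 of class 6
})

# Separate the target variable from the features.
target = wine["quality"]
data = wine.drop("quality", axis=1)

# Stratified split: each split keeps the class proportions of `target`.
sss = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=0)
for train_index, test_index in sss.split(data, target):
    # .iloc selects rows by integer position; plain data[...] would
    # treat the integers as column labels instead.
    xtrain, xtest = data.iloc[train_index], data.iloc[test_index]
    ytrain, ytest = target.iloc[train_index], target.iloc[test_index]

# Check the class distribution of the last split.
print(ytrain.value_counts())
print(ytest.value_counts())
```

Note that, as in the question, the variables are overwritten on every pass through the loop, so only the last of the three splits is left at the end.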
Tags: python pandas scikit-learn