100% classifier accuracy after using train_test_split: answers

【Question title】: 100% classifier accuracy after using train_test_split
【Posted】: 2020-05-13 04:52:46
【Question description】:

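I am working with the mushroom classification dataset (available here: https://www.kaggle.com/uciml/mushroom-classification).

I am trying to split my data into training and test sets for my model, but whenever I use the train_test_split method, my model reaches 100% accuracy. This is not the case when I split the data manually.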

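import xgboost as xgb
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# data: the preprocessed mushroom DataFrame (loading is not shown in the question)
x = data.copy()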
y = x['class']
del x['class']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))

This produces:

[[1299    0]
 [   0 1382]]
1.0

If I split the data manually, I get a more reasonable result.

x = data.copy()
y = x['class']
del x['class']

x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]

model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))

Result:

[[2007    0]
 [ 336  337]]
0.8746268656716418

What could cause this behavior?

Edit: As requested, I am including the shapes of the slices.

train_test_split:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

Result:

(5443, 64)
(5443,)
(2681, 64)
(2681,)

Manual split:

x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

Result:

(5443, 64)
(5443,)
(2680, 64)
(2680,)

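I have also tried defining my own split function, and the split it produces leads to 100% classifier accuracy as well.

Here is the code for that split: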

def split_data(dataFrame, testRatio):
  dataCopy = dataFrame.copy()
  testCount = int(len(dataFrame)*testRatio)
  dataCopy = dataCopy.sample(frac = 1)
  y = dataCopy['class']
  del dataCopy['class']
  return dataCopy[testCount:], dataCopy[0:testCount], y[testCount:], y[0:testCount]

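【Question discussion】:

What are the shapes of X_train, X_test, y_train, and y_test after each splitting method?
@G.Anderson I have updated my question with the shapes.
Does the behavior persist if you run train_test_split again or change the test_size parameter? It is possible (though unlikely) that you got a very lucky split the first time. Otherwise, have you applied any other transformations to the data that are not shown? This looks a lot like data leakage, either between train and test or between the target and the features.
It persists across repeated attempts and when I change the test size (whatever I change it to, I still get 100%). I have done some preprocessing on the data, but all of it was done before I split the dataset.
Wait! What preprocessing did you do before splitting? You should not perform feature selection on the whole dataset. Fit it on the training set only, transform the training set, and then use the same fitted transformer on the test set. The same goes for a standard scaler: fit it on the training data after splitting, then use it to transform both the training and test sets. If there is nothing wrong with your manual split code, this may be how data is leaking between your training and test sets.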
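
The advice in the last comment can be sketched as follows. This is only an illustration, not the asker's pipeline: the synthetic data from `make_classification`, the `StandardScaler` step, and `LogisticRegression` as the model are all stand-ins chosen so that the example runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: this is not the mushroom dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Split first, so the test rows never influence any fitted statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

# The pipeline fits the scaler on the training rows only, then reuses
# those training statistics to transform the test rows at predict time.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Wrapping the preprocessing and the model in one pipeline means the same object can later be handed to a cross-validation routine, which then refits the scaler inside every fold.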

Tags: python dataframe machine-learning

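【Solution 1】:

Your manual train/test split does not shuffle the data, whereas the scikit-learn function shuffles by default. The two splits have essentially the same shapes, but they contain different rows.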

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Code:

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(18).reshape((9, 2)), range(9)
print(X)
print(list(y))
X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.33, random_state=42)

print("\nTraining with shuffle:")
print(X_train)
print(y_train)


print("\nTesting with shuffle:")
print(X_test)
print(y_test)


print("\nWithout Shuffle:")
tmp = train_test_split(X, y, test_size=0.33, shuffle=False)
print(tmp[0])
print(tmp[2])
print()
print(tmp[1])
print(tmp[3])

Output:

[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]]
[0, 1, 2, 3, 4, 5, 6, 7, 8]

Training with shuffle:
[[ 0  1]
 [16 17]
 [ 4  5]
 [ 8  9]
 [ 6  7]
 [12 13]]
[0, 8, 2, 4, 3, 6]

Testing with shuffle:
[[14 15]
 [ 2  3]
 [10 11]]
[7, 1, 5]

Without Shuffle:
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]
[0, 1, 2, 3, 4, 5]

[[12 13]
 [14 15]
 [16 17]]
[6, 7, 8]

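【Discussion】:

I understand that part, but why would it affect the classifier's results?
This seems more like a comment than an answer. Have you confirmed that this affects the model behavior described in the OP?
Setting shuffle to False does affect the model's accuracy.
It seems that no matter what I do, if I shuffle the order of the data, I get 100% accuracy.
I just checked the UCI repository, and it lists edible: 4208 (51.8%) and poisonous: 3916 (48.2%). A 33% split without shuffling the dataset is imbalanced: 8124 * 0.33 ≈ 2681 test rows and 8124 - 2681 = 5443 training rows. If the dataset is sorted by class, edible first and then poisonous, the training set holds 4208 edible rows and only 5443 - 4208 = 1235 poisonous rows. That is imbalanced.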
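
To make the arithmetic in the last comment concrete, here is a small sketch that uses only the class counts quoted from the UCI page. It assumes, purely for illustration, that the rows are sorted with all edible samples first. The real file is not necessarily ordered this way, so the numbers show a worst case rather than the asker's actual data.

```python
import numpy as np

# Class counts quoted in the comment above (0 = edible, 1 = poisonous).
n_edible, n_poisonous = 4208, 3916

# Hypothetical worst case: labels sorted by class, edible first.
y_sorted = np.concatenate([np.zeros(n_edible, int), np.ones(n_poisonous, int)])

n_train = 5443  # training size reported in the question

# Unshuffled split: the first n_train rows train, the remaining rows test.
train_counts = np.bincount(y_sorted[:n_train], minlength=2)
test_counts = np.bincount(y_sorted[n_train:], minlength=2)
print("unshuffled train (edible, poisonous):", train_counts.tolist())
print("unshuffled test  (edible, poisonous):", test_counts.tolist())

# Shuffled split: both parts keep roughly the overall 52/48 class ratio.
rng = np.random.default_rng(0)
y_shuffled = rng.permutation(y_sorted)
shuffled_train_counts = np.bincount(y_shuffled[:n_train], minlength=2)
print("shuffled train   (edible, poisonous):", shuffled_train_counts.tolist())
```

Under this assumed ordering, the unshuffled test set would contain no edible rows at all, so the model would be scored on a class mix very different from the one it was trained on.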

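【Solution 2】:

It turns out the results were correct; I was just going about testing the model's output the wrong way.

I opened another thread, where someone suggested trying cross-validation, and that seemed to settle the matter.

【Discussion】: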

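【Solution 3】:

You got lucky with train_test_split. The split you made manually probably leaves the test set with data that looks least like what the model saw during training, which makes it a tougher validation than train_test_split, which shuffles the data internally before splitting it.

For better validation, use K-fold cross-validation. It checks the model's accuracy by using each part of the data in turn as the test set and the remaining parts as the training set.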
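
A minimal sketch of that suggestion, assuming scikit-learn's `cross_val_score` with a shuffled `StratifiedKFold`. The synthetic data and the `DecisionTreeClassifier` are stand-ins so that the snippet runs without the Kaggle file or xgboost; an `XGBClassifier` could presumably be passed in the same way, since it follows the scikit-learn estimator interface.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: this is not the mushroom dataset).
X, y = make_classification(n_samples=600, n_features=12, random_state=0)

# Five folds; each fold serves once as the test set while the other four
# are used for training. Stratification keeps the class ratio per fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring="accuracy"
)

print("per-fold accuracy:", scores.round(3))
print("mean accuracy:", round(float(scores.mean()), 3))
```

If the per-fold scores are all close to each other, a single high score from train_test_split is less likely to be a lucky split.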

【Discussion】: