Splitting a tensor into training and test sets

【Question title】: Split tensor into training and test sets
【Posted】: 2017-06-11 02:33:13
【Question description】:

Suppose I have read in a text file using TextLineReader. Is there a way to split it into training and test sets in TensorFlow? Something like:

def read_my_file_format(filename_queue):
  reader = tf.TextLineReader()
  key, record_string = reader.read(filename_queue)
  raw_features, labels = tf.decode_csv(record_string)
  features = some_processing(raw_features)
  # tf.train_split is hypothetical: it stands for the kind of function being asked for
  features_train, labels_train, features_test, labels_test = tf.train_split(features,
                                                                            labels,
                                                                            frac=.1)
  return features_train, labels_train, features_test, labels_test

【Question comments】:

Related: *.com/questions/54519309/…

Tags: tensorflow cross-validation training-data

【Solution 1】:

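As elham said, you can do this easily with scikit-learn. scikit-learn is an open-source machine learning library. It comes with plenty of data preparation tools, including the model_selection module, which covers comparing, validating, and choosing parameters.

The model_selection.train_test_split() function is designed specifically to split your data randomly into training and test sets by a given fraction.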

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(features,
                                                    labels,
                                                    test_size=0.33,
                                                    random_state=42)

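test_size is the fraction of the data reserved for testing, and random_state is the seed for the random sampling.

I usually use it to produce the training and validation datasets, and keep the real test data separate. You can also run train_test_split twice to do this, i.e. split the data into (Train + Validation) and Test, then split Train + Validation into two separate tensors.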
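The two-pass split described above could look roughly like the sketch below. The toy arrays, the variable names, and the 60/20/20 proportions are assumptions chosen for illustration; they are not from the original answer.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (assumed for illustration): 100 examples with 2 features each.
features = np.arange(200).reshape(100, 2)
labels = np.arange(100)

# First pass: hold out 20% of the data as the test set.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=42)

# Second pass: take 25% of the remaining 80% as validation,
# which is 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)
```

Note that the second test_size is relative to what remains after the first split, not to the original data.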

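【Comments】:

Thanks, but this doesn't answer the question. I'm using TextLineReader, so the data is a tensor at this point. scikit-learn works on numpy arrays, not TensorFlow tensors.
I see. I thought it would work with any enumerable Python type. I'll have to try it.

【Solution 2】:

Something like the following should work: tf.split_v(tf.random_shuffle(...

Edit: for tensorflow>0.12 this should now be called tf.split(tf.random_shuffle(...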

Reference

See the documentation of tf.split and tf.random_shuffle for examples.
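As a minimal sketch of the shuffle-then-split idea, assuming TensorFlow 2.x with eager execution (where tf.random_shuffle is spelled tf.random.shuffle): the helper name split_by_fraction, its parameters, and the toy data are hypothetical. It turns a fraction into absolute sizes before calling tf.split, and shuffles a shared index vector so that features and labels stay aligned.

```python
import tensorflow as tf

def split_by_fraction(features, labels, test_frac=0.1, seed=None):
    """Shuffle rows, then split along axis 0 so that test_frac of them go to the test set."""
    n = int(features.shape[0])            # assumes the number of rows is known statically
    n_test = int(round(n * test_frac))    # turn the fraction into an absolute size
    n_train = n - n_test
    # Shuffle one index vector and gather with it, so features and labels stay aligned.
    order = tf.random.shuffle(tf.range(n), seed=seed)
    features = tf.gather(features, order)
    labels = tf.gather(labels, order)
    features_train, features_test = tf.split(features, [n_train, n_test], axis=0)
    labels_train, labels_test = tf.split(labels, [n_train, n_test], axis=0)
    return features_train, labels_train, features_test, labels_test

# Toy data: row i of features is [2*i, 2*i + 1] and its label is i.
features = tf.reshape(tf.range(40), [20, 2])
labels = tf.range(20)
f_train, l_train, f_test, l_test = split_by_fraction(features, labels, test_frac=0.1)
```

This only works when the whole dataset is already in memory as a single tensor. It does not apply directly to records read one at a time from a queue.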

【Comments】:

Is there any way to do this by percentage rather than with absolute numbers?

【Solution 3】:

import sklearn.model_selection as sk

X_train, X_test, y_train, y_test = sk.train_test_split(
    features, labels, test_size=0.33, random_state=42)

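【Comments】:

While this code snippet is welcome and may provide some help, it would be greatly improved if it included an explanation of how and why this solves the problem. Remember that you are answering the question for readers in the future, not just the person asking now! Please edit your answer to add an explanation, and state the limitations and assumptions that apply.
I agree this answer needs an explanation, but it is very helpful because it points the OP in the right direction. sklearn.model_selection provides good tools for splitting into training, validation, and test sets. You could split the data "manually" with tensorflow.split_v, but sklearn does it for you!
To split the data into training and test sets, use the train_test_split function from sklearn.model_selection. You need to decide the split fraction: test_size=0.33 means 33% of the original data is used for testing and the rest for training. The function returns four elements: the data and the labels for the training set and for the test set. X denotes the data and y the labels.
I think it is best done at the end of preprocessing. Why do you need tensors? I'm just curious.

【Solution 4】:

I got good results using the map and filter functions of the tf.data.Dataset API. Simply use the map function to randomly assign each example to either training or test. To do so, draw a sample from a uniform distribution for each example and check whether the sampled value is below the split ratio.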

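def split_train_test(parsed_features, train_rate):
    # Draw an integer in [0, 100) and flag the example as training data
    # when the draw falls below train_rate * 100.
    # tf.random_uniform is the TF 1.x name; later releases call it tf.random.uniform.
    draw = tf.random_uniform([1], maxval=100, dtype=tf.int32)
    threshold = tf.cast(train_rate * 100, tf.int32)
    parsed_features['is_train'] = tf.gather(draw < threshold, 0)
    return parsed_features

def grab_train_examples(parsed_features):
    # Predicate for Dataset.filter: keep the training examples.
    return parsed_features['is_train']

def grab_test_examples(parsed_features):
    # Predicate for Dataset.filter: keep the test examples.
    return ~parsed_features['is_train']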
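One caveat with a random draw inside map is that it is re-evaluated every time the dataset is iterated, so two filtered views of the same dataset may not form a consistent partition. The sketch below is a variant of the idea rather than the answer's own code, and it assumes TensorFlow 2.x. It keys a stateless random draw on each element's index so that the assignment is repeatable. The function name split_dataset, its parameters, and the toy dataset are hypothetical.

```python
import tensorflow as tf

def split_dataset(dataset, train_rate, seed=0):
    """Split a tf.data.Dataset into (train, test) with a repeatable random assignment."""
    def is_train(index, _example):
        # Stateless draw keyed on (seed, index): the same element always gets the same value.
        key = tf.stack([tf.cast(seed, tf.int64), index])
        return tf.random.stateless_uniform([], seed=key) < train_rate

    def is_test(index, example):
        return tf.logical_not(is_train(index, example))

    def drop_index(_index, example):
        return example

    indexed = dataset.enumerate()  # yields (index, example) pairs; index is int64
    train = indexed.filter(is_train).map(drop_index)
    test = indexed.filter(is_test).map(drop_index)
    return train, test

# Toy dataset (assumed for illustration): the integers 0..999.
examples = tf.data.Dataset.range(1000)
train_ds, test_ds = split_dataset(examples, train_rate=0.8)
train_items = set(int(x) for x in train_ds.as_numpy_iterator())
test_items = set(int(x) for x in test_ds.as_numpy_iterator())
```

Because the draw depends only on the seed and the element index, iterating train_ds and test_ds separately, or over several epochs, gives the same partition each time, provided the underlying dataset yields its elements in a fixed order.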

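【Comments】:

【Solution 5】:

As a stopgap, I came up with a solution that wraps sklearn's train_test_split function so that it accepts tensors as input and returns tensors.

I am new to TensorFlow and ran into the same problem, so if you have a better solution that does not use other packages, I would appreciate it.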

import tensorflow as tf
from sklearn.model_selection import train_test_split

def train_test_split_tensors(X, y, **options):
    """
    Wrapper around sklearn.model_selection.train_test_split
    that splits tensor objects and returns tensors as output.

    :param X: tensorflow.Tensor object
    :param y: tensorflow.Tensor object
    :param options: the usual sklearn options, such as test_size and train_size
    """
    # .numpy() requires eager execution (the default in TensorFlow 2.x)
    X_train, X_test, y_train, y_test = train_test_split(X.numpy(), y.numpy(), **options)

    X_train, X_test = tf.constant(X_train), tf.constant(X_test)
    y_train, y_test = tf.constant(y_train), tf.constant(y_test)

    return X_train, X_test, y_train, y_test

【Comments】: