How to create tf.data.dataset from directories of tfrecords? (answers)

【Question title】: How to create tf.data.dataset from directories of tfrecords?
【Posted】: 2018-10-25 16:04:21
【Question】:

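My dataset consists of several directories, one per class. Each directory contains a different number of .tfrecord files. How can I sample 5 images from each directory (each .tfrecord file corresponds to one image)? My other question: how can I sample 5 of these directories, and then sample 5 images from each of them?

I want to do this with tf.data.dataset only. So I want a dataset from which I get an iterator, such that iterator.next() gives me a batch of 25 images containing 5 samples from each of 5 classes.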

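【Comments】:

This may sound silly, but since you need exactly 5 images from each class, why not create 5 tf.data.dataset instances, each with a batch_size of 5? Otherwise, tf.data.TFRecordDataset can take a list of strings as input, but you get less control over the sampling process.
Then if I want to run another experiment with 6 samples, I have to re-create the files. The same goes for 10 samples, and so on.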

Tags: tensorflow tensorflow-datasets

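【Solution 1】:

Please find a possible solution below.

For the sake of demonstration, I am using Python generators instead of TFRecords as input (I assume you know how to use TF Dataset to read and parse the files in each folder. Other threads cover this, e.g. here).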

import tensorflow as tf
import numpy as np

def get_class_generator(class_id, num_el, el_shape=(32, 32), el_dtype=np.int32):
    """ Returns a dummy generator, 
        outputting "num_el" elements of a single class (input data & class label) 
    """
    def class_generator():
        x = 0
        for x in range(num_el):
            element = np.ones(el_shape, dtype=el_dtype) * x
            yield element, class_id
    return class_generator


def concatenate_datasets(datasets):
    """ Concatenate a list of datasets together.
        Snippet by user2781994 (https://stackoverflow.com/a/49069420/624547)
    """
    ds0 = tf.data.Dataset.from_tensors(datasets[0])
    for ds1 in datasets[1:]:
        ds0 = ds0.concatenate(tf.data.Dataset.from_tensors(ds1))
    return ds0


num_classes = 11
class_batch_size = 3
num_classes_per_batch = 5
# note: using 3 instead of 5 for class_batch_size in this example 
#       just to distinguish between the 2 vars.

# Initializing per-class datasets:
# (note: replace tf.data.Dataset.from_generator(...) to suit your use-case,
#        e.g. tf.data.TFRecordDataset(glob.glob(perclass_tfrecords_path))
#                    .map(your_parsing_function))
class_datasets = [tf.data.Dataset
                 .from_generator(get_class_generator(
                      class_id, num_el=np.random.randint(1, 60) 
                      # ^ simulating unequal number of samples per class
                      ), (tf.int32, tf.int32), ([32, 32], []))
                 .repeat(-1)
                 .batch(class_batch_size)
                  for class_id in range(num_classes)]

# Initializing complete dataset:
dataset = (tf.data.Dataset
           # Concatenating all the class datasets together:
           .zip(tuple(class_datasets))
           .flat_map(lambda *args: concatenate_datasets(args))
           # Shuffling the class datasets:
           .shuffle(buffer_size=num_classes)
           # Flattening batches from shape (num_classes_per_batch, class_batch_size, ...)
           # into (num_classes_per_batch * class_batch_size, ...):
           .flat_map(lambda *args: tf.data.Dataset.from_tensor_slices(args))
           # Returning correct number of el. (num_classes_per_batch * class_batch_size):
           .batch(num_classes_per_batch * class_batch_size))

# Visualizing results:
next_batch = dataset.make_one_shot_iterator().get_next()
with tf.Session() as sess:
    for i in range(10):
        batch = sess.run(next_batch)
        print(">> batch {}".format(i))
        print("- inputs shape: {} ; label shape: {}".format(batch[0].shape,batch[1].shape))
        print("- class values: {}".format(batch[1]))

Output:

>> batch 0
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [ 1  1  1  0  0  0 10 10 10  2  2  2  9  9  9]
>> batch 1
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [0 0 0 2 2 2 3 3 3 5 5 5 6 6 6]
>> batch 2
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [ 9  9  9  8  8  8  4  4  4  3  3  3 10 10 10]
>> batch 3
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [7 7 7 8 8 8 6 6 6 6 6 6 2 2 2]
>> batch 4
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [1 1 1 0 0 0 1 1 1 8 8 8 5 5 5]
>> batch 5
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [2 2 2 4 4 4 9 9 9 5 5 5 5 5 5]
>> batch 6
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [0 0 0 7 7 7 3 3 3 9 9 9 7 7 7]
>> batch 7
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [10 10 10 10 10 10  1  1  1  6  6  6  7  7  7]
>> batch 8
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [4 4 4 3 3 3 5 5 5 6 6 6 3 3 3]
>> batch 9
- inputs shape: (15, 32, 32) ; label shape: (15,)
- class values: [8 8 8 9 9 9 2 2 2 8 8 8 0 0 0]

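【Discussion】:

For batch 5: - inputs shape: (15, 32, 32) ; label shape: (15,) - class values: [2 2 2 4 4 4 9 9 9 5 5 5 5 5 5]. This is not what I want. I want the same number of samples from each class. Here we have 6 samples from class 5.
Yes; with this solution, a class may appear twice in one batch (hence the "2 * 3" samples of class 5). @mrry's solution probably avoids this.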
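
The requirement raised in this discussion (a fixed number of distinct classes per batch, each contributing the same number of samples) can be checked mechanically. Below is a minimal sketch in plain Python; the helper name `is_balanced_batch` is hypothetical and not part of TensorFlow, and the two label vectors are copied from batches 1 and 5 of the output above.

```python
from collections import Counter

def is_balanced_batch(labels, classes_per_batch, samples_per_class):
    """Return True if `labels` holds exactly `classes_per_batch` distinct
    classes, each appearing exactly `samples_per_class` times."""
    counts = Counter(labels)
    return (len(counts) == classes_per_batch
            and all(n == samples_per_class for n in counts.values()))

# Label vectors copied from the printed output of Solution 1.
batch_1 = [0, 0, 0, 2, 2, 2, 3, 3, 3, 5, 5, 5, 6, 6, 6]
batch_5 = [2, 2, 2, 4, 4, 4, 9, 9, 9, 5, 5, 5, 5, 5, 5]

print(is_balanced_batch(batch_1, 5, 3))  # True: 5 classes, 3 samples each
print(is_balanced_batch(batch_5, 5, 3))  # False: class 5 appears 6 times
```

A check like this could be run on `batch[1]` inside the session loop to count how often the shuffle-based approach produces a repeated class.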

【Solution 2】:

EDIT: If the number of classes is greater than 5, you can use the new tf.contrib.data.sample_from_datasets() API (currently available in tf-nightly, and will be available in TensorFlow 1.9).

import tensorflow as tf

directories = ["class_0/*", "class_1/*", "class_2/*", "class_3/*", ...]

CLASSES_PER_BATCH = 5
EXAMPLES_PER_CLASS_PER_BATCH = 5
BATCH_SIZE = CLASSES_PER_BATCH * EXAMPLES_PER_CLASS_PER_BATCH
NUM_CLASSES = len(directories)


# Build one dataset per class.
per_class_datasets = [
    tf.data.TFRecordDataset(tf.data.Dataset.list_files(d)) for d in directories]

# Next, build a dataset where each element is a vector of 5 classes to be chosen
# for a particular batch.
classes_per_batch_dataset = tf.contrib.data.Counter().map(
    lambda _: tf.random_shuffle(tf.range(NUM_CLASSES))[:CLASSES_PER_BATCH])

# Transform the dataset of per-batch class vectors into a dataset with one
# one-hot element per example (i.e. 25 examples per batch).
class_dataset = classes_per_batch_dataset.flat_map(
    lambda classes: tf.data.Dataset.from_tensor_slices(
        tf.one_hot(classes, NUM_CLASSES)).repeat(EXAMPLES_PER_CLASS_PER_BATCH))

# Use `tf.contrib.data.sample_from_datasets()` to select an example from the
# appropriate dataset in `per_class_datasets`.
example_dataset = tf.contrib.data.sample_from_datasets(per_class_datasets,
                                 class_dataset)

# Finally, combine 25 consecutive examples into a batch.
result = example_dataset.batch(BATCH_SIZE)
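
To see why the selector stream above leads to balanced batches, it can be mimicked in plain Python without TensorFlow. This is only a sketch under stated assumptions: `selector_stream` is a hypothetical helper, `random.sample` stands in for `tf.random_shuffle(...)[:CLASSES_PER_BATCH]`, plain class indices stand in for the one-hot vectors, and `NUM_CLASSES = 11` is an arbitrary value chosen for illustration.

```python
import random
from collections import Counter

NUM_CLASSES = 11  # arbitrary value, for illustration only
CLASSES_PER_BATCH = 5
EXAMPLES_PER_CLASS_PER_BATCH = 5
BATCH_SIZE = CLASSES_PER_BATCH * EXAMPLES_PER_CLASS_PER_BATCH

def selector_stream(num_batches, rng):
    """Yield one class index per example, mimicking `class_dataset`."""
    for _ in range(num_batches):
        # Stand-in for tf.random_shuffle(tf.range(NUM_CLASSES))[:CLASSES_PER_BATCH].
        classes = rng.sample(range(NUM_CLASSES), CLASSES_PER_BATCH)
        # Stand-in for from_tensor_slices(...).repeat(EXAMPLES_PER_CLASS_PER_BATCH):
        # the sequence of 5 chosen classes is replayed 5 times.
        for _ in range(EXAMPLES_PER_CLASS_PER_BATCH):
            for class_index in classes:
                yield class_index

stream = list(selector_stream(num_batches=3, rng=random.Random(0)))
batches = [stream[i:i + BATCH_SIZE] for i in range(0, len(stream), BATCH_SIZE)]
for batch in batches:
    # Each batch should show 5 distinct classes with a count of 5 each.
    print(sorted(Counter(batch).items()))
```

Because every group of 25 consecutive selectors comes from a single draw of 5 classes, batching by 25 keeps the groups aligned with the batches. Within a batch the classes alternate (c0, c1, ..., c4, c0, ...) rather than appearing in contiguous runs.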

If you have exactly 5 classes, you can define a nested dataset for each directory and combine them with Dataset.interleave():

# NOTE: We're assuming that the 0th directory contains elements from class 0, etc.
directories = ["class_0/*", "class_1/*", "class_2/*", "class_3/*", "class_4/*"]
directories = tf.data.Dataset.from_tensor_slices(directories)
directories = directories.apply(tf.contrib.data.enumerate_dataset())    

# Define a function that maps each (class, directory) pair to the (shuffled)
# records in those files.
def per_directory_dataset(class_label, directory_glob):
  files = tf.data.Dataset.list_files(directory_glob, shuffle=True)
  records = tf.data.TFRecordDataset(files)
  # Zip the records with their class. 
  # NOTE: This part might not be necessary if the records contain information about
  # their class that can be parsed from them.
  return tf.data.Dataset.zip(
      (records, tf.data.Dataset.from_tensors(class_label).repeat(None)))

# NOTE: The `cycle_length` and `block_length` here aren't strictly necessary,
# because the batch size is exactly `number of classes * images per class`.
# However, these arguments may be useful if you want to decouple these numbers.
merged_records = directories.interleave(per_directory_dataset,
                                        cycle_length=5, block_length=5)
merged_records = merged_records.batch(25)
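
The effect of `cycle_length=5, block_length=5` in the snippet above can be modelled in plain Python. This is a simplified sketch, not the TensorFlow implementation: `interleave_sketch` and `fake_class` are hypothetical names, and each source is assumed to be endless, so what happens when a directory runs out of records is deliberately left out.

```python
import itertools

def interleave_sketch(sources, cycle_length, block_length):
    """Simplified model of Dataset.interleave(): visit the first
    `cycle_length` sources round-robin, taking `block_length`
    consecutive elements from each. Sources are assumed endless."""
    active = [iter(source) for source in sources[:cycle_length]]
    while True:
        for iterator in active:
            for _ in range(block_length):
                yield next(iterator)

def fake_class(label):
    # Hypothetical stand-in for one directory: endless (label, index) pairs.
    return ((label, index) for index in itertools.count())

sources = [fake_class(label) for label in range(5)]
merged = interleave_sketch(sources, cycle_length=5, block_length=5)

# Take the first 25 elements, i.e. what .batch(25) would group together.
first_batch = list(itertools.islice(merged, 25))
labels = [label for label, _ in first_batch]
print(labels)
```

In this model every batch of 25 contains five consecutive elements from each of the five classes, always in the same class order. The mix is therefore deterministic: only the records inside each class vary (through `list_files(..., shuffle=True)` in the real pipeline), not which classes appear.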

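【Discussion】:

This does look more elegant than my take. :) But I wonder: would this work with num_classes > 5? In that case, I can't find a way to use Dataset.interleave() to pick elements from exactly 5 classes per batch...
It depends on what kind of mixing you want in the resulting batches. One option is to set cycle_length=num_classes and try tuning block_length, but that gives you a deterministic mix, which may not be desirable. In TF 1.9 (and the current nightlies), you can use tf.contrib.data.sample_from_datasets(), which lets you sample randomly from a list of input datasets according to a given weight distribution, and gives more control, especially when the weights are themselves a dataset of distributions indicating which classes to select.
I just tried your code. Indeed, it only generates batches from the first 5 classes until they run out of elements, and then samples from the next ones. But yes, I guess it depends on what kind of mixing the OP wants. I didn't know about tf.contrib.data.sample_from_datasets(); it looks like a very useful feature. Thanks for sharing!
I tried your code. It seems I always get samples from the same classes on each iterator.next(). What I want is to get 5 different classes each time iterator.next() is called.
@Siavash I think I understand your question better now... please take a look at the updated version.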