Reading from multiple TFRecord files

【Question title】: Reading from multiple TFRecord files
【Posted】: 2019-09-23 15:30:23
【Question】:

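I am working with multiple TFRecord files and want to read from them to create a dataset. My current attempt builds a dataset of the file paths with from_tensor_slices and then uses that dataset to read the TFRecords.

(On the benefits of splitting TFRecords into multiple files: https://datascience.stackexchange.com/questions/16318/what-is-the-benefit-of-splitting-tfrecord-file-into-shards)

I would like to know whether there is a simpler, proven way to do this.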

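import tensorflow as tf

# filenames_full: list of paths to the TFRecord files
file_names_dataset = tf.data.Dataset.from_tensor_slices(filenames_full)

def read(inp):
    # Open one TFRecord file as its own dataset
    return tf.data.TFRecordDataset(inp)

# Note: map yields one nested dataset per file, not a flat stream of records
file_content = file_names_dataset.map(read)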

My next step is to parse the dataset with tf.io.parse_single_example.

【Comments】:

Tags: tensorflow-datasets tensorflow2.0

【Solution 1】:

The tf.data.TFRecordDataset constructor already accepts a list or a tensor of file names, so you can call it directly with your file names: file_content = tf.data.TFRecordDataset(filenames_full)
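A minimal sketch of that call, assuming TensorFlow 2.x with eager execution. The shard names and record contents are made up so the snippet can run on its own; only filenames_full comes from the question. The num_parallel_reads argument is an optional extra that was not part of the original answer.

```python
import os
import tempfile

import tensorflow as tf

# Write three small shards so the example is self-contained.
# File names and record contents are hypothetical.
tmp_dir = tempfile.mkdtemp()
filenames_full = []
for shard in range(3):
    path = os.path.join(tmp_dir, f"shard-{shard}.tfrecord")
    with tf.io.TFRecordWriter(path) as writer:
        for i in range(2):
            writer.write(f"shard{shard}-record{i}".encode("utf-8"))
    filenames_full.append(path)

# One TFRecordDataset over the whole list: files are read one after another.
file_content = tf.data.TFRecordDataset(filenames_full)
records = [r.numpy().decode("utf-8") for r in file_content]
print(records)

# Optional: read several files in parallel. Records from different files
# are then interleaved, so only the set of records is compared here.
parallel_content = tf.data.TFRecordDataset(filenames_full, num_parallel_reads=3)
parallel_records = sorted(r.numpy().decode("utf-8") for r in parallel_content)
```

With the sequential form, records come out in file order, which is usually what you want for evaluation; the parallel form tends to be more useful for training input where order does not matter.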

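From the tf.io.parse_single_example documentation:

One might see performance advantages by batching Example protos with parse_example instead of using this function directly.

I therefore recommend batching your dataset before mapping the tf.io.parse_example function over it: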

tf.data.TFRecordDataset(
  filenames_full
).batch(
  my_batch_size
).map(
  lambda batch: tf.io.parse_example(batch, my_features)
)
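To make the snippet above concrete, here is a hedged end-to-end sketch, again assuming TensorFlow 2.x. The feature names "value" and "label", the helper make_example, the file names, and the batch size are all assumptions for illustration; my_features must match whatever was actually written into your records.

```python
import os
import tempfile

import tensorflow as tf


def make_example(value, label):
    # Hypothetical helper: serialize one tf.train.Example with two features.
    return tf.train.Example(features=tf.train.Features(feature={
        "value": tf.train.Feature(float_list=tf.train.FloatList(value=[value])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    })).SerializeToString()


# Write two shards with four examples each (made-up data).
tmp_dir = tempfile.mkdtemp()
filenames_full = []
for shard in range(2):
    path = os.path.join(tmp_dir, f"part-{shard}.tfrecord")
    with tf.io.TFRecordWriter(path) as writer:
        for i in range(4):
            writer.write(make_example(float(i), shard))
    filenames_full.append(path)

my_batch_size = 4
# Feature spec describing how each serialized Example should be parsed.
my_features = {
    "value": tf.io.FixedLenFeature([], tf.float32),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

# Batch the serialized strings first, then parse a whole batch at once.
dataset = (
    tf.data.TFRecordDataset(filenames_full)
    .batch(my_batch_size)
    .map(lambda batch: tf.io.parse_example(batch, my_features))
)

batches = list(dataset)
for batch in batches:
    print(batch["value"].numpy(), batch["label"].numpy())
```

Each element of the resulting dataset is a dictionary of tensors whose first dimension is the batch size, so no separate batching step is needed after parsing.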

If you want a complete example, I shared my input pipeline (which reads from many TFRecord files) in this post.

Kind regards, Alexis.

【Discussion】: