Inconsistent behaviour shuffling numpy arrays using TensorFlow: answers

[Question title]: Inconsistent behaviour shuffling numpy arrays using TensorFlow
[Posted]: 2021-10-30 22:11:16
[Question description]:

I ran into a strange behaviour when shuffling numpy arrays in TensorFlow (using Google Colab):

from matplotlib import pyplot as plt
import tensorflow as tf
import numpy as np

seed = int(np.random.randint(0, 2 ** 16))
(train_x, train_y), (test_x, test_y) = tf.keras.datasets.cifar10.load_data()
train_x = train_x / 255.0 # this line
train_x = tf.random.shuffle(train_x, seed=seed)
train_y = tf.random.shuffle(train_y, seed=seed)
train_dataset = tf.data.Dataset.from_tensor_slices((train_x, train_y))

for i in train_dataset.take(10):
    print(f"Label: {i[1].numpy()[0]}", end=', ')
    plt.figure()
    plt.imshow(i[0])

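After shuffling train_x and train_y (both numpy arrays) this way, I visually confirmed that the correspondence between indices is preserved, i.e. it seems that each call to shuffle resets the RNG and yields the same permutation both times. However, when I comment out the normalization step (marked "this line"), the shuffling breaks the correspondence between indices.

I don't understand this behaviour and would like to know why it happens. Any help is appreciated.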

[Comments]:

Tags: numpy tensorflow

[Solution 1]:

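For me, on Google Colab, your code did not reproduce the same permutation, with or without the normalization line.

What did produce the same permutation was setting the global (top-level) seed instead of passing the seed as an argument to the function, as follows: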

import tensorflow as tf

seed = 11030
tf.random.set_seed(seed)

(train_x, train_y), (test_x, test_y) = tf.keras.datasets.cifar10.load_data()
train_x = train_x / 255.0 # this line
train_x = tf.random.shuffle(train_x)
train_y = tf.random.shuffle(train_y)
train_dataset = tf.data.Dataset.from_tensor_slices((train_x, train_y))

# ...visualize output or print results of arrays to confirm...
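
If the goal is simply to keep images and labels aligned, one way to sidestep seed semantics altogether is to draw a single permutation of indices and apply it to both arrays. As far as I can tell from the `tf.random.set_seed` documentation, repeated calls to a random op with identical arguments can share a kernel whose internal counter advances on each call, which may be why two `tf.random.shuffle` calls with the same `seed` do not reliably agree. The sketch below is not the approach from the answer above: it swaps in NumPy's `default_rng` for `tf.random.shuffle`, and the helper name `shuffle_in_unison` and the small synthetic arrays are hypothetical stand-ins for the CIFAR-10 data.

```python
import numpy as np


def shuffle_in_unison(x, y, seed):
    """Shuffle x and y along axis 0 using one shared permutation.

    Hypothetical helper for illustration; not part of TensorFlow or NumPy.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of samples")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))  # one permutation, reused for both arrays
    return x[perm], y[perm]


# Synthetic stand-ins for CIFAR-10 (assumed shapes, not the real data):
# sample i is an image filled with the value i, and its label is also i.
n = 8
images = np.stack([np.full((4, 4, 3), i, dtype=np.uint8) for i in range(n)])
labels = np.arange(n, dtype=np.uint8).reshape(n, 1)

shuffled_x, shuffled_y = shuffle_in_unison(images, labels, seed=11030)

# Check that every image still sits next to its own label.
pairs_intact = all(
    int(shuffled_x[i, 0, 0, 0]) == int(shuffled_y[i, 0]) for i in range(n)
)
print(pairs_intact)
```

The shuffled arrays can then be passed to `tf.data.Dataset.from_tensor_slices((shuffled_x, shuffled_y))` exactly as in the question. Because both arrays are indexed with the same `perm`, the pairing does not depend on dtype or on whether the normalization line is present.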

[Discussion]: