【Question Title】: SHA hashing for a training/validation/testing set split
【Posted】: 2017-06-16 17:33:48
【Question】:

Below is a small snippet from the full code.

I am trying to understand the logic behind this splitting method.

  • A SHA-1 digest is 40 hexadecimal characters. What kind of probability is the expression computing?
  • What is the reason for (MAX_NUM_IMAGES_PER_CLASS + 1)? Why add 1?
  • Does choosing a different value for MAX_NUM_IMAGES_PER_CLASS affect the quality of the split?
  • How good a split can we get from this? Is this a recommended way of splitting a dataset?

      # We want to ignore anything after '_nohash_' in the file name when
      # deciding which set to put an image in; the data set creator has a way of
      # grouping photos that are close variations of each other. For example
      # this is used in the plant disease data set to group multiple pictures of
      # the same leaf.
      hash_name = re.sub(r'_nohash_.*$', '', file_name)
      # This looks a bit magical, but we need to decide whether this file should
      # go into the training, testing, or validation sets, and we want to keep
      # existing files in the same set even if more files are subsequently
      # added.
      # To do that, we need a stable way of deciding based on just the file name
      # itself, so we do a hash of that and then use that to generate a
      # probability value that we use to assign it.
      hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
      percentage_hash = ((int(hash_name_hashed, 16) %
                          (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                         (100.0 / MAX_NUM_IMAGES_PER_CLASS))
      if percentage_hash < validation_percentage:
        validation_images.append(base_name)
      elif percentage_hash < (testing_percentage + validation_percentage):
        testing_images.append(base_name)
      else:
        training_images.append(base_name)

      result[label_name] = {
          'dir': dir_name,
          'training': training_images,
          'testing': testing_images,
          'validation': validation_images,
      }
    
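For reference, the snippet above can be condensed into a self-contained sketch that runs without TensorFlow. The function name `which_set`, the default percentages, and the `MAX_NUM_IMAGES_PER_CLASS` value are my own assumptions for illustration; `compat.as_bytes` from the original is replaced with a plain UTF-8 encode:

```python
import hashlib
import re

# Assumed value for illustration (~134 million bins).
MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1

def which_set(file_name, validation_percentage=10, testing_percentage=10):
    """Deterministically map a file name to 'training', 'testing' or 'validation'."""
    # Strip the '_nohash_' suffix so close variants of one image hash together.
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    # Plain UTF-8 encode in place of compat.as_bytes.
    hash_hex = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
    percentage_hash = ((int(hash_hex, 16) % (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                       (100.0 / MAX_NUM_IMAGES_PER_CLASS))
    if percentage_hash < validation_percentage:
        return 'validation'
    if percentage_hash < validation_percentage + testing_percentage:
        return 'testing'
    return 'training'

# The same base name always lands in the same set, and '_nohash_'
# variants of one base name travel together:
assert which_set('leaf_1.jpg') == which_set('leaf_1.jpg')
assert which_set('leaf_1.jpg_nohash_a') == which_set('leaf_1.jpg_nohash_b')
```

Because the assignment depends only on the hashed name, adding new files later never moves an existing file to a different set.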

【Comments】:

    Tags: python machine-learning tensorflow sha


    【Solution 1】:

    This code simply distributes the file names "randomly" (but repeatably) over a number of bins and then groups the bins into three categories. The number of bits in the hash is irrelevant (as long as it is "enough", which is probably about 35 for this sort of work).

    Reducing modulo n+1 produces a value on [0, n], and multiplying that by 100/n obviously produces a value on [0, 100], which is interpreted as a percentage. Choosing n to be MAX_NUM_IMAGES_PER_CLASS is meant to keep the rounding error in that interpretation to no more than "one image".
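To make the granularity visible, here is a toy demonstration with a deliberately tiny n (an illustrative stand-in for MAX_NUM_IMAGES_PER_CLASS): the resulting "percentages" can only fall on a grid with spacing 100/n, which is why a large n keeps the per-boundary rounding error to at most one image.

```python
# With n = 4, (h % (n + 1)) * (100.0 / n) can only take n + 1 grid values.
n = 4
values = sorted({(h % (n + 1)) * (100.0 / n) for h in range(1000)})
print(values)  # [0.0, 25.0, 50.0, 75.0, 100.0]
```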

    This strategy is reasonable, but looks a bit more sophisticated than it is (since rounding is still going on, and the remainder introduces a bias, although with numbers this large it is entirely unobservable). You could make it simpler and more accurate by precalculating the range of the whole space of 2^160 hashes for each class and just checking the hash against the two boundaries. That still notionally involves rounding, but with 160 bits it is only the rounding inherent in representing a fraction like 31% in floating point.
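The simpler alternative described in the last sentence could be sketched as follows (the function name, constant names, and percentages are illustrative, not from the original code): two integer boundaries are precomputed once over the full 2^160 SHA-1 space, and each raw hash is compared against them directly, with no modulo and no floating-point scaling.

```python
import hashlib

# Precompute integer boundaries over the full 2**160 SHA-1 space once,
# then compare each raw hash against them.
SPACE = 2 ** 160
VALIDATION_PERCENTAGE = 10
TESTING_PERCENTAGE = 10
VALIDATION_BOUND = SPACE * VALIDATION_PERCENTAGE // 100
TESTING_BOUND = SPACE * (VALIDATION_PERCENTAGE + TESTING_PERCENTAGE) // 100

def which_set_by_range(file_name):
    """Assign a file name to a set by comparing its raw hash to fixed boundaries."""
    h = int(hashlib.sha1(file_name.encode('utf-8')).hexdigest(), 16)
    if h < VALIDATION_BOUND:
        return 'validation'
    if h < TESTING_BOUND:
        return 'testing'
    return 'training'
```

Because the boundaries are exact integers, the only rounding left is in the `// 100` division, i.e. the inherent imprecision of expressing a percentage as a fraction of 2^160.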

    【Discussion】:
