【Posted】: 2017-06-16 17:33:48
【Question】:
The following is a small snippet from the full code.
I am trying to understand the logic behind this splitting method.
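- A SHA1 digest is 40 hexadecimal characters. What kind of probability does the expression compute?
- What is the reason for (MAX_NUM_IMAGES_PER_CLASS + 1)? Why add 1?
- Does setting MAX_NUM_IMAGES_PER_CLASS to a different value affect the quality of the split?
- How good a split can we expect from this? Is this a recommended way of splitting a dataset?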
      # We want to ignore anything after '_nohash_' in the file name when
      # deciding which set to put an image in, the data set creator has a way of
      # grouping photos that are close variations of each other. For example
      # this is used in the plant disease data set to group multiple pictures of
      # the same leaf.
      hash_name = re.sub(r'_nohash_.*$', '', file_name)
      # This looks a bit magical, but we need to decide whether this file should
      # go into the training, testing, or validation sets, and we want to keep
      # existing files in the same set even if more files are subsequently
      # added.
      # To do that, we need a stable way of deciding based on just the file name
      # itself, so we do a hash of that and then use that to generate a
      # probability value that we use to assign it.
      hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
      percentage_hash = ((int(hash_name_hashed, 16) %
                          (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                         (100.0 / MAX_NUM_IMAGES_PER_CLASS))
      if percentage_hash < validation_percentage:
        validation_images.append(base_name)
      elif percentage_hash < (testing_percentage + validation_percentage):
        testing_images.append(base_name)
      else:
        training_images.append(base_name)
    result[label_name] = {
        'dir': dir_name,
        'training': training_images,
        'testing': testing_images,
        'validation': validation_images,
    }
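To experiment with the expression in isolation, here is a minimal standalone sketch of the same assignment logic. It rests on a few assumptions that are not in the snippet above: the value `2 ** 27 - 1` for `MAX_NUM_IMAGES_PER_CLASS` is assumed rather than quoted, `str.encode('utf-8')` stands in for `compat.as_bytes`, and the helper names `assign_split` and `count_splits` are hypothetical.

```python
import hashlib
import re
from collections import Counter

# Assumed value; the snippet above does not show how this constant is defined.
MAX_NUM_IMAGES_PER_CLASS = 2 ** 27 - 1


def assign_split(file_name, validation_percentage, testing_percentage):
    """Return 'validation', 'testing', or 'training' based only on the file name."""
    # Drop everything after '_nohash_' so grouped variants hash identically.
    hash_name = re.sub(r'_nohash_.*$', '', file_name)
    # encode('utf-8') is a stand-in for compat.as_bytes in the original snippet.
    hash_name_hashed = hashlib.sha1(hash_name.encode('utf-8')).hexdigest()
    # Reduce the 160-bit digest to [0, MAX], then rescale to roughly [0, 100].
    percentage_hash = ((int(hash_name_hashed, 16) %
                        (MAX_NUM_IMAGES_PER_CLASS + 1)) *
                       (100.0 / MAX_NUM_IMAGES_PER_CLASS))
    if percentage_hash < validation_percentage:
        return 'validation'
    elif percentage_hash < (testing_percentage + validation_percentage):
        return 'testing'
    return 'training'


def count_splits(file_names, validation_percentage, testing_percentage):
    """Count how many of the given file names land in each split."""
    return Counter(
        assign_split(name, validation_percentage, testing_percentage)
        for name in file_names
    )
```

Calling `count_splits(['img_%d.jpg' % i for i in range(10000)], 10, 10)` on made-up file names gives a rough feel for how close the realized proportions come to the requested 10/10/80. Because the result depends only on the file name, rerunning it after adding more names leaves the earlier assignments unchanged.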
【Discussion】:
Tags: python machine-learning tensorflow sha