【Question title】: How to import pre-downloaded MNIST dataset from a specific directory or folder?
【Posted】: 2018-06-23 17:53:28
【Question description】:

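I have downloaded the MNIST dataset from LeCun's site. What I want is to write Python code that extracts the gzip files and reads the dataset directly from the directory, so that I no longer need to download from or access the MNIST site.

Desired process: access folder/directory --> extract gzip --> read dataset (one-hot encoding)

How can I do this? Almost all tutorials have to access either the LeCun or the TensorFlow site to download and read the dataset. Thanks in advance!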

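【Comments】:

  • You should decompress the gzip files locally on your computer, then use scipy.misc.imread or OpenCV to read the images into Python.
  • Have you tried anything?
  • Yes, I tried removing "from tensorflow.examples.tutorials.mnist import input_data", but it still downloads the dataset from the site. I'm still trying to figure out why it still accesses the site and downloads the dataset.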

Tags: python tensorflow machine-learning deep-learning mnist


【Solution 1】:

This TensorFlow call

from tensorflow.examples.tutorials.mnist import input_data
input_data.read_data_sets('my/directory')

...won't download anything if you already have the files there.
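As far as I understand, the download is skipped only when files with the expected names are already in that directory, so a renamed or already-unzipped file can still trigger a download. A minimal sketch for checking this, assuming the four standard archive names used later in this answer (`missing_mnist_files` is a hypothetical helper, not part of TensorFlow):

```python
import os

# Archive names as distributed on the MNIST site (the same names used in this answer).
MNIST_FILES = [
    'train-images-idx3-ubyte.gz',
    'train-labels-idx1-ubyte.gz',
    't10k-images-idx3-ubyte.gz',
    't10k-labels-idx1-ubyte.gz',
]

def missing_mnist_files(folder):
    """Return the expected archive names that are not present in `folder`."""
    return [name for name in MNIST_FILES
            if not os.path.isfile(os.path.join(folder, name))]
```

Calling `missing_mnist_files('my/directory')` should return an empty list when all four archives are in place.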

But if for some reason you want to decompress the files yourself, here's how to do it:

from tensorflow.contrib.learn.python.learn.datasets.mnist import extract_images, extract_labels

with open('my/directory/train-images-idx3-ubyte.gz', 'rb') as f:
  train_images = extract_images(f)
with open('my/directory/train-labels-idx1-ubyte.gz', 'rb') as f:
  train_labels = extract_labels(f)

with open('my/directory/t10k-images-idx3-ubyte.gz', 'rb') as f:
  test_images = extract_images(f)
with open('my/directory/t10k-labels-idx1-ubyte.gz', 'rb') as f:
  test_labels = extract_labels(f)
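Note that `tensorflow.contrib` was removed in TensorFlow 2.x, so the import above only works on 1.x releases. If that is a problem, the same gzipped IDX files can be read with just the standard library and NumPy. The following is a rough sketch under that assumption; `read_idx_images` and `read_idx_labels` are hypothetical names, not TensorFlow functions:

```python
import gzip
import numpy as np

def read_idx_images(path):
    """Read a gzipped IDX3 image file into a uint8 array of shape (n, rows, cols)."""
    with gzip.open(path, 'rb') as f:
        raw = f.read()
    # Header: four big-endian int32 values (magic number, count, rows, columns).
    magic, n, rows, cols = np.frombuffer(raw, dtype='>i4', count=4)
    if magic != 2051:
        raise ValueError('unexpected magic number %d in %s' % (magic, path))
    # The pixel bytes start right after the 16-byte header (array is read-only).
    return np.frombuffer(raw, dtype=np.uint8, offset=16).reshape(n, rows, cols)

def read_idx_labels(path):
    """Read a gzipped IDX1 label file into a uint8 array of shape (n,)."""
    with gzip.open(path, 'rb') as f:
        raw = f.read()
    # Header: two big-endian int32 values (magic number, count).
    magic, n = np.frombuffer(raw, dtype='>i4', count=2)
    if magic != 2049:
        raise ValueError('unexpected magic number %d in %s' % (magic, path))
    return np.frombuffer(raw, dtype=np.uint8, offset=8)
```

For example, `read_idx_images('my/directory/train-images-idx3-ubyte.gz')` would take the place of the `extract_images` call above, although the returned array has no trailing channel axis.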

【Comments】:

【Solution 2】:

If you have extracted the MNIST data, you can load it at a low level directly with NumPy:

import numpy as np

def loadMNIST( prefix, folder ):
    # IDX headers store big-endian 32-bit integers
    intType = np.dtype( 'int32' ).newbyteorder( '>' )
    nMetaDataBytes = 4 * intType.itemsize

    data = np.fromfile( folder + "/" + prefix + '-images-idx3-ubyte', dtype = 'ubyte' )
    # image header: magic number, image count, rows, columns
    magicBytes, nImages, nRows, nCols = np.frombuffer( data[:nMetaDataBytes].tobytes(), intType )
    data = data[nMetaDataBytes:].astype( dtype = 'float32' ).reshape( [ nImages, nRows, nCols ] )

    # label header is only two integers: magic number and label count
    labels = np.fromfile( folder + "/" + prefix + '-labels-idx1-ubyte',
                          dtype = 'ubyte' )[2 * intType.itemsize:]

    return data, labels

trainingImages, trainingLabels = loadMNIST( "train", "../datasets/mnist/" )
testImages, testLabels = loadMNIST( "t10k", "../datasets/mnist/" )
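The loader above assumes the `.gz` archives have already been decompressed. If they have not, that step can also be done from Python. A minimal sketch, assuming each archive should be unpacked next to itself with the `.gz` suffix dropped (`gunzip_all` is a hypothetical helper name):

```python
import gzip
import os
import shutil

def gunzip_all(folder):
    """Decompress every *.gz file in `folder` and return the extracted paths."""
    extracted = []
    for name in sorted(os.listdir(folder)):
        if not name.endswith('.gz'):
            continue
        src = os.path.join(folder, name)
        dst = src[:-3]  # strip the '.gz' suffix
        if not os.path.exists(dst):  # do not redo work on later runs
            with gzip.open(src, 'rb') as f_in, open(dst, 'wb') as f_out:
                shutil.copyfileobj(f_in, f_out)
        extracted.append(dst)
    return extracted
```

Running `gunzip_all("../datasets/mnist/")` once before calling `loadMNIST` should leave the four plain `-ubyte` files in place.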

And to convert to one-hot encoding:

def toHotEncoding( classification ):
    # emulates the functionality of tf.keras.utils.to_categorical( y )
    hotEncoding = np.zeros( [ len( classification ), 
                              np.max( classification ) + 1 ] )
    hotEncoding[ np.arange( len( hotEncoding ) ), classification ] = 1
    return hotEncoding

trainingLabels = toHotEncoding( trainingLabels )
testLabels = toHotEncoding( testLabels )
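One thing to watch with `toHotEncoding`: it infers the number of classes from the largest label it sees, so a subset of the data that happens to lack the digit 9 would get fewer than 10 columns. A possible variant with the class count fixed up front (a sketch; `one_hot` and its `num_classes` parameter are names introduced here for illustration):

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) float32 array with one 1.0 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=np.float32)[labels]
```

With this variant, `one_hot([0, 1, 2])` still has 10 columns even though the largest label is 2.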

【Comments】:

    【Solution 3】:

    I will show how to load it from scratch (for better understanding) and how to display a digit image with matplotlib.pyplot:

    import pickle
    import gzip
    import numpy as np
    import matplotlib.pyplot as plt

    def load_data():
        path = '../../data/mnist.pkl.gz'
        # encoding='latin1' lets Python 3 read a pickle written by Python 2
        with gzip.open(path, 'rb') as f:
            training_data, validation_data, test_data = pickle.load(f, encoding='latin1')

        X_train, y_train = training_data[0], training_data[1]
        print(X_train.shape, y_train.shape)
        # (50000, 784) (50000,)

        # get the first image and its label
        img1_arr, img1_label = X_train[0], y_train[0]
        print(img1_arr.shape, img1_label)
        # (784,) 5

        # reshape the first image (1-D vector) into a 2-D image
        img1_2d = np.reshape(img1_arr, (28, 28))
        # show it
        plt.subplot(111)
        plt.imshow(img1_2d, cmap=plt.get_cmap('gray'))
        plt.show()


    You can also vectorize a label into a 10-dimensional unit vector with this sample function:

    def vectorized_result(label):
        e = np.zeros((10, 1))
        e[label] = 1.0
        return e


    Vectorizing the label above:

    print(vectorized_result(img1_label))  # img1_label == 5, from load_data() above
    # output as below:
    [[0.]
     [0.]
     [0.]
     [0.]
     [0.]
     [1.]
     [0.]
     [0.]
     [0.]
     [0.]]


    If you want to turn it into CNN input, you can reshape it like this:

    def load_data_v2():
        path = '../../data/mnist.pkl.gz'
        with gzip.open(path, 'rb') as f:
            training_data, validation_data, test_data = pickle.load(f, encoding='latin1')

        X_train, y_train = training_data[0], training_data[1]
        print(X_train.shape, y_train.shape)
        # (50000, 784) (50000,)

        X_train = np.array([np.reshape(item, (28, 28)) for item in X_train])
        y_train = np.array([vectorized_result(item) for item in y_train])

        print(X_train.shape, y_train.shape)
        # (50000, 28, 28) (50000, 10, 1)
        return X_train, y_train
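The per-item list comprehensions above work, but the same reshaping can be done on the whole array at once. A minimal sketch, assuming 28x28 images, a channels-last layout with a single grayscale channel, and flat `(N, 10)` one-hot labels rather than `(N, 10, 1)`; `to_cnn_input` is a hypothetical helper name:

```python
import numpy as np

def to_cnn_input(X_flat, y, num_classes=10):
    """Reshape (N, 784) images to (N, 28, 28, 1) and labels to (N, num_classes)."""
    # -1 lets NumPy infer N; the trailing 1 is the grayscale channel axis.
    X = np.asarray(X_flat, dtype=np.float32).reshape(-1, 28, 28, 1)
    # Index the identity matrix with the labels to get one-hot rows.
    Y = np.eye(num_classes, dtype=np.float32)[np.asarray(y, dtype=np.int64)]
    return X, Y
```

Whether the channel axis goes last or first depends on the framework, so adjust the target shape to whatever your CNN library expects.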
    

    【Comments】:
