【Question Title】: ResNet model taking too long to train
【Posted】: 2021-01-12 00:44:08
【Question Description】:

I am using this tutorial to learn transfer learning for my model. As you can see, a single epoch for him averages about 1 second.

Epoch 1/100
1080/1080 [==============================] - 10s 10ms/step - loss: 3.6862 - acc: 0.2000
Epoch 2/100
1080/1080 [==============================] - 1s 1ms/step - loss: 3.0746 - acc: 0.2574
Epoch 3/100
1080/1080 [==============================] - 1s 1ms/step - loss: 2.6839 - acc: 0.3185
Epoch 4/100
1080/1080 [==============================] - 1s 1ms/step - loss: 2.3929 - acc: 0.3583
Epoch 5/100
1080/1080 [==============================] - 1s 1ms/step - loss: 2.1382 - acc: 0.3870
Epoch 6/100
1080/1080 [==============================] - 1s 1ms/step - loss: 1.7810 - acc: 0.4593

But when I follow almost the same code for my CIFAR model, a single epoch takes about 1 hour to run.

Train on 50000 samples
 3744/50000 [=>............................] - ETA: 43:38 - loss: 3.3223 - acc: 0.1760

My code is:

from tensorflow.keras.applications import ResNet50
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
from tensorflow.keras import Model

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train = x_train / 255.0
x_test = x_test / 255.0

y_train = to_categorical(y_train)
y_test = to_categorical(y_test)

base_model = ResNet50(weights= None, include_top=False, input_shape= (32,32,3))

x = base_model.output
x = GlobalAveragePooling2D()(x)
x = Dropout(0.4)(x)
predictions = Dense(10 , activation= 'softmax')(x)
model = Model(inputs = base_model.input, outputs = predictions)

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])

hist = model.fit(x_train, y_train)

Note that I am using the CIFAR-10 dataset for this model. Is there something wrong with my code or data, and how can I improve it? One hour for a single epoch is too long. I also have an NVIDIA MX-110 with 2 GB, and TensorFlow is using it.

【Question Discussion】:

    Tags: python tensorflow machine-learning keras deep-learning


    【Solution 1】:

    I copied and ran your code, but to get it to run I had to make the changes below:

    import tensorflow as tf
    from tensorflow.keras.applications import ResNet50
    from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
    from tensorflow.keras import Model
    
    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
    print (len(x_train))
    x_train = x_train / 255.0
    x_test = x_test / 255.0
    
    y_train = tf.keras.utils.to_categorical(y_train)
    y_test = tf.keras.utils.to_categorical(y_test)
    
    base_model = ResNet50(weights= None, include_top=False, input_shape= (32,32,3))
    
    x = base_model.output
    x = GlobalAveragePooling2D()(x)
    x = Dropout(0.4)(x)
    predictions = Dense(10 , activation= 'softmax')(x)
    model = Model(inputs = base_model.input, outputs = predictions)
    
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['acc'])
    
    hist = model.fit(x_train, y_train)
    # the result for 2 epochs is shown below
    50000
    Epoch 1/2
    1563/1563 [==============================] - 58s 37ms/step - loss: 2.8654 - acc: 0.2537
    Epoch 2/2
    1563/1563 [==============================] - 51s 33ms/step - loss: 2.5331 - acc: 0.2748
    

    If you do not specify a batch size, it defaults to 32 per the model.fit documentation, so 50,000 samples / 32 = 1563 steps. For some reason, in your code the batch size defaulted to 1; I don't know why. So set batch_size=50 and you then need only 1000 steps. To speed things up further, I would set weights="imagenet" and freeze the layers in the base model:

    # freeze the layers in the base model
    for layer in base_model.layers:
        layer.trainable = False
    #if you set batch_size=50, weights="imagenet" with the base model frozen you get
    50000
    Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
    94773248/94765736 [==============================] - 5s 0us/step
    Epoch 1/2
    1000/1000 [==============================] - 16s 16ms/step - loss: 2.5101 - acc: 0.1487
    Epoch 2/2
    1000/1000 [==============================] - 10s 10ms/step - loss: 2.1159 - acc: 0.2249
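    The step counts in the logs above follow directly from ceil(samples / batch_size); a quick standalone check (not part of the original answer):

    ```python
    import math

    samples = 50_000  # CIFAR-10 training set size
    for batch_size in (1, 32, 50):
        steps = math.ceil(samples / batch_size)
        print(f"batch_size={batch_size}: {steps} steps per epoch")
    # batch_size=1 gives 50000 steps, 32 gives 1563, 50 gives 1000
    ```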
    

    【Discussion】:

      【Solution 2】:

      It looks like you are not batching your data, so each forward pass of the model sees only a single training example, which is very inefficient.

      Try setting a batch size in your model.fit() call:

      # num_epochs: however many epochs you want to train for
      hist = model.fit(x_train, y_train, batch_size=16, epochs=num_epochs,
                       validation_data=(x_test, y_test), shuffle=True)
      

      Tune your batch size to the largest that fits in GPU memory; try a few different values before settling on one.
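      That trial-and-error search can be sketched as a helper that probes candidate sizes from largest to smallest. Here `fits_in_memory` is a hypothetical callback you would implement yourself, e.g. by running a single training step inside a try/except on `tf.errors.ResourceExhaustedError`:

      ```python
      def largest_batch_size(candidates, fits_in_memory):
          """Return the largest candidate batch size that fits in memory, or None."""
          for bs in sorted(candidates, reverse=True):
              if fits_in_memory(bs):
                  return bs
          return None

      # toy stand-in for the real OOM check: pretend anything up to 64 fits
      print(largest_batch_size([16, 32, 64, 128, 256], lambda bs: bs <= 64))  # 64
      ```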

      【Discussion】:
