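[Posted]: 2021-08-19 22:48:23
[Question]:
I'm trying to train a CNN on the CIFAR10 dataset using the Sequential API, but training somehow gets stuck after the first epoch. I ran nvidia-smi and found that my GPU utilization is low, which is not usually the case. Here is my code: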
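from tensorflow.keras import layers
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# X_train, y_train, X_valid, y_valid are the CIFAR10 arrays, loaded and split elsewhere (not shown).
# Random augmentation is applied to the training images only.
train_datagen = ImageDataGenerator(
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)
# Validation images are passed through unchanged.
test_datagen = ImageDataGenerator()
train_generator = train_datagen.flow(
    X_train, y_train,
    batch_size=64)
validation_generator = test_datagen.flow(
    X_valid, y_valid,
    batch_size=64)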
model = Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same', input_shape=(32, 32, 3)))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(32, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.2))
model.add(layers.Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(64, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.3))
model.add(layers.Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(layers.BatchNormalization())
model.add(layers.Conv2D(128, (3, 3), activation='relu', kernel_initializer='he_uniform', padding='same'))
model.add(layers.BatchNormalization())
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Dropout(0.4))
model.add(layers.Flatten())
model.add(layers.Dense(128, activation='relu', kernel_initializer='he_uniform'))
model.add(layers.Dropout(0.5))
model.add(layers.Dense(10, activation='softmax'))
model.compile(loss="sparse_categorical_crossentropy",
              optimizer="Adam",
              metrics=["accuracy"])
history = model.fit(train_generator,
                    steps_per_epoch=X_train.shape[0] // 64,  # number of images // batch size
                    epochs=50,
                    validation_data=validation_generator)
This is where training stops:
Epoch 1/50
703/703 [==============================] - ETA: 0s - loss: 2.3803 - accuracy: 0.0768
nvidia-smi gives the following output:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 466.27       Driver Version: 466.27       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ... WDDM  | 00000000:01:00.0  On |                  N/A |
| N/A   51C    P8     3W /  N/A |    444MiB /  4096MiB |      9%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1516    C+G   Insufficient Permissions        N/A      |
|    0   N/A  N/A      4824    C+G   ...lPanel\SystemSettings.exe    N/A      |
|    0   N/A  N/A      5268    C+G   C:\Windows\explorer.exe         N/A      |
|    0   N/A  N/A      8268    C+G   ...b3d8bbwe\WinStore.App.exe    N/A      |
|    0   N/A  N/A      8612    C+G   ...nputApp\TextInputHost.exe    N/A      |
|    0   N/A  N/A      9232    C+G   ...5n1h2txyewy\SearchApp.exe    N/A      |
|    0   N/A  N/A     10280    C+G   ...cw5n1h2txyewy\LockApp.exe    N/A      |
|    0   N/A  N/A     12540    C+G   ...8wekyb3d8bbwe\Cortana.exe    N/A      |
|    0   N/A  N/A     13164    C+G   ...y\ShellExperienceHost.exe    N/A      |
|    0   N/A  N/A     13292    C+G   ...me\Application\chrome.exe    N/A      |
|    0   N/A  N/A     14008    C+G   ...ekyb3d8bbwe\YourPhone.exe    N/A      |
|    0   N/A  N/A     14420    C+G   ...wekyb3d8bbwe\Video.UI.exe    N/A      |
|    0   N/A  N/A     15408    C+G   ...batNotificationClient.exe    N/A      |
|    0   N/A  N/A     16084    C+G   ...ekyb3d8bbwe\HxOutlook.exe    N/A      |
+-----------------------------------------------------------------------------+
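A single nvidia-smi snapshot like the one above is hard to compare across runs, so it may help to poll just the numbers of interest while training is (or should be) running. The sketch below assumes nvidia-smi is on the PATH and uses its `--query-gpu` and `--format=csv,noheader,nounits` options; the helper names `parse_gpu_stats` and `query_gpu_stats` are hypothetical, and the parser assumes every field is a plain integer (some GPUs report `[N/A]` for certain fields, which this sketch does not handle).

```python
import csv
import io
import subprocess

QUERY_FIELDS = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse 'util, used, total' CSV lines (one per GPU) into dictionaries."""
    stats = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        util, used, total = (int(field.strip()) for field in row)
        stats.append({
            "util_percent": util,
            "memory_used_mib": used,
            "memory_total_mib": total,
        })
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return the parsed stats (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)


# The values from the snapshot above, in the CSV layout nvidia-smi prints.
sample = "9, 444, 4096\n"
print(parse_gpu_stats(sample))
```

Calling `query_gpu_stats()` every few seconds during the first epoch and again while training appears stuck would show whether utilization and memory use actually change between the two phases.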
[Comments]:
- Can you try steps_per_epoch = train_generator.samples // train_generator.batch_size in model.fit and let us know?
- I used steps_per_epoch = train_generator.samples // train_generator.batch_size, but it raises another error: 'NumpyArrayIterator' object has no attribute 'samples'
- Can you try steps_per_epoch = len(train_generator) /batch_size and let us know?
- If that doesn't work either, could you share the complete code to reproduce your issue, so that we can do our best to help? Thanks!
- Still stuck. Here is the link to my code: link
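The comments go back and forth on how to compute steps_per_epoch, so a small arithmetic sketch may help. It assumes 45,000 training images, a figure that is not stated in the post but is consistent with the 703 steps shown in the log; `steps_for` is a hypothetical helper. Note that `len()` of an iterator returned by `flow()` is already a count of batches, not of samples, so dividing it by the batch size again would give far too few steps.

```python
import math


def steps_for(num_samples, batch_size):
    """Return (full batches only, batches including the final partial one)."""
    full_batches = num_samples // batch_size
    all_batches = math.ceil(num_samples / batch_size)
    return full_batches, all_batches


# Assumption: 45,000 training images with batch_size = 64.
full_batches, all_batches = steps_for(45_000, 64)
print(full_batches, all_batches)  # 703 704
```

With these numbers, floor division reproduces the 703 steps in the question, while one more step would be needed to also cover the 8 leftover images in the final partial batch.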
Tags: tensorflow conv-neural-network