Tensorflow 图像分类教程，第二个 epoch 崩溃答案

【问题标题】：Tensorflow Image Classification Tutorial, crash on the second epochTensorflow 图像分类教程，第二个 epoch 崩溃
【发布时间】：2021-06-08 12:54:58
【问题描述】：

我正在尝试复制此处的 TensorFlow 图像分类教程 https://www.tensorflow.org/tutorials/images/classification

但是，由于某种原因，程序在第一个 epoch 之后崩溃，退出代码为 -1073740791 (0xC0000409)。

这是我复制的代码：

import matplotlib.pyplot as plt
import numpy as np
import os
import PIL
import tensorflow as tf
import matplotlib.pyplot as plt

from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential


import pathlib
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
data_dir = pathlib.Path(data_dir)

image_count = len(list(data_dir.glob('*/*.jpg')))
print(image_count)

roses = list(data_dir.glob('roses/*'))
tulips = list(data_dir.glob('tulips/*'))

batch_size = 32
img_height = 180
img_width = 180

train_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,#originally 0.2
  subset="training",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)

val_ds = tf.keras.preprocessing.image_dataset_from_directory(
  data_dir,
  validation_split=0.2,
  subset="validation",
  seed=123,
  image_size=(img_height, img_width),
  batch_size=batch_size)


class_names = train_ds.class_names

#Configure the DataSet for performance
AUTOTUNE = tf.data.AUTOTUNE


#.cache = keeps the img in memory after they have been loaded off disk during the first epoch
#.prefetch = overlaps data preprocessing and model execution while training
train_ds = train_ds.cache().shuffle(1000).prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)

#standardizing the RGB values to be within [0,1] range

normalization_layer = layers.experimental.preprocessing.Rescaling(1./255)

normalized_ds = train_ds.map(lambda x, y: (normalization_layer(x), y))
image_batch, labels_batch = next(iter(normalized_ds))
first_image = image_batch[0]
# Notice the pixels values are now in `[0,1]`.
print(np.min(first_image), np.max(first_image))

num_classes = 5

model = Sequential([
  layers.experimental.preprocessing.Rescaling(1./255, input_shape=(img_height, img_width, 3)),
  layers.Conv2D(16, 3, padding='same', activation='relu'), #Relu is Mutually exlusive, the initial classification
  layers.MaxPooling2D(),
  layers.Conv2D(32, 3, padding='same', activation='relu'),
  layers.MaxPooling2D(),
  layers.Conv2D(64, 3, padding='same', activation='relu'), #
  layers.MaxPooling2D(),
  layers.Flatten(),
  layers.Dense(128, activation='relu'),
  layers.Dense(num_classes)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.summary()

epochs=10
history = model.fit(
  train_ds,
  validation_data=val_ds,
  epochs=epochs
)

这是输出：

2021-03-10 11:33:28.285891: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
3670
Found 3670 files belonging to 5 classes.
Using 2936 files for training.
2021-03-10 11:33:40.172866: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-03-10 11:33:40.182801: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-03-10 11:33:40.600845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce 940MX computeCapability: 5.0
coreClock: 0.8605GHz coreCount: 4 deviceMemorySize: 2.00GiB deviceMemoryBandwidth: 37.33GiB/s
2021-03-10 11:33:40.602057: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-03-10 11:33:40.680462: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-03-10 11:33:40.680774: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-03-10 11:33:40.725064: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-03-10 11:33:40.762494: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-03-10 11:33:40.826237: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-03-10 11:33:40.879988: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-03-10 11:33:40.886178: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-03-10 11:33:41.052224: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-03-10 11:33:41.065505: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-10 11:33:41.067812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce 940MX computeCapability: 5.0
coreClock: 0.8605GHz coreCount: 4 deviceMemorySize: 2.00GiB deviceMemoryBandwidth: 37.33GiB/s
2021-03-10 11:33:41.068679: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-03-10 11:33:41.069536: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-03-10 11:33:41.069884: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-03-10 11:33:41.070227: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-03-10 11:33:41.072721: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-03-10 11:33:41.072988: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-03-10 11:33:41.073254: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-03-10 11:33:41.073754: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-03-10 11:33:41.074267: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-03-10 11:33:43.232733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-10 11:33:43.233087: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-03-10 11:33:43.233294: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2021-03-10 11:33:43.234651: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 1366 MB memory) -> physical GPU (device: 0, name: GeForce 940MX, pci bus id: 0000:01:00.0, compute capability: 5.0)
2021-03-10 11:33:43.238580: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Found 3670 files belonging to 5 classes.
Using 734 files for validation.
2021-03-10 11:33:43.923781: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-03-10 11:33:54.037806: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 53 of 1000
2021-03-10 11:34:03.968256: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:177] Filling up shuffle buffer (this may take a while): 87 of 1000
2021-03-10 11:34:04.136991: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:230] Shuffle buffer filled.
0.0 1.0
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
rescaling_1 (Rescaling)      (None, 180, 180, 3)       0         
_________________________________________________________________
conv2d (Conv2D)              (None, 180, 180, 16)      448       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 90, 90, 16)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 90, 90, 32)        4640      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 45, 45, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 45, 45, 64)        18496     
_________________________________________________________________
max_pooling2d_2 (MaxPooling2 (None, 22, 22, 64)        0         
_________________________________________________________________
flatten (Flatten)            (None, 30976)             0         
_________________________________________________________________
dense (Dense)                (None, 128)               3965056   
_________________________________________________________________
dense_1 (Dense)              (None, 5)                 645       
=================================================================
Total params: 3,989,285
Trainable params: 3,989,285
Non-trainable params: 0
_________________________________________________________________
Epoch 1/10
2021-03-10 11:34:08.244592: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-03-10 11:34:09.792816: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-03-10 11:34:09.825987: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll

Process finished with exit code -1073740791 (0xC0000409)

我做错了什么以及这个问题的原因是什么？

我的规格： Intel (R) Core (TM) i5-7200U CPU @ 2.50GHz 2.70GHz

8GB 内存

位操作系统，基于 x64 的处理器

带有 GeForce Game Ready 驱动程序 (v. 461.72) 的 NVIDIA GEFORCE 940MX

Python 版本： 3.8.5

TensorFlow： tensorflow._api.v2.version

CUDA： 11.0

【问题讨论】：

内存不足了吗？ 8BG 对我来说似乎并不多。我很难在我的 16GB 笔记本电脑上运行一些 tensorflow 作业。
事实上我做到了（尽管在这个例子中没有）。 10/10 会推荐 Google Collab 进行培训

标签： python tensorflow classification

【解决方案1】：

我从 CUDA 文件夹中删除了 cudnn64_8.dll，并将 CUDNN 主文件夹添加到路径中（而不是之前的 pin。

【讨论】：