I can't train my NN with TensorFlow using GPU

【Question title】: I can't train my NN with TensorFlow using GPU
【Posted】: 2021-12-10 15:30:00
【Question description】:

This is my first deep learning project (computer vision - diabetic retinopathy).

I'm trying to run my experiments on my GPU (NVidia RTX 3050).

I followed this tutorial, https://shawnhymel.com/1961/how-to-install-tensorflow-with-gpu-support-on-windows/, to install CUDA and cuDNN and enable TensorFlow with GPU support.

IDE: PyCharm 2021.3

Interpreter: Python 3.9 (conda)

TensorFlow build with GPU support for Python 3.9: https://storage.googleapis.com/tensorflow/windows/gpu/tensorflow_gpu-2.7.0-cp39-cp39-win_amd64.whl
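
A quick sanity check for a setup like this is to ask the installed wheel which CUDA and cuDNN versions it was built against, since those need to be compatible with the libraries installed locally. The following is only a minimal sketch: it assumes `tf.sysconfig.get_build_info()` is available (TensorFlow 2.3 or later), the helper name `summarize_build` is hypothetical, and because the exact keys and value formats can differ between platforms, missing keys are reported as "unknown".

```python
def summarize_build(info):
    """Format the CUDA-related fields of a TensorFlow build-info mapping."""
    # .get() is used because the available keys may vary by platform/build.
    return "cuda_build={} cuda={} cudnn={}".format(
        info.get("is_cuda_build", "unknown"),
        info.get("cuda_version", "unknown"),
        info.get("cudnn_version", "unknown"),
    )


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this interpreter")
    else:
        # Versions the wheel was compiled against (not what is installed locally).
        print(summarize_build(tf.sysconfig.get_build_info()))
        # GPUs that TensorFlow can actually see at runtime.
        print(tf.config.list_physical_devices("GPU"))
```

Run it in the same conda interpreter that PyCharm uses, so the result reflects the environment the training script sees.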

Code:

    print(tf.config.experimental.list_physical_devices())

    train_df, valid_df, test_df = get_dataset.get_datasets()
    trainGen, valGen, testGen = data_gen.get_data_generators(train_df, valid_df, test_df, image_size=IMAGE_SIZE, BS=32)
    with tf.device('/GPU:0'):
        model = model.get_model(image_size=299, model_type='InceptionV3_att')
        # model_type options: 'InceptionV3', 'InceptionV3_att', 'DenseNet121'

        model.compile(
            optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),  # by default learning_rate=0.001
            loss='categorical_crossentropy',
            metrics=[tf.keras.metrics.CategoricalAccuracy(name="cat_acc"), tf.keras.metrics.AUC(name='auc'),
                    tf.keras.metrics.Recall(name='recall'), tf.keras.metrics.Precision(name='precision'),
                    tfa.metrics.CohenKappa(num_classes=5, sparse_labels=False, weightage='quadratic')]
        )
        model.summary()
        # keras.utils.plot_model(model, show_shapes=True)

        history = model.fit(
            trainGen,
            epochs=NUM_EPOCHS,
            steps_per_epoch=len(train_df) // BS,
            validation_data=valGen,
            validation_steps=len(valid_df) // BS,
            class_weight={0: len(train_df[train_df['Label'] == '0']) / len(train_df),
                          1: len(train_df[train_df['Label'] == '1']) / len(train_df),
                          2: len(train_df[train_df['Label'] == '2']) / len(train_df),
                          3: len(train_df[train_df['Label'] == '3']) / len(train_df),
                          4: len(train_df[train_df['Label'] == '4']) / len(train_df)},
            shuffle=True,
            callbacks=[
                # tf.keras.callbacks.EarlyStopping(patience=11, verbose=1),
                tf.keras.callbacks.ReduceLROnPlateau(patience=4, verbose=1),
                tf.keras.callbacks.ModelCheckpoint(filepath='bestmodel.h5', save_best_only=True, verbose=1)]
        )

I can't train my model unless I use the CPU:

    with tf.device('/CPU:0'):

This is the output I get:

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

2021-12-10 17:10:52.291326: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2 To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

2021-12-10 17:10:53.019482: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 1671 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3050 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6

Epoch 1/20

2021-12-10 17:11:00.077237: I tensorflow/stream_executor/cuda/cuda_dnn.cc:366] Loaded cuDNN version 8301

Process finished with exit code -1073740791 (0xC0000409)
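
The two numbers on the last line are the same value: PyCharm prints the exit code as a signed 32-bit integer and then as hex. A minimal sketch of that conversion follows. The helper name `to_ntstatus` and the one-entry lookup table are illustrative assumptions. The status name only says that Windows aborted the process; by itself it does not identify which library failed.

```python
def to_ntstatus(exit_code):
    """Reinterpret a signed 32-bit process exit code as an unsigned value."""
    return exit_code & 0xFFFFFFFF


# Deliberately incomplete lookup table: only the code from this log is listed.
KNOWN_STATUS = {0xC0000409: "STATUS_STACK_BUFFER_OVERRUN"}

code = to_ntstatus(-1073740791)
print(hex(code), KNOWN_STATUS.get(code, "unknown"))
```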

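【Comments】:

Does this question answer your question?
I just ran into a very similar problem, and Marte's post really helped me (thanks!): I had installed CUDA and cuDNN but overlooked the section about zlibwapi.dll in docs.nvidia.com/deeplearning/cudnn/install-guide/index.html. The missing DLL causes exactly the symptoms the OP describes. Rather than use the outdated prebuilt binary served over an insecure connection, I rebuilt the DLL from the latest zlib 1.2.11 source release using the SLN in contrib/vsstudio/vc14, then put the resulting binary into my c:/dev/cudnn/bin folder, which was already on PATH.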
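
To check whether zlibwapi.dll is reachable the way this comment describes, one option is to scan the directories on PATH for it. This is a minimal sketch: the helper name `find_on_path` is hypothetical, and it only inspects PATH, whereas Windows also searches other locations when loading a DLL, so an empty result is a hint rather than proof.

```python
import os


def find_on_path(filename, path=None):
    """Return every directory on PATH that contains `filename`."""
    if path is None:
        path = os.environ.get("PATH", "")
    hits = []
    for directory in path.split(os.pathsep):
        # Skip empty entries produced by stray separators.
        if directory and os.path.isfile(os.path.join(directory, filename)):
            hits.append(directory)
    return hits


if __name__ == "__main__":
    hits = find_on_path("zlibwapi.dll")
    print(hits if hits else "zlibwapi.dll not found on PATH")
```

Run it from the same environment as the training script, because PyCharm and conda can each present a different PATH to the process.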

Tags: python gpu tensorflow2.0

【Solution 1】:

ASUS Dual GeForce RTX 3050 OC  

NVIDIA-Linux-x86_64-510.47.03.run (location:Data Center/Tesla)  
cuda_11.3.1_465.19.01_linux.run  
libcudnn8_8.2.1.32-1+cuda11.3_amd64.deb  
libcudnn8-dev_8.2.1.32-1+cuda11.3_amd64.deb  
libcudnn8-samples_8.2.1.32-1+cuda11.3_amd64.deb  
pip3 install torch==1.11.0+cu113 torchvision==0.12.0+cu113 torchaudio==0.11.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html  

With these installed, I tested PPO-AI on CUDA and it worked.  
I think the same setup should work for TensorFlow too.  
You can try installing these files and testing again.  

If you want to uninstall the current CUDA version:  
sudo /usr/local/cuda-X.Y/bin/cuda-uninstaller

CUDA and the corresponding driver versions
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html
Drivers
https://www.nvidia.com/download/find.aspx
All CUDA versions
https://developer.nvidia.com/cuda-toolkit-archive
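
To compare an installed driver against the release notes linked above, you first need the driver version. A minimal sketch, assuming the `nvidia-smi` tool that ships with the NVIDIA driver is on PATH; the helper name `query_gpus` is hypothetical, and it returns None rather than raising when the tool is missing or fails.

```python
import shutil
import subprocess


def query_gpus():
    """Return one 'name, driver_version' line per GPU, or None if unavailable."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # driver tools not installed or not on PATH
    result = subprocess.run(
        [exe, "--query-gpu=name,driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return None  # tool present but the query failed
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    print(query_gpus())
```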

【Comments】: