Moved to TensorFlow 2.0, training now hangs after third step

【Question title】: Moved to Tensorflow 2.0, training now hangs after third step
【Posted】: 2019-11-27 00:25:18
【Question】:

I recently decided to move from TensorFlow 1.14 (GPU variant) to the current 2.0 release.

My current setup is:

TensorFlow (GPU variant) 2.0
cuDNN 7.6.4
CUDA 10
Python 3.6
IDE: Visual Studio 2019
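
A setup like the one listed above can be confirmed from inside the interpreter that actually runs the project. The following is a minimal sketch and not part of the original post: `environment_report` is a hypothetical helper name, and it assumes TensorFlow 2.0, where device listing lives under `tf.config.experimental`. If TensorFlow cannot be imported, the TensorFlow fields simply stay empty.

```python
import platform


def environment_report():
    """Collect interpreter, OS and TensorFlow/GPU details into a dict."""
    report = {
        "python": platform.python_version(),
        "os": platform.system(),
        "tensorflow": None,
        "built_with_cuda": None,
        "gpus": [],
    }
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not importable from this interpreter.
        return report

    report["tensorflow"] = tf.__version__
    report["built_with_cuda"] = tf.test.is_built_with_cuda()
    # In TF 2.0 this call is under tf.config.experimental.
    devices = tf.config.experimental.list_physical_devices("GPU")
    report["gpus"] = [device.name for device in devices]
    return report


if __name__ == "__main__":
    for key, value in environment_report().items():
        print("{}: {}".format(key, value))
```

An empty `gpus` list with `built_with_cuda` set to `True` would suggest that the GPU build is installed but cannot see the card, which is a different failure from the hang described here.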

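I did expect some pain, but this caught me off guard.

When I run one of my (now adapted) 1.14 projects, the model builds without problems and training starts normally, only to come to a complete stop after the third step. The same project runs fine on the CPU variant of TensorFlow 2.0, but training all the models takes orders of magnitude longer.

Here is what I have tried so far:

Changing hyperparameters
Reinstalling CUDA
Reinstalling TensorFlow
Reinstalling cuDNN
Disabling validation
Checking the PATH variables

None of these helped. My only lead is this warning message: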

 Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.

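I never saw this warning with TF 1.14, and I am somewhat confused by it. I know CUDA works, because I compiled and ran several of the NVIDIA samples. So the only real remaining possibility is something in TensorFlow itself or in how it handles GPUs.

But I don't know how to proceed.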
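
One generic way to get further with a hang like this is to have Python dump its thread stacks once the process has been stuck for a while. The sketch below uses the standard-library `faulthandler` module. `hang_watchdog` and the log file name are hypothetical, and the commented-out `model.fit(...)` call stands in for the project's own training code, which is not shown in this post.

```python
import contextlib
import faulthandler


@contextlib.contextmanager
def hang_watchdog(log_path, timeout_s=120.0):
    """Write all Python thread stacks to log_path every timeout_s seconds
    for as long as the guarded block is still running."""
    with open(log_path, "w") as log_file:
        # faulthandler writes straight to the file descriptor, so the
        # file has to stay open until the watchdog is cancelled.
        faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
        try:
            yield
        finally:
            faulthandler.cancel_dump_traceback_later()


# Usage sketch (training call is a placeholder):
# with hang_watchdog("hang_traceback.log", timeout_s=120):
#     model.fit(...)
```

If training stalls, the log shows which Python frame each thread is waiting in. That only narrows things down to the Python side; a stall inside the CUDA driver would just show the thread blocked in a TensorFlow call.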

The session log follows:

2019-11-27 01:03:57.910895: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudart64_100.dll
C:\Program Files (x86)\Microsoft Visual Studio\Shared\Python36_64\lib\site-packages\pandas\core\frame.py:4117: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,
2019-11-27 01:04:02.247959: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library nvcuda.dll
2019-11-27 01:04:02.277414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.835
pciBusID: 0000:0a:00.0
2019-11-27 01:04:02.282378: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-27 01:04:02.286653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-27 01:04:02.289629: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2019-11-27 01:04:02.295084: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce GTX 1070 major: 6 minor: 1 memoryClockRate(GHz): 1.835
pciBusID: 0000:0a:00.0
2019-11-27 01:04:02.299843: I tensorflow/stream_executor/platform/default/dlopen_checker_stub.cc:25] GPU libraries are statically linked, skip dlopen check.
2019-11-27 01:04:02.303965: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2019-11-27 01:04:03.043700: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-27 01:04:03.047132: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2019-11-27 01:04:03.049453: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
2019-11-27 01:04:03.052642: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6382 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1070, pci bus id: 0000:0a:00.0, compute capability: 6.1)
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding (Embedding)        (None, 154, 64)           896000
_________________________________________________________________
conv1d (Conv1D)              (None, 150, 64)           20544
_________________________________________________________________
flatten (Flatten)            (None, 9600)              0
_________________________________________________________________
dense (Dense)                (None, 300)               2880300
_________________________________________________________________
dense_1 (Dense)              (None, 150)               45150
_________________________________________________________________
dense_2 (Dense)              (None, 70)                10570
_________________________________________________________________
dense_3 (Dense)              (None, 10)                710
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 22
=================================================================
Total params: 3,853,296
Trainable params: 3,853,296
Non-trainable params: 0
_________________________________________________________________
Train for 10 steps, validate for 50 steps
Epoch 1/40
2019-11-27 01:04:06.199581: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cublas64_100.dll
2019-11-27 01:04:06.430358: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2019-11-27 01:04:07.180709: W tensorflow/stream_executor/cuda/redzone_allocator.cc:312] Internal: Invoking ptxas not supported on Windows
Relying on driver to perform ptx compilation. This message will be only logged once.
2019-11-27 01:04:07.425377: I tensorflow/core/profiler/lib/profiler_session.cc:184] Profiler session started.
2019-11-27 01:04:07.431736: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cupti64_100.dll
 1/10 [==>...........................] - ETA: 32s - loss: 0.6933 - accuracy: 0.4375 - categorical_accuracy: 0.4375 - precision: 0.4375 - recall: 0.4375
2019-11-27 01:04:07.655586: I tensorflow/core/platform/default/device_tracer.cc:588] Collecting 148 kernel records, 21 memcpy records.
WARNING: Logging before flag parsing goes to stderr.
W1127 01:04:07.730274  5696 callbacks.py:244] Method (on_train_batch_end) is slow compared to the batch update (0.138531). Check your callbacks.
 3/10 [========>.....................] - ETA: 9s - loss: 0.6167 - accuracy: 0.7000 - categorical_accuracy: 0.7000 - precision: 0.7000 - recall: 0.7000
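
The log above shows a profiler session starting, CUPTI being loaded, and a warning about a slow `on_train_batch_end` callback shortly before training stalls. Whether that is related to the hang is not established anywhere in this thread. As a cheap experiment, and assuming the project uses a `tf.keras.callbacks.TensorBoard` callback (the training code is not shown, so this is a guess), per-batch profiling can be switched off with `profile_batch=0`. `make_tensorboard_callback` and the `logs` directory are hypothetical names.

```python
def make_tensorboard_callback(log_dir="logs"):
    """Build a TensorBoard callback with batch profiling disabled.

    Assumption: the project trains with a Keras TensorBoard callback.
    In TF 2.0 that callback profiles the second batch by default;
    profile_batch=0 turns profiling off.
    """
    # Imported lazily so this module still loads where TensorFlow is absent.
    import tensorflow as tf

    return tf.keras.callbacks.TensorBoard(log_dir=log_dir, profile_batch=0)


# Usage sketch (model and data are placeholders):
# model.fit(train_data, epochs=40, callbacks=[make_tensorboard_callback()])
```

If training then runs past the third step, the profiler is worth a closer look; if it still hangs, this rules one variable out.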

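【Comments】:

It seems the best support is on Linux. There is no fix for this yet.

Tags: python-3.x tensorflow tensorflow2.0

【Solution 1】:

I was affected by the same problem. In my case, the driver was the culprit.

I first tried tensorflow-gpu with CUDA 10 and the latest NVIDIA driver, but training got stuck at random steps, and the only thing I saw was the ptxas message you show.

Next I changed the TensorFlow version from 2.0 to 1.15 and 1.14 and tried different Python versions; none of that helped.

After I uninstalled the driver and installed an older one (432.00), the problem went away, although I still see the ptxas warning.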
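
To compare the installed driver against a version such as 432.00, the driver version can be read from `nvidia-smi`. This is a minimal sketch, assuming `nvidia-smi` is on the PATH (on Windows it may not be); `nvidia_driver_version` is a hypothetical helper name. It returns `None` rather than raising when the tool is missing or fails.

```python
import shutil
import subprocess


def nvidia_driver_version():
    """Return the driver version reported by nvidia-smi, or None."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None  # nvidia-smi is not on the PATH
    result = subprocess.run(
        [exe, "--query-gpu=driver_version", "--format=csv,noheader"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # Python 3.6 spelling of text=True
    )
    if result.returncode != 0:
        return None
    lines = result.stdout.strip().splitlines()
    # One line is printed per GPU; the driver is the same for all of them.
    return lines[0].strip() if lines else None


if __name__ == "__main__":
    print("NVIDIA driver:", nvidia_driver_version())
```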
