【问题标题】:Keras model uses only one GPU all the timeKeras 模型一直只使用一个 GPU
【发布时间】:2019-05-05 21:09:54
【问题描述】:

我正在尝试在具有 8 个 GPU 的 AWS EC2 p3.16xlarge 实例上训练 CNN 模型。当我使用 500 的批大小时,即使系统有 8 个 GPU,但始终只使用一个 GPU。当我将批量大小增加到 1000 时,它只使用 GPU,并且与 500 的情况相比确实变慢了。如果我将批处理大小增加到 2000,则会发生内存溢出。我该如何解决这个问题?

我正在使用 tensorflow 后端。 GPU利用率如下,

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:17.0 Off |                    0 |
| N/A   47C    P0    69W / 300W |  15646MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:18.0 Off |                    0 |
| N/A   44C    P0    59W / 300W |    502MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:00:19.0 Off |                    0 |
| N/A   45C    P0    61W / 300W |    502MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:00:1A.0 Off |                    0 |
| N/A   47C    P0    64W / 300W |    502MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:00:1B.0 Off |                    0 |
| N/A   48C    P0    62W / 300W |    502MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:00:1C.0 Off |                    0 |
| N/A   46C    P0    61W / 300W |    502MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:00:1D.0 Off |                    0 |
| N/A   46C    P0    65W / 300W |    502MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   46C    P0    63W / 300W |    502MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     15745      C   python3                                    15635MiB |
|    1     15745      C   python3                                      491MiB |
|    2     15745      C   python3                                      491MiB |
|    3     15745      C   python3                                      491MiB |
|    4     15745      C   python3                                      491MiB |
|    5     15745      C   python3                                      491MiB |
|    6     15745      C   python3                                      491MiB |
|    7     15745      C   python3                                      491MiB |
+-----------------------------------------------------------------------------+

【问题讨论】:

    标签: python tensorflow amazon-ec2 keras


    【解决方案1】:

    您可能正在寻找multiple_gpu_model。你可以在keras documentation看到。

    你可以拿你的模型做parallel_model = multi_gpu_model(model, gpus=n_gpus)

    下次别忘了添加minimal working exemple

    【讨论】:

      猜你喜欢
      • 2018-12-31
      • 2018-12-30
      • 1970-01-01
      • 2018-05-20
      • 2018-04-02
      • 2021-08-30
      • 2018-12-14
      • 2023-02-04
      • 2020-08-25
      相关资源
      最近更新 更多