Answers: TensorFlow 2.5 limit GPU memory usage

【Question title】: Tensorflow 2.5 limit GPU memory usage
【Posted】: 2021-08-23 08:21:02
【Question】:

I have a pipeline program that runs three inference processes in one go. However, the third process hits the error below.

RuntimeError: Error in virtual void* faiss::gpu::StandardGpuResourcesImpl::allocMemory(const faiss::gpu::AllocRequest&) at /__w/faiss-wheels/faiss-wheels/faiss/faiss/gpu/StandardGpuResources.cpp:452: Error: 'err == cudaSuccess' failed: StandardGpuResources: alloc fail type TemporaryMemoryBuffer dev 0 space Device stream 0x1e9eb170 size 1073741824 bytes (cudaMalloc error out of memory [2])

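I am using an RTX 3070 with 8 GB of VRAM. In more detail, the first two processes run inference with pretrained models, and the third runs a similarity search with FAISS. The error occurs in the third process, when I try to move the search index from the CPU to the GPU. I need to run the search on the GPU because my index holds millions of entries.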
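The allocation that fails in the traceback is FAISS's temporary scratch buffer (type TemporaryMemoryBuffer, 1073741824 bytes, i.e. 1 GiB), not necessarily the index itself. One possible mitigation, sketched here under the assumption that the faiss-gpu Python package is in use, is to request a smaller scratch buffer through StandardGpuResources.setTempMemory before moving the index. The helper names, the 256 MiB default, and `cpu_index` are illustrative and do not come from the original post.

```python
MIB = 1024 * 1024


def temp_buffer_bytes(mib):
    """Convert a scratch-buffer size in MiB to the byte count FAISS expects."""
    if mib < 0:
        raise ValueError("scratch-buffer size must be non-negative")
    return int(mib * MIB)


def move_index_to_gpu(cpu_index, temp_mib=256, device=0):
    """Copy a CPU index to the GPU using a reduced FAISS scratch buffer.

    Assumes faiss-gpu is installed. The import is done lazily so that
    temp_buffer_bytes() can be used without FAISS being present.
    """
    import faiss

    res = faiss.StandardGpuResources()
    # Ask for a smaller scratch buffer than the 1 GiB seen in the traceback.
    res.setTempMemory(temp_buffer_bytes(temp_mib))
    gpu_index = faiss.index_cpu_to_gpu(res, device, cpu_index)
    # Return `res` as well: it must stay alive while gpu_index is in use.
    return res, gpu_index
```

A smaller scratch buffer may slow down searches, and the index itself still has to fit in whatever GPU memory TensorFlow leaves free.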

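I understand that TensorFlow allocates the entire GPU memory when a process starts using the GPU. I tried calling set_memory_growth at the beginning of the program, but it still does not work.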

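import tensorflow as tf

# Must run before any GPU has been initialized.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  try:
    for gpu in gpus:
      # Allocate GPU memory on demand instead of reserving it all up front.
      tf.config.experimental.set_memory_growth(gpu, True)
  except RuntimeError as e:
    # Raised if memory growth is set after the GPUs have been initialized.
    print(e)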

Some answers suggest using per_process_gpu_memory_fraction, but that is no longer available in TF 2.5.
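For reference, the closest TF 2.x counterpart I am aware of is a hard per-device cap set with tf.config.set_logical_device_configuration and a memory_limit given in MB. The sketch below assumes a single-process pipeline and must run before TensorFlow initializes the GPU. The helper names and the example budget are hypothetical. As far as I know, the cap is meant to be used instead of set_memory_growth on the same device, not together with it.

```python
def tf_budget_mb(total_mb, reserved_mb):
    """Memory (MB) left for TensorFlow after reserving room for other GPU users."""
    if reserved_mb >= total_mb:
        raise ValueError("reserved memory must be smaller than total memory")
    return total_mb - reserved_mb


def cap_tensorflow_gpu(memory_limit_mb, gpu_index=0):
    """Cap TensorFlow's allocation on one GPU; returns True if the cap was set.

    TensorFlow is imported lazily so tf_budget_mb() works without it.
    """
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    if len(gpus) <= gpu_index:
        return False  # no such GPU visible to TensorFlow
    try:
        tf.config.set_logical_device_configuration(
            gpus[gpu_index],
            [tf.config.LogicalDeviceConfiguration(memory_limit=memory_limit_mb)],
        )
    except (RuntimeError, ValueError) as e:
        # RuntimeError: the GPU was already initialized.
        # ValueError: conflicting settings on the same device.
        print(e)
        return False
    return True
```

Called at the very start of the program, for example as cap_tensorflow_gpu(tf_budget_mb(8192, 4096)), this would leave roughly half of an 8 GB card for FAISS. The right split depends on the models and the index size, so these figures are only placeholders.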

I tracked memory usage with tf.config.experimental.get_memory_info('GPU:0'); the log is below.

Beginning of the program - {'current': 0, 'peak': 0}

After first inference - {'current': 281843712, 'peak': 2803776768}

After second inference - {'current': 281844480, 'peak': 2803776768}
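To put these byte counts next to the size FAISS asked for in the traceback, a small conversion to MiB helps. This is only arithmetic on the numbers already quoted. Note that get_memory_info reports what TensorFlow's allocator is using, and memory that TensorFlow has reserved from the driver may stay reserved even when 'current' is low.

```python
MIB = 1024 ** 2


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / MIB


# Values copied from the log above and from the FAISS error message.
readings = {"current": 281843712, "peak": 2803776768}
faiss_request = 1073741824

for name, value in readings.items():
    print(f"{name}: {to_mib(value):.1f} MiB")
print(f"FAISS request: {to_mib(faiss_request):.1f} MiB")
```

So TensorFlow currently uses about 269 MiB, peaked at about 2674 MiB, and the FAISS allocation that failed was exactly 1024 MiB.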

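Since the first two inference processes do not use all of the memory, can I free the allocated memory for the third process? Or can I prevent TF 2.5 from allocating the entire memory?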

【Comments】:

Tags: python-3.x tensorflow

【Solution 1】:

Batch size has a major effect on the amount of GPU memory any given model needs. Try reducing the batch size and see whether that helps.
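A minimal sketch of what that could look like, assuming the inference inputs fit in a Python sequence: feed the model in small chunks instead of all at once. `iter_batches`, `model`, and `inputs` are hypothetical names, not from the question.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Hypothetical usage; `model` and `inputs` are placeholders:
# outputs = [model(batch) for batch in iter_batches(inputs, 8)]
```

Smaller batches lower the peak activation memory per call at the cost of more calls, so this trades throughput for headroom.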

【Discussion】: