【Question title】: How to make sure PyTorch has deallocated GPU memory?
【Posted】: 2020-11-18 13:53:46
【Question description】:

Suppose we have a function like this:

        def trn_l(totall_lc, totall_lw, totall_li, totall_lr):
            self.model_large.cuda()
            self.model_large.train()
            self.optimizer_large.zero_grad()

            for fb in range(self.fake_batch):
                val_x, val_y = next(self.valid_loader)
                val_x, val_y = val_x.cuda(), val_y.cuda()

                logits_main, emsemble_logits_main = self.model_large(val_x)
                cel = self.criterion(logits_main, val_y)
                loss_weight = cel / (self.fake_batch)
                loss_weight.backward(retain_graph=False)
                cel = cel.cpu().detach()
                emsemble_logits_main = emsemble_logits_main.cpu().detach()

                totall_lw += float(loss_weight.item())
                val_x = val_x.cpu().detach() 
                val_y = val_y.cpu().detach()

            loss_weight = loss_weight.cpu().detach()
            self._clip_grad_norm(self.model_large)
            self.optimizer_large.step()
            self.model_large.train(mode=False)
            self.model_large = self.model_large.cpu()
            return totall_lc, totall_lw, totall_li, totall_lr

On the first call, it allocates 8 GB of GPU memory. On subsequent calls, no new memory is allocated, yet the 8 GB remains occupied. I would like 0 GPU memory (or as little as possible) to remain allocated after the function has been called and produced its first result.

What I have tried: using retain_graph=False and .cpu().detach() everywhere - no positive effect.
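For reference, snapshots like the ones below can be printed with `torch.cuda.memory_summary()`; a minimal sketch, guarded so it is a no-op on CPU-only machines:

```python
import torch

# Print the caching allocator's state for device 0
# (same table format as the snapshots shown here).
if torch.cuda.is_available():
    print(torch.cuda.memory_summary(device=0, abbreviated=False))
```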

Memory snapshot before:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   33100 KB |   33219 KB |   40555 KB |    7455 KB |
|       from large pool |    3072 KB |    3072 KB |    3072 KB |       0 KB |
|       from small pool |   30028 KB |   30147 KB |   37483 KB |    7455 KB |
|---------------------------------------------------------------------------|
| Active memory         |   33100 KB |   33219 KB |   40555 KB |    7455 KB |
|       from large pool |    3072 KB |    3072 KB |    3072 KB |       0 KB |
|       from small pool |   30028 KB |   30147 KB |   37483 KB |    7455 KB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |   51200 KB |   51200 KB |   51200 KB |       0 B  |
|       from large pool |   20480 KB |   20480 KB |   20480 KB |       0 B  |
|       from small pool |   30720 KB |   30720 KB |   30720 KB |       0 B  |
|---------------------------------------------------------------------------|
| Non-releasable memory |   18100 KB |   20926 KB |   56892 KB |   38792 KB |
|       from large pool |   17408 KB |   18944 KB |   18944 KB |    1536 KB |
|       from small pool |     692 KB |    2047 KB |   37948 KB |   37256 KB |
|---------------------------------------------------------------------------|
| Allocations           |   12281    |   12414    |   12912    |     631    |
|       from large pool |       2    |       2    |       2    |       0    |
|       from small pool |   12279    |   12412    |   12910    |     631    |
|---------------------------------------------------------------------------|
| Active allocs         |   12281    |   12414    |   12912    |     631    |
|       from large pool |       2    |       2    |       2    |       0    |
|       from small pool |   12279    |   12412    |   12910    |     631    |
|---------------------------------------------------------------------------|
| GPU reserved segments |      16    |      16    |      16    |       0    |
|       from large pool |       1    |       1    |       1    |       0    |
|       from small pool |      15    |      15    |      15    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |       3    |      30    |     262    |     259    |
|       from large pool |       1    |       1    |       1    |       0    |
|       from small pool |       2    |      29    |     261    |     259    |
|===========================================================================|

After calling the function, followed by

torch.cuda.empty_cache()
torch.cuda.synchronize()

we get:

|===========================================================================|
|                  PyTorch CUDA memory summary, device ID 0                 |
|---------------------------------------------------------------------------|
|            CUDA OOMs: 0            |        cudaMalloc retries: 0         |
|===========================================================================|
|        Metric         | Cur Usage  | Peak Usage | Tot Alloc  | Tot Freed  |
|---------------------------------------------------------------------------|
| Allocated memory      |   10957 KB |    8626 MB |  272815 MB |  272804 MB |
|       from large pool |       0 KB |    8596 MB |  272477 MB |  272477 MB |
|       from small pool |   10957 KB |      33 MB |     337 MB |     327 MB |
|---------------------------------------------------------------------------|
| Active memory         |   10957 KB |    8626 MB |  272815 MB |  272804 MB |
|       from large pool |       0 KB |    8596 MB |  272477 MB |  272477 MB |
|       from small pool |   10957 KB |      33 MB |     337 MB |     327 MB |
|---------------------------------------------------------------------------|
| GPU reserved memory   |    8818 MB |    9906 MB |   19618 MB |   10800 MB |
|       from large pool |    8784 MB |    9874 MB |   19584 MB |   10800 MB |
|       from small pool |      34 MB |      34 MB |      34 MB |       0 MB |
|---------------------------------------------------------------------------|
| Non-releasable memory |    5427 KB |    3850 MB |  207855 MB |  207850 MB |
|       from large pool |       0 KB |    3850 MB |  207494 MB |  207494 MB |
|       from small pool |    5427 KB |       5 MB |     360 MB |     355 MB |
|---------------------------------------------------------------------------|
| Allocations           |    3853    |   13391    |   34339    |   30486    |
|       from large pool |       0    |     557    |   12392    |   12392    |
|       from small pool |    3853    |   12838    |   21947    |   18094    |
|---------------------------------------------------------------------------|
| Active allocs         |    3853    |   13391    |   34339    |   30486    |
|       from large pool |       0    |     557    |   12392    |   12392    |
|       from small pool |    3853    |   12838    |   21947    |   18094    |
|---------------------------------------------------------------------------|
| GPU reserved segments |     226    |     226    |     410    |     184    |
|       from large pool |     209    |     209    |     393    |     184    |
|       from small pool |      17    |      17    |      17    |       0    |
|---------------------------------------------------------------------------|
| Non-releasable allocs |      46    |     358    |   12284    |   12238    |
|       from large pool |       0    |     212    |    7845    |    7845    |
|       from small pool |      46    |     279    |    4439    |    4393    |
|===========================================================================|

【Question discussion】:

    Tags: python memory-management pytorch gpu allocation


    【Solution 1】:

    I don't think the other answer is correct. Allocation and deallocation definitely do happen at runtime; the caveat is that CPU code runs asynchronously with respect to GPU code, so if you want to measure the freed memory afterwards, you need to wait for any deallocations to actually happen. Consider this:

    import torch 
    
    a = torch.zeros(100,100,100).cuda()
    
    print(torch.cuda.memory_allocated())
    
    del a
    torch.cuda.synchronize()
    print(torch.cuda.memory_allocated())
    

    Output:

    4000256
    0
    

    So, you should `del` the tensors you no longer need and call torch.cuda.synchronize() to make sure the deallocation has gone through before your CPU code continues.

    In your specific case, after your function trn_l returns, the function's local variables that are not referenced anywhere else will be destroyed, along with the corresponding GPU tensors. All you need to do is call torch.cuda.synchronize() after the function call to wait for this to happen.
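    To illustrate, here is a minimal sketch (the function `alloc_and_return` is made up for illustration) showing that a function's local GPU tensors are released once the function returns, provided you synchronize before measuring:

```python
import torch

def alloc_and_return():
    # Local tensor: its only reference dies when the function returns,
    # which queues the corresponding GPU deallocation.
    x = torch.randn(1024, 1024, device="cuda")
    return float(x.sum())

if torch.cuda.is_available():
    before = torch.cuda.memory_allocated()
    alloc_and_return()
    torch.cuda.synchronize()  # wait for the queued deallocation to finish
    after = torch.cuda.memory_allocated()
    print(before, after)  # after returns to the pre-call level
```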

    【Discussion】:

      【Solution 2】:

      So, PyTorch does not allocate and free GPU memory while training.

      From https://pytorch.org/docs/stable/notes/faq.html#my-gpu-memory-isn-t-freed-properly:

      PyTorch uses a caching memory allocator to speed up memory allocations. As a result, the values shown in nvidia-smi usually don't reflect the true memory usage. See Memory management for more details about GPU memory management.

      If your GPU memory isn't freed even after Python quits, it is very likely that some Python subprocesses are still alive. You may find them via ps -elf | grep python and manually kill them with kill -9 [pid].

      You can call torch.cuda.empty_cache() to free all unused cached memory (however, this is not really good practice, since re-allocating that memory later is time-consuming). Docs for empty_cache(): https://pytorch.org/docs/stable/cuda.html#torch.cuda.empty_cache
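      A small sketch of the distinction this answer relies on: memory allocated to live tensors versus memory merely held by the caching allocator (reserved). `empty_cache()` only returns the latter to the driver:

```python
import torch

if torch.cuda.is_available():
    a = torch.zeros(1000, 1000, device="cuda")  # ~4 MB of float32
    del a
    torch.cuda.synchronize()

    print(torch.cuda.memory_allocated())  # 0: the tensor itself is freed
    print(torch.cuda.memory_reserved())   # > 0: block stays cached, still visible in nvidia-smi

    torch.cuda.empty_cache()              # hand unused cached blocks back to the driver
    print(torch.cuda.memory_reserved())   # drops, typically to 0 here
```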

      【Discussion】:

      • The question is about how much memory is freed after the function call, not when the process finishes.