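std::unordered_map does not release memory (answers)

【Question title】: std::unordered_map does not release memory
【Posted】: 2017-01-11 20:01:30
【Question】:

I am observing odd behavior of std::unordered_map in MSVC14 (VS2015). Consider the following scenario. I create an unordered map and fill it with dummy structs that consume a considerable amount of memory, say 1 GB in total, with 100k elements inserted. Then I start deleting elements from the map. Say I have deleted half of the elements; I would expect half of the memory to be freed. Right? Wrong! I see memory being released only once the number of elements in the map drops below some threshold, which in my case was 1443 elements.
One might say this is a malloc optimization: it allocates large chunks from the OS using VirtualAllocEx or a similar call and does not actually free memory back to the system, because the optimization dictates the policy and HeapFree may not be called so that already allocated memory can be reused later.
To rule this out I used a custom allocator with allocate_shared; it did not do the trick. So the main question is: why does this happen, and what can be done to "compact" the memory used by unordered_map?
The code: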

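#include <windows.h>
#include <chrono>      // std::chrono::system_clock, used by printContainerInfo
#include <memory>
#include <string>      // std::string members of Entity
#include <vector>
#include <map>
#include <unordered_map>
#include <random>
#include <thread>
#include <iostream>
#include <allocators>  // MSVC-specific extension header (stdext::allocators), not standard C++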

HANDLE heap = HeapCreate(0, 0, 0);
template <class Tp>
struct SimpleAllocator
{
    typedef Tp value_type;
    SimpleAllocator() noexcept
    {}
    template <typename U>
    SimpleAllocator(const SimpleAllocator<U>& other) noexcept
    {}
    Tp* allocate(std::size_t n)
    {
        return static_cast<Tp*>(HeapAlloc(heap, 0, n * sizeof(Tp)));
    }
    void deallocate(Tp* p, std::size_t n)
    {
        HeapFree(heap, 0, p);
    }
};
template <class T, class U>
bool operator==(const SimpleAllocator<T>&, const SimpleAllocator<U>&)
{
    return true;
}
template <class T, class U>
bool operator!=(const SimpleAllocator<T>& a, const SimpleAllocator<U>& b)
{
    return !(a == b);
}

struct Entity
{
    Entity()
    {
        // std::string(count, ch) builds a string of `count` copies of `ch`
        _6 = std::string(dis(gen), 'a');
        _7 = std::string(dis(gen), 'b');
        for(size_t i = 0; i < dis(gen); ++i)
        {
            _9.emplace(i, std::string(dis(gen), 'c'));
        }
    }
    int _1 = 1;
    int _2 = 2;
    double _3 = 3;
    double _4 = 5;
    float _5 = 3.14f;
    std::string _6 = "hello world!";
    std::string _7 = "A quick brown fox jumps over the lazy dog.";
    std::vector<unsigned long long> _8 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0};
    std::map<long long, std::string> _9 = {{0, "a"},{1, "b"},{2, "c"},{3, "d"},{4, "e"},
    {5, "f"},{6, "g"},{7, "h"},{8, "e"},{9, "j"}};
    std::vector<double> _10{1000, 3.14};
    std::random_device rd;
    std::mt19937 gen = std::mt19937(rd());
    std::uniform_int_distribution<size_t> dis = std::uniform_int_distribution<size_t>(16, 256);
};

using Container = std::unordered_map<long long, std::shared_ptr<Entity>>;

void printContainerInfo(std::shared_ptr<Container> container)
{
    std::cout << std::chrono::system_clock::to_time_t(std::chrono::system_clock::now())
        << ", Size: " << container->size() << ", Bucket count: " << container->bucket_count()
        << ", Load factor: " << container->load_factor() << ", Max load factor: " << container->max_load_factor()
        << std::endl;
}

int main()
{
    constexpr size_t maxEntites = 100'000;
    constexpr size_t ps = 10'000;
    stdext::allocators::allocator_chunklist<Entity> _allocator;
    std::shared_ptr<Container> test = std::make_shared<Container>();
    test->reserve(maxEntites);

    for(size_t i = 0; i < maxEntites; ++i)
    {
        test->emplace(i, std::make_shared<Entity>());
    }

    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<size_t> dis(0, maxEntites);
    size_t cycles = 0;
    while(test->size() > 0)
    {
        size_t counter = 0;
        std::cout << "Press any key..." << std::endl;
        std::cin.get();
        while(test->size() > 1443)
        {
            test->erase(dis(gen));
        }
        printContainerInfo(test);
        std::cout << "Press any key..." << std::endl;
        std::cin.get();
        std::cout << std::endl;
    }
    return 0;
}

Things I have tried so far: rehashing/resizing when the load factor drops below some threshold, by adding something like this inside the erasing while loop:

if(test->load_factor() < 0.2)
{
    test->max_load_factor(1 / test->load_factor());
    test->rehash(test->size());
    test->reserve(test->size());
    printContainerInfo(test);
    test->max_load_factor(1);
    test->rehash(test->size());
    test->reserve(test->size());
}

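When that did not help, I tried something silly: create a temporary container, copy/move the remaining entries into it, clear the original, then copy/move the entries back from the temporary into the original. Something like this: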

if(test->load_factor() < 0.2)
{
    Container tmp;
    std::copy(test->begin(), test->end(), std::inserter(tmp, tmp.begin()));
    test->clear();
    test.reset();
    test = std::make_shared<Container>();
    std::copy(tmp.begin(), tmp.end(), std::inserter(*test, test->begin()));
}
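
The copy-out/copy-back idea can be written as a single rebuild-and-swap. The sketch below is only an illustration, not a fix confirmed by this thread (the question reports that this approach did not lower the process footprint on MSVC). `compactByRebuild`, `BucketCounts` and `rebuildDemo` are hypothetical names, and `int` stands in for `Entity`. It shows the container-level effect only: a freshly built map sizes its bucket array for the elements it actually holds.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <unordered_map>

// Hypothetical helper: rebuild the map from its own elements and swap it in.
template <class Map>
void compactByRebuild(Map& map)
{
    Map fresh(map.begin(), map.end()); // copies the elements into a new map with its own bucket array
    map.swap(fresh);                   // the old nodes and old bucket array are destroyed with `fresh`
}

struct BucketCounts
{
    std::size_t before;
    std::size_t after;
};

// Fills a map, erases 99% of it, rebuilds it, and reports the bucket counts.
inline BucketCounts rebuildDemo()
{
    std::unordered_map<long long, std::shared_ptr<int>> map;
    for (long long i = 0; i < 10000; ++i) {
        map.emplace(i, std::make_shared<int>(static_cast<int>(i)));
    }
    for (long long i = 0; i < 9900; ++i) {
        map.erase(i);
    }
    const std::size_t before = map.bucket_count();
    compactByRebuild(map);
    assert(map.size() == 100); // the surviving elements are preserved
    return {before, map.bucket_count()};
}
```

Copying is cheap here because the mapped values are shared_ptr; whether the freed bucket storage then goes back to the OS is still up to the heap.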

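Finally, I replaced the shared_ptr creation with allocate_shared and passed a SimpleAllocator instance to it.
In addition, I modified the STL code here and there, for example calling std::vector::shrink_to_fit on std::unordered_map's internal vector (the MSVC STL implementation of unordered_map is based on a list and a vector); that did not work either.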
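
One way to separate "what the container still holds" from "what the heap has returned to the OS" is a counting allocator. The following is a minimal sketch under stated assumptions: `CountingAllocator`, `liveBytes`, `CountedMap` and `countingDemo` are hypothetical names, the allocator forwards to `::operator new` rather than to a private Windows heap, and the map stores plain integers instead of `Entity`.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <new>
#include <unordered_map>
#include <utility>

// Bytes currently allocated through CountingAllocator.
inline std::size_t& liveBytes()
{
    static std::size_t bytes = 0;
    return bytes;
}

template <class T>
struct CountingAllocator
{
    using value_type = T;

    CountingAllocator() noexcept {}
    template <class U>
    CountingAllocator(const CountingAllocator<U>&) noexcept {}

    T* allocate(std::size_t n)
    {
        T* p = static_cast<T*>(::operator new(n * sizeof(T)));
        liveBytes() += n * sizeof(T); // count only after the allocation succeeded
        return p;
    }
    void deallocate(T* p, std::size_t n) noexcept
    {
        liveBytes() -= n * sizeof(T);
        ::operator delete(p);
    }
};

template <class T, class U>
bool operator==(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const CountingAllocator<T>&, const CountingAllocator<U>&) noexcept { return false; }

using CountedMap = std::unordered_map<long long, long long, std::hash<long long>,
                                      std::equal_to<long long>,
                                      CountingAllocator<std::pair<const long long, long long>>>;

struct ByteCounts
{
    std::size_t full;
    std::size_t afterErase;
    std::size_t afterDestroy;
};

// Records the live byte count when full, after erasing 90%, and after destruction.
inline ByteCounts countingDemo()
{
    ByteCounts counts{};
    {
        CountedMap map;
        for (long long i = 0; i < 1000; ++i) map.emplace(i, i);
        counts.full = liveBytes();
        for (long long i = 0; i < 900; ++i) map.erase(i);
        counts.afterErase = liveBytes();
    }
    counts.afterDestroy = liveBytes();
    return counts;
}
```

If `afterErase` drops but the process commit size does not, the container has handed the memory back and the remainder is heap policy.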

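EDIT001: For all the non-believers. The following code does roughly the same as the previous code, but uses a std::vector (of std::shared_ptr<Entity>) instead of unordered_map. Here the memory is reclaimed by the OS.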

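#include <chrono>   // std::chrono::system_clock, std::chrono::seconds
#include <memory>
#include <string>   // std::string members of Entity
#include <vector>
#include <map>
#include <random>
#include <thread>
#include <iostream>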

struct Entity
{
    Entity()
    {
        // std::string(count, ch) builds a string of `count` copies of `ch`
        _6 = std::string(dis(gen), 'a');
        _7 = std::string(dis(gen), 'b');
        for(size_t i = 0; i < dis(gen); ++i)
        {
            _9.emplace(i, std::string(dis(gen), 'c'));
        }
    }
    int _1 = 1;
    int _2 = 2;
    double _3 = 3;
    double _4 = 5;
    float _5 = 3.14f;
    std::string _6 = "hello world!";
    std::string _7 = "A quick brown fox jumps over the lazy dog.";
    std::vector<unsigned long long> _8 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0};
    std::map<long long, std::string> _9 = {{0, "a"}, {1, "b"}, {2, "c"}, {3, "d"}, {4, "e"},
                                           {5, "f"}, {6, "g"}, {7, "h"}, {8, "e"}, {9, "j"}};
    std::vector<double> _10{1000, 3.14};
    std::random_device rd;
    std::mt19937 gen = std::mt19937(rd());
    std::uniform_int_distribution<size_t> dis = std::uniform_int_distribution<size_t>(16, 256);
};

using Container = std::vector<std::shared_ptr<Entity>>;

void printContainerInfo(std::shared_ptr<Container> container)
{
    std::cout << std::chrono::system_clock::to_time_t(std::chrono::system_clock::now())
              << ", Size: " << container->size() << ", Capacity: " << container->capacity() << std::endl;
}

int main()
{
    constexpr size_t maxEntites = 100'000;
    constexpr size_t ps = 10'000;
    std::shared_ptr<Container> test = std::make_shared<Container>();
    test->reserve(maxEntites);

    for(size_t i = 0; i < maxEntites; ++i)
    {
        test->emplace_back(std::make_shared<Entity>());
    }

    std::random_device rd;
    std::mt19937 gen(rd());
    size_t cycles = 0;
    while(test->size() > 0)
    {
        std::uniform_int_distribution<size_t> dis(0, test->size());
        size_t counter = 0;
        while(test->size() > 0 && counter < ps)
        {
            test->pop_back();
            ++counter;
        }
        ++cycles;
        if(cycles % 7 == 0)
        {
            std::cout << "Inflating..." << std::endl;
            while(test->size() < maxEntites)
            {
                test->emplace_back(std::make_shared<Entity>());
            }
        }
        std::this_thread::sleep_for(std::chrono::seconds(1));
        printContainerInfo(test);
        std::cout << std::endl;
    }
    return 0;
}

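【Comments】:

How do you know that the memory is not being released?
By looking at the commit size in Task Manager, or at "Total memory" in Sysinternals RAMMap.
@kreuzerkrieg Freed memory is not actually returned to the OS from a running process. You will not be able to see it in Task Manager. Use a tool like valgrind to detect memory leaks.
a) There is no memory leak. b) So, as you say, if I replace the unordered_map with a vector<Entity> I will see the same behavior? Wrong! The memory is released to the OS immediately. Moreover, if your claim were true, clear() would not release the memory either, but it does. And why is the memory reclaimed by the OS once the number of items in the unordered_map drops below ~1400?
@πάντα ῥεῖ, why is this a duplicate of a question that asks about Linux and glibc, of all things?

Tags: c++ visual-c++ memory-management stl unordered-map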

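【Answer 1】:

You are correct, but only partially.

The way VC++ implements the C++ unordered_map is with an internal std::vector, which is the bucket list, and a std::list, which holds the nodes of the map.

In a diagram, it looks like this: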

buckets : [][][*][][][][][][*][][][][][][*]
               |            |            |
               |            |            | 
             ---             ------      |
             |                    |      |
             V                    V      V
elements: [1,3]->[5,7]->[7,1]->[8,11]->[10,3]->[-1,2]

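Now, when you erase nodes, they really are removed from the list, but that says nothing about the buckets array. The buckets array is resized only after some threshold is reached (either too many elements per bucket, or too many buckets for the number of elements).

To prove my point, here is an example compiled with the latest VC++: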

std::unordered_map<int, std::vector<char>> map;
for (auto i = 0; i < 1000; i++) {
    map.emplace(i, std::vector<char>(10000));
}

for (auto i = 0; i < 900; i++) {
    map.erase(i);
}

Looking at the raw view in the debugger, we see:

+       _List   { size=100 }    std::list<std::pair<int const ,std::vector<char,std::allocator<char> > >,std::allocator<std::pair<int const ,std::vector<char,std::allocator<char> > > > >
+       _Vec    { size=2048 }   std::vector<std::_List_unchecked_iterator<std::_List_val<std::_List_simple_types<std::pair<int const ,std::vector<char,std::allocator<char> > > > > >,std::_Wrap_alloc<std::allocator<std::_List_unchecked_iterator<std::_List_val<std::_List_simple_types<std::pair<int const ,std::vector<char,std::allocator<char> > > > > > > > >

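This means that although we have only 100 elements, the map has kept 2048 buckets.

So not all of the memory is released when you delete elements. The map maintains another piece of memory to keep track of the buckets themselves, and that memory is more stubborn than the element memory.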
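
The same effect can be observed without a debugger through the public `bucket_count()` member. A minimal sketch, assuming nothing beyond the standard library; `bucketsBeforeAndAfterErase` is a hypothetical helper name, and the element size of 100 bytes is arbitrary:

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns {bucket_count before erasing, bucket_count after erasing}.
inline std::pair<std::size_t, std::size_t> bucketsBeforeAndAfterErase(int total, int erased)
{
    std::unordered_map<int, std::vector<char>> map;
    for (int i = 0; i < total; ++i) {
        map.emplace(i, std::vector<char>(100));
    }
    const std::size_t before = map.bucket_count();
    for (int i = 0; i < erased; ++i) {
        map.erase(i); // frees the node; does not rehash
    }
    return {before, map.bucket_count()};
}
```

Calling `bucketsBeforeAndAfterErase(1000, 900)` should return two equal values, since erase leaves the bucket array alone.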

EDIT:
Let's go even wilder!

std::unordered_map<int, std::vector<char>> map;
for (auto i = 0; i < 100'000; i++) {
    map.emplace(i, std::vector<char>(10000));
}

for (auto i = 0; i < 90'000; i++) {
    map.erase(i);
}

The results at the end of the erasing loop:

+       _List   { size=10000 }  std::list<std::pair<int const ,std::vector<char,std::allocator<char> > >,std::allocator<std::pair<int const ,std::vector<char,std::allocator<char> > > > >
+       _Vec    { size=262144 } std::vector<std::_List_unchecked_iterator<std::_List_val<std::_List_simple_types<std::pair<int const ,std::vector<char,std::allocator<char> > > > > >,std::_Wrap_alloc<std::allocator<std::_List_unchecked_iterator<std::_List_val<std::_List_simple_types<std::pair<int const ,std::vector<char,std::allocator<char> > > > > > > > >

Now, on 64-bit, the size of std::_List_unchecked_iterator<...> is 8 bytes. We have 262144 of them, so we are holding 262144*8/(1024*1024) = 2 MB of almost unused data. This is the high memory usage you are seeing.
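
The arithmetic can be checked with a tiny helper. This is only a sketch: `bucketArrayMiB` is a hypothetical name, and the 8 bytes per bucket is the figure quoted above for this MSVC build, not a general constant.

```cpp
#include <cassert> // for the checks in the usage note
#include <cstddef>

// Size in MiB of a bucket array holding `bucketCount` entries of `bytesPerBucket` each.
constexpr double bucketArrayMiB(std::size_t bucketCount, std::size_t bytesPerBucket)
{
    return static_cast<double>(bucketCount * bytesPerBucket) / (1024.0 * 1024.0);
}

static_assert(bucketArrayMiB(262144, 8) == 2.0, "262144 buckets of 8 bytes occupy 2 MiB");
```

By the same formula, 32768 buckets of 8 bytes come to 0.25 MiB, e.g. `assert(bucketArrayMiB(32768, 8) == 0.25);`.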

Calling map.rehash(1024*10) after all the excess nodes have been removed seems to help with the memory consumption:

+       _List   { size=10000 }  std::list<std::pair<int const ,std::vector<char,std::allocator<char> > >,std::allocator<std::pair<int const ,std::vector<char,std::allocator<char> > > > >
+       _Vec    { size=32768 }  std::vector<std::_List_unchecked_iterator<std::_List_val<std::_List_simple_types<std::pair<int const ,std::vector<char,std::allocator<char> > > > > >,std::_Wrap_alloc<std::allocator<std::_List_unchecked_iterator<std::_List_val<std::_List_simple_types<std::pair<int const ,std::vector<char,std::allocator<char> > > > > > > > >

This is the solution you were looking for.
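
A hedged sketch of the rehash step in portable form follows. It uses `rehash(0)` instead of the explicit `rehash(1024*10)` above, which asks the implementation for the smallest bucket count it allows for the current size; the standard only guarantees a lower bound, so how far the count actually drops is implementation-specific. `shrinkBuckets`, `RehashResult` and `rehashDemo` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Asks the map to reduce its bucket count and returns the new count.
template <class Map>
std::size_t shrinkBuckets(Map& map)
{
    map.rehash(0); // the result still satisfies bucket_count() >= size() / max_load_factor()
    return map.bucket_count();
}

struct RehashResult
{
    std::size_t size;
    std::size_t bucketsBefore;
    std::size_t bucketsAfter;
};

// Mirrors the 100'000 / 90'000 experiment with small mapped values.
inline RehashResult rehashDemo()
{
    std::unordered_map<int, std::vector<char>> map;
    for (int i = 0; i < 100000; ++i) {
        map.emplace(i, std::vector<char>(16));
    }
    for (int i = 0; i < 90000; ++i) {
        map.erase(i);
    }
    const std::size_t before = map.bucket_count();
    const std::size_t after = shrinkBuckets(map);
    return {map.size(), before, after};
}
```

This only shows the bucket count; whether the bucket storage itself is released is a separate question for the implementation and the heap.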

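(PS: I have been doing a lot of .NET lately, against my will. This question really shows off the good side of C++: we can step into the standard library code with a debugger, see exactly how and when things happen, and then act on that. Doing something like this in .NET, if it is possible at all, would be a living hell.)

【Comments】:

David, as I stated in the question, I modified the STL unordered_map implementation to call shrink_to_fit on the bucket vector, and it did not work. Other than that, your post is 100% correct in describing the unordered_map implementation.
@kreuzerkrieg shrink_to_fit is irrelevant to your problem, because the buckets vector really is using that memory. shrink_to_fit removes excess capacity at the end of the vector. In our example, where the vector has 2048 buckets for only 100 elements, shrink_to_fit will not help.
It is all about shrink_to_fit; you are mistaken in thinking that rehashing changes anything. It changes the size, not the capacity. Put a breakpoint in the rehash function and then in _Init(_Newsize); you will see that it just resizes and reserves the vector, neither of which affects the vector's internal capacity unless the new size is larger than what was previously allocated. So you are actually wasting all those pointers that fill up the capacity.
Besides, you saved 2 MB... Have you looked at the memory footprint of the application? It should be ~1 GB, right? How much memory does it take after rehashing? 1 GB - 2 MB? Not happy :)
@kreuzerkrieg 10000*10000/(1024*1024) = 95 MB. Also, if your elements are still alive, you cannot expect their memory to be invisible. Also, the buckets array multiplies itself by 2 each time; I can only guess that if you put in a very large number of elements (a few million), the memory consumption becomes quite significant.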

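【Answer 2】:

OK, after opening a premium support ticket with Microsoft, I got the following answer. Most of it we already knew, but there is a part we had not considered.

In Windows, heap memory is allocated in pages.

There is no caching whatsoever in the STL; we call RtlHeapFree directly after you call erase.

What you are seeing is how Windows manages the heap.

Once you mark something for deletion, it may not be returned to the OS as long as there is no memory pressure; the heap may decide that the cost of reallocating memory in the future is higher than that of keeping it in the process.

This is how any heap algorithm works.

Another thing to consider: if the values you are deleting happen to be spread across pages, then unless all the values inside a page are freed, that page will stay resident in memory.

If you are very particular about reducing your private bytes immediately, you may have to write your own memory manager rather than depend on Windows heap handles.

Emphasis is mine. I guess this answers the question, or the answer is as simple as "this is how Windows heap management works". In any case, there is no (simple) solution to this problem. It may be better to use something like boost::intrusive containers, which in theory should provide better locality, so that the Windows memory manager has a better chance of returning memory to the OS.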
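
The page-granularity point in the quoted reply can be illustrated with a toy model. This is a sketch under loud assumptions: it is not how the Windows heap is implemented, `countFullyFreePages` is a hypothetical helper, and allocations are modeled as fixed-size slots packed `slotsPerPage` to a page.

```cpp
#include <algorithm>
#include <cassert> // for the checks in the usage note
#include <cstddef>
#include <vector>

// Counts pages in which every slot is free. live[i] is true while slot i is in use.
inline std::size_t countFullyFreePages(const std::vector<bool>& live, std::size_t slotsPerPage)
{
    if (slotsPerPage == 0) {
        return 0; // avoid a zero-step loop below
    }
    std::size_t freePages = 0;
    for (std::size_t start = 0; start < live.size(); start += slotsPerPage) {
        const std::size_t end = std::min(start + slotsPerPage, live.size());
        bool anyLive = false;
        for (std::size_t i = start; i < end; ++i) {
            if (live[i]) {
                anyLive = true;
                break;
            }
        }
        if (!anyLive) {
            ++freePages; // only a page with no live slot could be handed back
        }
    }
    return freePages;
}
```

With 8 slots at 4 per page, freeing every other slot (`{true, false, true, false, true, false, true, false}`) releases half of the elements but leaves 0 fully free pages, while freeing the first 4 slots leaves 1.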

UPDATE001: Boost intrusive containers did not do the trick either.

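#include <boost/intrusive/unordered_set.hpp>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <map>
#include <random>
#include <string>
#include <vector>

struct Entity : public boost::intrusive::unordered_set_base_hook<>
{
    explicit Entity(size_t id)
    {
        first = id;
        // std::string(count, ch) builds a string of `count` copies of `ch`
        _6 = std::string(dis(gen), 'a');
        _7 = std::string(dis(gen), 'b');
        for(size_t i = 0; i < dis(gen); ++i)
        {
            _9.emplace(i, std::string(dis(gen), 'c'));
        }
    }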

    size_t first = 1;
    int _1 = 1;
    int _2 = 2;
    float _5 = 3.14f;
    double _3 = 3;
    double _4 = 5;
    std::string _6 = "hello world!";
    std::string _7 = "A quick brown fox jumps over the lazy dog.";
    std::vector<unsigned long long> _8 = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0};
    std::map<long long, std::string> _9 = {{0, "a"}, {1, "b"}, {2, "c"}, {3, "d"}, {4, "e"},
                                           {5, "f"}, {6, "g"}, {7, "h"}, {8, "e"}, {9, "j"}};
    std::vector<double> _10{1000, 3.14};
    std::random_device rd;
    std::mt19937 gen = std::mt19937(rd());
    std::uniform_int_distribution<size_t> dis = std::uniform_int_distribution<size_t>(16, 256);
};

struct first_is_key
{
    typedef size_t type;

    const type& operator()(const Entity& v) const { return v.first; }
};

using Container = boost::intrusive::unordered_set<Entity, boost::intrusive::key_of_value<first_is_key>>;

void printContainerInfo(const Container& container)
{
    std::cout << std::chrono::system_clock::to_time_t(std::chrono::system_clock::now())
              << ", Size: " << container.size() << ", Bucket count: " << container.bucket_count() << std::endl;
}

int main()
{
    constexpr size_t maxEntites = 100'000;
    Container::bucket_type* base_buckets = new Container::bucket_type[maxEntites];
    Container test(Container::bucket_traits(base_buckets, maxEntites));

    std::random_device rd;
    std::mt19937 gen(rd());
    std::uniform_int_distribution<size_t> dis;

    while(test.size() < maxEntites)
    {
        auto data = new Entity(dis(gen));
        auto res = test.insert(*data);
        if(!res.second)
        {
            delete data;
        }
    }

    printContainerInfo(test);
    while(test.size() > 0)
    {
        while(test.size() > maxEntites * 2 / 3)
        {
            test.erase_and_dispose(test.begin(), [](Entity* entity)
                                   {
                                       delete entity;
                                   });
        }

        printContainerInfo(test);
        while(test.size() < maxEntites)
        {
            auto data = new Entity(dis(gen));
            auto res = test.insert(*data);
            if(!res.second)
            {
                delete data;
            }
        }
    }
    return 0;
}

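【Comments】:

【Answer 3】:

Say you have deleted half of the elements; then you expect half of the memory to be freed. Right?

Actually, no. I expect the memory allocator to be written with the execution efficiency of my program in mind. I expect it to allocate more memory than it needs, and to release that memory back to the OS only when ordered to or when it is sure the memory will never be needed again.

I would expect memory blocks to be reused in user space as often as possible, and to be allocated in contiguous blocks.

For most applications, a pedantic memory allocator that allocated memory from the OS and returned it the moment objects were destroyed would result in an extremely slow program and a great deal of disk thrashing. It would also (in practice) mean that on all popular operating systems even the smallest 40-byte string would be allocated its own 4k page, since the Intel chipset can only handle protected memory in pages of this size (or maybe larger on some systems?).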
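
To put a number on the "40-byte string in its own 4k page" scenario, here is a small sketch. The 4096-byte page size is the assumption stated in this answer, and `kPageSize` and `roundUpToPage` are hypothetical names.

```cpp
#include <cassert> // for the checks in the usage note
#include <cstddef>

constexpr std::size_t kPageSize = 4096; // assumption: 4 KiB pages

// Bytes consumed if every allocation had to occupy whole pages.
constexpr std::size_t roundUpToPage(std::size_t bytes)
{
    return (bytes + kPageSize - 1) / kPageSize * kPageSize;
}

static_assert(roundUpToPage(40) == 4096, "a 40-byte string would occupy a full page");
```

Under that model, 100'000 such strings would take `100000 * roundUpToPage(40)` = 409,600,000 bytes instead of 4,000,000, roughly a hundredfold overhead.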

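【Comments】:

Correct, that is how it should work, but you still want a server application that runs 24/7 to release memory from time to time, since the working data set is not growing and the process cannot grow forever. AFAIR, malloc in MSVC has this magic of allocating chunks and managing sub-chunks of memory, which contradicts what I observe with the custom allocator that uses direct OS calls to allocate/free memory; it was written to check whether an inefficient malloc was to blame.
@kreuzerkrieg There is no real need to release memory back to the OS in a memory-managed system. What you are really allocating is virtual pages. When not needed, they will be swapped out by the OS. The only real resource you are consuming is disk space.
And then you get page faults and your latency skyrockets, which is not what you want in a server application. OS matters aside, if you use a vector instead of an unordered_map and start popping elements from it, why does the memory get released back to the OS?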