【Question title】: Does multithreading emphasize memory fragmentation?
【Posted】: 2011-08-18 01:45:52
【Question】:

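Description

When using openmp's parallel for construct to allocate and free randomly sized memory blocks with 4 or more threads, the program seems to start leaking considerable amounts of memory in the second half of the test program's runtime. It increases its consumed memory from 1050 MB to 1500 MB or more without actually making use of the extra memory.

As valgrind shows no issues, I must assume that what appears to be a memory leak is actually an emphasized effect of memory fragmentation.

Interestingly, the effect does not show yet if 2 threads make 10000 allocations each, but it shows strongly if 4 threads make 5000 allocations each. Also, if the maximum size of allocated chunks is reduced to 256kb (from 1mb), the effect gets weaker.

Can heavy concurrency emphasize fragmentation that much? Or is this more likely to be a bug in the heap?

Test Program Description

The demo program is built to obtain a total of 256 MB of randomly sized memory chunks from the heap, doing 5000 allocations. If the memory limit is hit, the chunks allocated first are deallocated until the memory consumption falls below the limit. Once 5000 allocations have been performed, all memory is released and the loop ends. All this work is done for each thread generated by openmp.

This memory allocation scheme allows us to expect a memory consumption of ~260 MB per thread (including some bookkeeping data).

Demo Program

As this is really something you might want to test, you can download the sample program with a simple makefile from dropbox.

When running the program as is, you should have at least 1400 MB of RAM available. Feel free to adjust the constants in the code to suit your needs.

For completeness, the actual code follows: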

#include <stdlib.h>
#include <stdio.h>
#include <iostream>
#include <vector>
#include <deque>

#include <omp.h>
#include <math.h>

typedef unsigned long long uint64_t;

void runParallelAllocTest()
{
    // constants
    const int  NUM_ALLOCATIONS = 5000; // alloc's per thread
    const int  NUM_THREADS = 4;       // how many threads?
    const int  NUM_ITERS = NUM_THREADS;// how many overall repetions

    const bool USE_NEW      = true;   // use new or malloc? , seems to make no difference (as it should)
    const bool DEBUG_ALLOCS = false;  // debug output

    // pre store allocation sizes
    const int  NUM_PRE_ALLOCS = 20000;
    const uint64_t MEM_LIMIT = (1024 * 1024) * 256;   // x MB per process
    const size_t MAX_CHUNK_SIZE = 1024 * 1024 * 1;

    srand(1);
    std::vector<size_t> allocations;
    allocations.resize(NUM_PRE_ALLOCS);
    for (int i = 0; i < NUM_PRE_ALLOCS; i++) {
        allocations[i] = rand() % MAX_CHUNK_SIZE;   // use up to x MB chunks
    }


    #pragma omp parallel num_threads(NUM_THREADS)
    #pragma omp for
    for (int i = 0; i < NUM_ITERS; ++i) {
        uint64_t totalAllocBytes = 0;   // 64-bit counter, does not wrap at 2 GB
        uint64_t currAllocBytes = 0;

        std::deque< std::pair<char*, uint64_t> > pointers;
        const int myId = omp_get_thread_num();

        for (int j = 0; j < NUM_ALLOCATIONS; ++j) {
            // new allocation
            const size_t allocSize = allocations[(myId * 100 + j) % NUM_PRE_ALLOCS ];

            char* pnt = NULL;
            if (USE_NEW) {
                pnt = new char[allocSize];
            } else {
                pnt = (char*) malloc(allocSize);
            }
            pointers.push_back(std::make_pair(pnt, allocSize));

            totalAllocBytes += allocSize;
            currAllocBytes  += allocSize;

            // fill with values to add "delay"
            for (int fill = 0; fill < (int) allocSize; ++fill) {
                pnt[fill] = (char)(j % 255);
            }


            if (DEBUG_ALLOCS) {
                std::cout << "Id " << myId << " New alloc " << pointers.size() << ", bytes:" << allocSize << " at " << (uint64_t) pnt << "\n";
            }

            // free all or just a bit
            if (((j % 5) == 0) || (j == (NUM_ALLOCATIONS - 1))) {
                int frees = 0;

                // keep this much allocated
                // last check, free all
                uint64_t memLimit = MEM_LIMIT;
                if (j == NUM_ALLOCATIONS - 1) {
                    std::cout << "Id " << myId << " about to release all memory: " << (currAllocBytes / (double)(1024 * 1024)) << " MB" << std::endl;
                    memLimit = 0;
                }
                //MEM_LIMIT = 0; // DEBUG

                while (pointers.size() > 0 && (currAllocBytes > memLimit)) {
                    // free one of the first entries to allow previously obtained resources to 'live' longer
                    currAllocBytes -= pointers.front().second;
                    char* pnt       = pointers.front().first;

                    // free memory
                    if (USE_NEW) {
                        delete[] pnt;
                    } else {
                        free(pnt);
                    }

                    // update array
                    pointers.pop_front();

                    if (DEBUG_ALLOCS) {
                        std::cout << "Id " << myId << " Free'd " << pointers.size() << " at " << (uint64_t) pnt << "\n";
                    }
                    frees++;
                }
                if (DEBUG_ALLOCS) {
                    std::cout << "Frees " << frees << ", " << currAllocBytes << "/" << MEM_LIMIT << ", " << totalAllocBytes << "\n";
                }
            }
        } // for each allocation

        if (currAllocBytes != 0) {
            std::cerr << "Not all free'd!\n";
        }

        std::cout << "Id " << myId << " done, total alloc'ed " << ((double) totalAllocBytes / (double)(1024 * 1024)) << "MB \n";
    } // for each iteration

    exit(1);
}

int main(int argc, char** argv)
{
    runParallelAllocTest();

    return 0;
}

Test System

From what I have seen so far, the hardware matters a lot. The test might need adjustments if run on a faster machine.

Intel(R) Core(TM)2 Duo CPU     T7300  @ 2.00GHz
Ubuntu 10.04 LTS 64 bit
gcc 4.3, 4.4, 4.6
3988.62 Bogomips

Testing

Once you have executed the makefile, you should get a file named ompmemtest. To query the memory usage over time, I used the following commands:

./ompmemtest &
top -b | grep ompmemtest

This yields quite impressive fragmentation or leaking behaviour. The expected memory consumption with 4 threads is 1090 MB, which became 1500 MB over time:

PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
11626 byron     20   0  204m  99m 1000 R   27  2.5   0:00.81 ompmemtest                                                                              
11626 byron     20   0  992m 832m 1004 R  195 21.0   0:06.69 ompmemtest                                                                              
11626 byron     20   0 1118m 1.0g 1004 R  189 26.1   0:12.40 ompmemtest                                                                              
11626 byron     20   0 1218m 1.0g 1004 R  190 27.1   0:18.13 ompmemtest                                                                              
11626 byron     20   0 1282m 1.1g 1004 R  195 29.6   0:24.06 ompmemtest                                                                              
11626 byron     20   0 1471m 1.3g 1004 R  195 33.5   0:29.96 ompmemtest                                                                              
11626 byron     20   0 1469m 1.3g 1004 R  194 33.5   0:35.85 ompmemtest                                                                              
11626 byron     20   0 1469m 1.3g 1004 R  195 33.6   0:41.75 ompmemtest                                                                              
11626 byron     20   0 1636m 1.5g 1004 R  194 37.8   0:47.62 ompmemtest                                                                              
11626 byron     20   0 1660m 1.5g 1004 R  195 38.0   0:53.54 ompmemtest                                                                              
11626 byron     20   0 1669m 1.5g 1004 R  195 38.2   0:59.45 ompmemtest                                                                              
11626 byron     20   0 1664m 1.5g 1004 R  194 38.1   1:05.32 ompmemtest                                                                              
11626 byron     20   0 1724m 1.5g 1004 R  195 40.0   1:11.21 ompmemtest                                                                              
11626 byron     20   0 1724m 1.6g 1140 S  193 40.1   1:17.07 ompmemtest

Please note: I could reproduce this issue when compiling with gcc 4.3, 4.4 and 4.6 (trunk).
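
As an alternative to polling top, the resident set size can also be sampled from inside the test program. The following is a minimal sketch, assuming a Linux system where /proc/self/status contains a `VmRSS:` line reported in kB; the function names `parseVmRssKb` and `currentRssKb` are made up for this illustration and are not part of the original test program.

```cpp
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Extract the VmRSS value (in kB) from text in /proc/<pid>/status format.
// Returns -1 if no parsable VmRSS line is found.
long parseVmRssKb(std::istream& status)
{
    std::string line;
    while (std::getline(status, line)) {
        if (line.compare(0, 6, "VmRSS:") == 0) {
            std::istringstream fields(line.substr(6));
            long kb = -1;
            if (!(fields >> kb)) {
                return -1;
            }
            return kb;
        }
    }
    return -1;
}

// Resident set size of the calling process in kB (Linux only).
// Returns -1 if /proc/self/status cannot be read.
long currentRssKb()
{
    std::ifstream status("/proc/self/status");
    return parseVmRssKb(status);
}
```

Calling `currentRssKb()` every few hundred allocations inside the loop and printing it next to `currAllocBytes` would show how far the resident size drifts away from the memory the program believes it holds.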

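【Comments】:

I think you'll want to use google's tcmalloc (see the profile data in the answer)
This is a highly synthetic test; heap managers were written to take advantage of programs not allocating randomly sized chunks of memory. Fragmentation will certainly be a problem, and more threads fragment more quickly.
This test is synthetic indeed, but it was written to figure out why our actual program appears to leak although valgrind didn't find anything. It only shows the leak/fragmentation if more threads are used. As this test reproduces the problem very well, it is well suited for its intended purpose.
Purely anecdotal, but I've spent a large part of my career writing heavily multithreaded 24/7 servers in the finance industry, and memory fragmentation has never been a problem.
There are many memory allocators (Hoard, ptmalloc, tcmalloc, etc.) around for use with threaded applications - each with some advantages and disadvantages depending on what you are doing. The other day I ran across a comparison at locklessinc.com/benchmarks.shtml that you might find interesting.

Tags: c++ multithreading memory openmp fragmentation

【Answer 1】:

Yes, the default malloc (depending on the linux version) does some crazy stuff which fails massively in some multithreaded applications. Specifically, it keeps almost per-thread heaps (arenas) to avoid locking. This is much faster than a single heap for all threads, but (sometimes) massively memory inefficient. You can tune this with code like the following, which turns off multiple arenas (this kills performance, so don't do it if you have lots of small allocations!)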

rv = mallopt(-7, 1);  // M_ARENA_TEST
rv = mallopt(-8, 1);  // M_ARENA_MAX

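Or, as others suggested, use one of the various replacements for malloc.

Basically, it's impossible for a general-purpose malloc to always be efficient, as it doesn't know how it is going to be used.

Chris.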
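
For reference, the same tuning can be written with the named constants from glibc's `<malloc.h>` instead of raw numbers. This is a sketch only, assuming a glibc recent enough to define `M_ARENA_MAX`; the wrapper name `limitMallocArenas` is hypothetical. On other C libraries it compiles to a stub that returns false.

```cpp
#include <stdlib.h>
#if defined(__GLIBC__)
#include <malloc.h>   // glibc-specific: mallopt(), M_ARENA_MAX
#endif

// Ask glibc to cap the number of malloc arenas.
// Returns true if mallopt() reported success, false if it failed or the
// option does not exist on this platform.
bool limitMallocArenas(int maxArenas)
{
#ifdef M_ARENA_MAX
    return mallopt(M_ARENA_MAX, maxArenas) == 1;
#else
    (void) maxArenas;   // option unavailable: nothing to do
    return false;
#endif
}
```

The call should be made at the very start of `main()`, before any threads are created, e.g. `limitMallocArenas(1);`. glibc also documents a `MALLOC_ARENA_MAX` environment variable for the same setting, which avoids recompiling.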

【Comments】:

【Answer 2】:

OK, picked up the bait.

This is on a system with

Intel(R) Core(TM)2 Quad CPU    Q9550  @ 2.83GHz
4x5666.59 bogomips

Linux meerkat 2.6.35-28-generic-pae #50-Ubuntu SMP Fri Mar 18 20:43:15 UTC 2011 i686 GNU/Linux

gcc version 4.4.5

             total       used       free     shared    buffers     cached
Mem:       8127172    4220560    3906612          0     374328    2748796
-/+ buffers/cache:    1097436    7029736
Swap:            0          0          0

Naive run

I just ran it

time ./ompmemtest 
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB 
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB 
Id 1 about to release all memory: 257.339 MB
Id 2 about to release all memory: 257.043 MB
Id 1 done, total alloc'ed -1570.42MB 
Id 2 done, total alloc'ed -1569.96MB 

real    0m13.429s
user    0m44.619s
sys 0m6.000s

Nothing spectacular. Here is the simultaneous output of vmstat -S M 1

Vmstat raw data

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0   3892    364   2669    0    0    24     0  701 1487  2  1 97  0
 4  0      0   3421    364   2669    0    0     0     0 1317 1953 53  7 40  0
 4  0      0   2858    364   2669    0    0     0     0 2715 5030 79 16  5  0
 4  0      0   2861    364   2669    0    0     0     0 6164 12637 76 15  9  0
 4  0      0   2853    364   2669    0    0     0     0 4845 8617 77 13 10  0
 4  0      0   2848    364   2669    0    0     0     0 3782 7084 79 13  8  0
 5  0      0   2842    364   2669    0    0     0     0 3723 6120 81 12  7  0
 4  0      0   2835    364   2669    0    0     0     0 3477 4943 84  9  7  0
 4  0      0   2834    364   2669    0    0     0     0 3273 4950 81 10  9  0
 5  0      0   2828    364   2669    0    0     0     0 3226 4812 84 11  6  0
 4  0      0   2823    364   2669    0    0     0     0 3250 4889 83 10  7  0
 4  0      0   2826    364   2669    0    0     0     0 3023 4353 85 10  6  0
 4  0      0   2817    364   2669    0    0     0     0 3176 4284 83 10  7  0
 4  0      0   2823    364   2669    0    0     0     0 3008 4063 84 10  6  0
 0  0      0   3893    364   2669    0    0     0     0 4023 4228 64 10 26  0

Does that information mean anything to you?

Google Thread Caching Malloc

Now for real fun, add a little spice

time LD_PRELOAD="/usr/lib/libtcmalloc.so" ./ompmemtest 
Id 1 about to release all memory: 257.339 MB
Id 1 done, total alloc'ed -1570.42MB 
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB 
Id 2 about to release all memory: 257.043 MB
Id 2 done, total alloc'ed -1569.96MB 
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB 

real    0m11.663s
user    0m44.255s
sys 0m1.028s

Looks faster, doesn't it?

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 4  0      0   3562    364   2684    0    0     0     0 1041 1676 28  7 64  0
 4  2      0   2806    364   2684    0    0     0   172 1641 1843 84 14  1  0
 4  0      0   2758    364   2685    0    0     0     0 1520 1009 98  2  1  0
 4  0      0   2747    364   2685    0    0     0     0 1504  859 98  2  0  0
 5  0      0   2745    364   2685    0    0     0     0 1575 1073 98  2  0  0
 5  0      0   2739    364   2685    0    0     0     0 1415  743 99  1  0  0
 4  0      0   2738    364   2685    0    0     0     0 1526  981 99  2  0  0
 4  0      0   2731    364   2685    0    0     0   684 1536  927 98  2  0  0
 4  0      0   2730    364   2685    0    0     0     0 1584 1010 99  1  0  0
 5  0      0   2730    364   2685    0    0     0     0 1461  917 99  2  0  0
 4  0      0   2729    364   2685    0    0     0     0 1561 1036 99  1  0  0
 4  0      0   2729    364   2685    0    0     0     0 1406  756 100  1  0  0
 0  0      0   3819    364   2685    0    0     0     4 1159 1476 26  3 71  0

In case you wanted to compare the vmstat outputs

Valgrind --tool massif

This is the head of the ms_print output after valgrind --tool=massif ./ompmemtest (default malloc):

--------------------------------------------------------------------------------
Command:            ./ompmemtest
Massif arguments:   (none)
ms_print arguments: massif.out.beforetcmalloc
--------------------------------------------------------------------------------


    GB
1.009^                                                                     :  
     |       ##::::@@:::::::@@::::::@@::::@@::@::::@::::@:::::::::@::::::@::: 
     |       # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: 
     |       # :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: 
     |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: 
     |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::: 
     |      :# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |     ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |   ::::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |   : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |   : ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |  :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     |  :: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
     | ::: ::# :: :@ :::: ::@ : ::::@ :: :@ ::@::::@: ::@:::::: ::@::::::@::::
   0 +----------------------------------------------------------------------->Gi
     0                                                                   264.0

Number of snapshots: 63
 Detailed snapshots: [6 (peak), 10, 17, 23, 27, 30, 35, 39, 48, 56]

Google HEAPPROFILE

Unfortunately, vanilla valgrind doesn't work with tcmalloc, so I switched horses midrace to heap profiling with google-perftools

gcc openMpMemtest_Linux.cpp -fopenmp -lgomp -lstdc++ -ltcmalloc -o ompmemtest

time HEAPPROFILE=/tmp/heapprofile ./ompmemtest
Starting tracking the heap
Dumping heap profile to /tmp/heapprofile.0001.heap (100 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0002.heap (200 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0003.heap (300 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0004.heap (400 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0005.heap (501 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0006.heap (601 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0007.heap (701 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0008.heap (801 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0009.heap (902 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0010.heap (1002 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0011.heap (2029 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0012.heap (3053 MB allocated cumulatively, 1030 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0013.heap (4078 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0014.heap (5102 MB allocated cumulatively, 1031 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0015.heap (6126 MB allocated cumulatively, 1033 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0016.heap (7151 MB allocated cumulatively, 1029 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0017.heap (8175 MB allocated cumulatively, 1029 MB currently in use)
Dumping heap profile to /tmp/heapprofile.0018.heap (9199 MB allocated cumulatively, 1028 MB currently in use)
Id 0 about to release all memory: 258.144 MB
Id 0 done, total alloc'ed -1572.7MB 
Id 2 about to release all memory: 257.043 MB
Id 2 done, total alloc'ed -1569.96MB 
Id 3 about to release all memory: 257.854 MB
Id 3 done, total alloc'ed -1569.6MB 
Id 1 about to release all memory: 257.339 MB
Id 1 done, total alloc'ed -1570.42MB 
Dumping heap profile to /tmp/heapprofile.0019.heap (Exiting)

real    0m11.981s
user    0m44.455s
sys 0m1.124s

Contact me for full logs/details

Update

To the comments: I updated the program

--- omptest/openMpMemtest_Linux.cpp 2011-05-03 23:18:44.000000000 +0200
+++ q/openMpMemtest_Linux.cpp   2011-05-04 13:42:47.371726000 +0200
@@ -13,8 +13,8 @@
 void runParallelAllocTest()
 {
    // constants
-   const int  NUM_ALLOCATIONS = 5000; // alloc's per thread
-   const int  NUM_THREADS = 4;       // how many threads?
+   const int  NUM_ALLOCATIONS = 55000; // alloc's per thread
+   const int  NUM_THREADS = 8;        // how many threads?
    const int  NUM_ITERS = NUM_THREADS;// how many overall repetions

    const bool USE_NEW      = true;   // use new or malloc? , seems to make no difference (as it should)

It ran for over 5 minutes. Close to the end, a screenshot of htop shows that the resident set is indeed slightly higher, approaching 2.3g:

  1  [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%]     Tasks: 125 total, 2 running
  2  [||||||||||||||||||||||||||||||||||||||||||||||||||96.7%]     Load average: 8.09 5.24 2.37 
  3  [||||||||||||||||||||||||||||||||||||||||||||||||||97.4%]     Uptime: 01:54:22
  4  [||||||||||||||||||||||||||||||||||||||||||||||||||96.1%]
  Mem[|||||||||||||||||||||||||||||||             3055/7936MB]
  Swp[                                                  0/0MB]

  PID USER     NLWP PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 4330 sehe        8  20   0 2635M 2286M   908 R 368. 28.8 15:35.01 ./ompmemtest

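Comparing results with the tcmalloc run: 4m12s, with minor differences in the top stats; the big difference is in the VIRT set (but that isn't particularly useful unless you have a very limited address space per process?). The RES set is quite similar, if you ask me. The more important thing to note is that parallelism has increased; all cores are now maxed out. This is obviously due to the reduced need to lock for heap operations when using tcmalloc: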

If the free list is empty: (1) We fetch a bunch of objects from a central free list for this size-class (the central free list is shared by all threads). (2) Place them in the thread-local free list. (3) Return one of the newly fetched objects to the applications.

  1  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Tasks: 172 total, 2 running
  2  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Load average: 7.39 2.92 1.11 
  3  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]     Uptime: 11:12:25
  4  [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
  Mem[||||||||||||||||||||||||||||||||||||||||||||              3278/7936MB]
  Swp[                                                                0/0MB]

  PID USER     NLWP PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
14391 sehe        8  20   0 2251M 2179M  1148 R 379. 27.5  8:08.92 ./ompmemtest
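
The free-list behaviour quoted above from the tcmalloc documentation can be sketched roughly as follows. This is a toy model under stated assumptions (C++11, a single size class, objects never returned to the OS), and every name in it is hypothetical; it is not tcmalloc's actual code. It only illustrates why most allocations never touch a shared lock.

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

const std::size_t OBJECT_SIZE = 64;  // one size class only, for brevity
const std::size_t BATCH_SIZE  = 32;  // objects moved per central-list visit

// Free list shared by all threads, so every access takes the lock.
class CentralFreeList {
public:
    CentralFreeList() : lockAcquisitions_(0) {}

    // Step (1): hand out up to `count` objects, creating new ones if needed.
    void fetchBatch(std::vector<void*>& out, std::size_t count)
    {
        std::lock_guard<std::mutex> guard(lock_);
        ++lockAcquisitions_;
        while (out.size() < count) {
            void* obj = nullptr;
            if (!objects_.empty()) {
                obj = objects_.back();
                objects_.pop_back();
            } else {
                obj = std::malloc(OBJECT_SIZE);
                if (obj == nullptr) break;  // out of memory: hand out what we have
            }
            out.push_back(obj);
        }
    }

    // Take up to `count` cached objects back from a thread.
    void giveBack(std::vector<void*>& from, std::size_t count)
    {
        std::lock_guard<std::mutex> guard(lock_);
        ++lockAcquisitions_;
        for (std::size_t i = 0; i < count && !from.empty(); ++i) {
            objects_.push_back(from.back());
            from.pop_back();
        }
    }

    // How often the shared lock was taken (for illustration only).
    std::size_t lockAcquisitions()
    {
        std::lock_guard<std::mutex> guard(lock_);
        return lockAcquisitions_;
    }

private:
    std::mutex lock_;
    std::vector<void*> objects_;
    std::size_t lockAcquisitions_;
};

CentralFreeList& centralList()
{
    static CentralFreeList instance;  // one instance for the whole process
    return instance;
}

std::vector<void*>& localList()
{
    thread_local std::vector<void*> list;  // per-thread cache, no lock needed
    return list;
}

void* cachedAlloc()
{
    std::vector<void*>& local = localList();
    if (local.empty()) {
        centralList().fetchBatch(local, BATCH_SIZE);  // steps (1) and (2)
        if (local.empty()) return nullptr;
    }
    void* obj = local.back();                         // step (3)
    local.pop_back();
    return obj;
}

void cachedFree(void* obj)
{
    std::vector<void*>& local = localList();
    local.push_back(obj);                             // lock-free fast path
    if (local.size() > 2 * BATCH_SIZE) {
        centralList().giveBack(local, BATCH_SIZE);    // trim an oversized cache
    }
}
```

In this model a thread takes the shared lock once per `BATCH_SIZE` allocations instead of once per allocation, which is consistent with the lower system time and fuller core utilisation seen in the tcmalloc runs above.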

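【Comments】:

Thanks for all the tool suggestions! I will run your tests myself and see what I get. Maybe massif will be able to give me some sort of fragmentation report. From your vmstat info, it looks like you are not seeing the fragmentation issue, as your memory consumption stays constant. Could you run the simple 'top' check (see the new Testing paragraph in the question) to make the results more comparable to mine? If the issue doesn't show, try increasing the number of threads to 8 or 16 - maybe your processor is just too fast.
I just tried valgrind massif, and it seems unsuitable for measuring heap fragmentation here, as it forces the program's threads to run serially. This reduces the emphasizing effect to a minimum, listing only 32 MB of extra heap data. If fragmentation were what it measures, I would expect values of up to 400 MB on my machine.
With 8 threads, 'RES' memory never exceeds 2.1g (4025 sehe 20 0 2410m 2.1g 908 R 314 27.4 3:16.20 ompmemtest). Obviously, on PAE I can't really go up to 16 threads
It is very interesting that for you the program stays entirely within the expected allocation size, as the effect seems to be very hardware-dependent. I noticed the program runs about 4 times faster on your machine; maybe you could increase NUM_ALLOCATIONS to 20000 to adjust the runtime and hopefully reproduce the issue.
Great, with your update the issue shows up as well. The odd thing is that tcmalloc shows the same top stats, including the increased resident memory. On my machine, the memory lost to fragmentation is much higher with the default heap than with tcmalloc, which doesn't seem to be the case here.

【Answer 3】:

When the test program is linked with google's tcmalloc library, the executable not only runs ~10% faster, but also shows greatly reduced or insignificant memory fragmentation: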

PID   USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
13441 byron     20   0  379m 334m 1220 R  187  8.4   0:02.63 ompmemtestgoogle                                                                        
13441 byron     20   0 1085m 1.0g 1220 R  194 26.2   0:08.52 ompmemtestgoogle                                                                        
13441 byron     20   0 1111m 1.0g 1220 R  195 26.9   0:14.42 ompmemtestgoogle                                                                        
13441 byron     20   0 1131m 1.1g 1220 R  195 27.4   0:20.30 ompmemtestgoogle                                                                        
13441 byron     20   0 1137m 1.1g 1220 R  195 27.6   0:26.19 ompmemtestgoogle                                                                        
13441 byron     20   0 1137m 1.1g 1220 R  195 27.6   0:32.05 ompmemtestgoogle                                                                        
13441 byron     20   0 1149m 1.1g 1220 R  191 27.9   0:37.81 ompmemtestgoogle                                                                        
13441 byron     20   0 1149m 1.1g 1220 R  194 27.9   0:43.66 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1220 R  188 28.2   0:49.32 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1220 R  194 28.2   0:55.15 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1220 R  191 28.2   1:00.90 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1220 R  191 28.2   1:06.64 ompmemtestgoogle                                                                        
13441 byron     20   0 1161m 1.1g 1356 R  192 28.2   1:12.42 ompmemtestgoogle

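From the data I have, the answer appears to be:

Multithreaded access to the heap can emphasize fragmentation if the heap library in use does not deal well with concurrent access and if the processor cannot execute the threads truly concurrently.

The tcmalloc library shows no significant memory fragmentation when running the same program that previously lost ~400MB to fragmentation.

But why does that happen?

The best idea I have to offer here is some sort of locking artifact within the heap.

The test program allocates randomly sized blocks of memory, freeing blocks allocated early in the program to stay within its memory limit. When one thread is in the process of releasing old memory in a heap block on the 'left', it might be halted as another thread is scheduled to run, leaving a (soft) lock on that heap block. The newly scheduled thread wants to allocate memory, but may not even look at that heap block on the 'left' to check for free memory, as it is currently being changed. Hence it might end up unnecessarily using a new heap block on the 'right'.

This process could look like a shifting of heap blocks, where the first blocks (on the left) remain only sparsely used and fragmented, forcing new blocks to be used on the right.

Let me restate that this fragmentation issue only occurs if I use 4 or more threads on a dual-core system, which can only run two threads more or less concurrently. When only two threads are used, the (soft) locks on the heap are held briefly enough not to block the other thread that wants to allocate memory.

Also, as a disclaimer, I didn't check the actual code of the glibc heap implementation, and I am no more than a novice in the field of memory allocators - everything I wrote is just how it appears to me, which makes it pure speculation.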
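
To make the speculation above concrete, here is a toy model of an allocator that skips any heap region ("arena") that is currently busy and opens a new one instead. It is an illustration of the hypothesis only, written against C++11 with hypothetical names (`Arena`, `ArenaPool`); it is not the glibc implementation.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical model of one heap region guarded by a busy flag.
struct Arena {
    std::atomic<bool> busy;
    Arena() : busy(false) {}
};

class ArenaPool {
public:
    // Return an arena marked busy for the caller: the first one that is
    // currently idle, otherwise a newly created one.
    Arena* acquire()
    {
        std::lock_guard<std::mutex> guard(listLock_);
        for (std::size_t i = 0; i < arenas_.size(); ++i) {
            // exchange() returns the previous value: false means it was idle
            if (!arenas_[i]->busy.exchange(true)) {
                return arenas_[i].get();
            }
        }
        // Every existing arena is busy: grow the heap with a new arena.
        arenas_.push_back(std::unique_ptr<Arena>(new Arena()));
        arenas_.back()->busy.store(true);
        return arenas_.back().get();
    }

    // Mark an arena as idle again so that later callers can reuse it.
    void release(Arena* arena)
    {
        arena->busy.store(false);
    }

    std::size_t arenaCount()
    {
        std::lock_guard<std::mutex> guard(listLock_);
        return arenas_.size();
    }

private:
    std::mutex listLock_;
    std::vector<std::unique_ptr<Arena> > arenas_;
};
```

In this model, a thread that is descheduled between `acquire()` and `release()` keeps its arena busy, so every thread that runs in the meantime is pushed to another arena. With more runnable threads than cores that happens more often, which would match the observation that the effect only shows with 4 or more threads on a dual-core machine.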

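Another interesting read might be the tcmalloc documentation, which describes common problems with heaps and multithreaded access, some of which may have played their role in the test program too.

It is worth noting that tcmalloc never returns memory to the system (see the Caveats paragraph in the tcmalloc documentation)

【Comments】:

"some of which may have played their role in the test program too" -- are you kidding? If I'm not mistaken, that was the very subject of the synthetic benchmark :)
I wasn't sure which ones, hence the 'may' in the text. Feel free to rephrase it :).
No, you've got it wrong. The default heap manager has a global lock (see dlmalloc), so concurrent accesses are simply serialized. From this data you cannot conclude that memory fragmentation is related to multithreading. If you do want to make that claim, you have to compare against a single-threaded version that puts the same pressure on the heap manager.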
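
A single-threaded baseline of the kind asked for in the last comment could look like the sketch below. It interleaves several logical workers on one thread so that the heap sees a comparable mix of allocations and frees without any concurrency. The function and struct names are hypothetical, and unlike the original test it trims after every allocation rather than every fifth one.

```cpp
#include <cstddef>
#include <cstdlib>
#include <deque>
#include <utility>
#include <vector>

struct BaselineResult {
    std::size_t peakLiveBytes;     // largest sum of live bytes over all workers
    std::size_t totalAllocations;  // number of blocks allocated in total
};

// Run `numWorkers` logical workers round-robin on ONE thread.
// Each worker frees its oldest blocks once it exceeds `memLimitPerWorker`.
// `maxChunkSize` must be greater than zero.
BaselineResult runSingleThreadedBaseline(int numWorkers, int allocsPerWorker,
                                         std::size_t memLimitPerWorker,
                                         std::size_t maxChunkSize)
{
    typedef std::deque<std::pair<char*, std::size_t> > Blocks;
    std::vector<Blocks> live(numWorkers);
    std::vector<std::size_t> liveBytes(numWorkers, 0);
    BaselineResult result = {0, 0};
    std::size_t totalLive = 0;

    srand(1);  // fixed seed, as in the original test
    for (int j = 0; j < allocsPerWorker; ++j) {
        for (int w = 0; w < numWorkers; ++w) {
            const std::size_t size = 1 + rand() % maxChunkSize;
            char* block = new char[size];
            block[0] = (char) (j % 255);  // touch the block so it is mapped
            live[w].push_back(std::make_pair(block, size));
            liveBytes[w] += size;
            totalLive += size;
            ++result.totalAllocations;
            if (totalLive > result.peakLiveBytes) {
                result.peakLiveBytes = totalLive;
            }
            // Free the oldest blocks first until the worker is under its limit.
            while (!live[w].empty() && liveBytes[w] > memLimitPerWorker) {
                delete[] live[w].front().first;
                liveBytes[w] -= live[w].front().second;
                totalLive    -= live[w].front().second;
                live[w].pop_front();
            }
        }
    }
    // Release everything that is still live.
    for (int w = 0; w < numWorkers; ++w) {
        while (!live[w].empty()) {
            delete[] live[w].front().first;
            live[w].pop_front();
        }
    }
    return result;
}
```

Running this with 4 workers and the constants from the question, while watching RES in top, would give a reference figure: if the resident size stays near the expected ~1050 MB here but grows in the OpenMP version, the growth can be attributed to the threaded allocation path rather than to the allocation pattern itself.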