【Question Title】: Scalable Memory Allocation using INTEL TBB
【Posted】: 2017-10-02 09:59:47
【Question Description】:

I want to allocate about 40 GB of RAM. My first attempt was:

#include <iostream>
#include <ctime>

int main(int argc, char** argv)
{
    unsigned long long  ARRAYSIZE = 20ULL * 1024ULL * 1024ULL * 1024ULL;
    unsigned __int16 *myBuff = new unsigned __int16[ARRAYSIZE];  // 3GB/s  40GB / 13.7 s
    unsigned long long i = 0;
    const clock_t begintime = clock(); 
    for (i = 0; i < ARRAYSIZE; ++i){
        myBuff[i] = 0;
    }
    std::cout << "finish:  " << float(clock() - begintime) / CLOCKS_PER_SEC << std::endl;
    std::cin.get();
    delete [] myBuff;
    return 0;
}

The memory write speed is about 3 GB/s, which is unsatisfactory for my high-performance system.

So I tried Intel Cilk Plus as follows:

    /*
    nworkers =  5;  8.5 s ==> 4.7 GB/s
    nworkers =  8;  8.2 s ==> 4.8 GB/s
    nworkers =  10; 9   s ==> 4.5 GB/s
    nworkers =  32; 15  s ==> 2.6 GB/s
    */

#include "cilk\cilk.h"
#include "cilk\cilk_api.h"
#include <iostream>
#include <ctime>

int main(int argc, char** argv)
{
    unsigned long long  ARRAYSIZE = 20ULL * 1024ULL * 1024ULL * 1024ULL;
    unsigned __int16 *myBuff = new unsigned __int16[ARRAYSIZE];
    if (0 != __cilkrts_set_param("nworkers", "32")){
        std::cout << "Error" << std::endl;
    }
    const clock_t begintime = clock();
    cilk_for(long long j = 0; j < ARRAYSIZE; ++j){
        myBuff[j] = 0;
    }
    std::cout << "finish:  " << float(clock() - begintime) / CLOCKS_PER_SEC << std::endl;
    std::cin.get();
    delete [] myBuff;
    return 0;
}

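The results are in the comment above the code. As you can see, there is a speedup with nworkers = 8, but the larger nworkers gets, the slower the allocation becomes. I thought this might be due to locking between threads, so I tried the scalable allocator provided by Intel TBB: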

#include "tbb\task_scheduler_init.h"
#include "tbb\blocked_range.h"
#include "tbb\parallel_for.h"
#include "tbb\scalable_allocator.h"
#include "cilk\cilk.h"
#include "cilk\cilk_api.h"
#include <iostream>
#include <ctime>
#include <new>  // std::bad_alloc, std::nothrow_t
// No retry loop because we assume that scalable_malloc does
// all it takes to allocate the memory, so calling it repeatedly
// will not improve the situation at all
//
// No use of std::new_handler because it cannot be done in portable
// and thread-safe way (see sidebar)
//
// We throw std::bad_alloc() when scalable_malloc returns NULL
//(we return NULL if it is a no-throw implementation)

void* operator new (size_t size) throw (std::bad_alloc)
{
    if (size == 0) size = 1;
    if (void* ptr = scalable_malloc(size))
        return ptr;
    throw std::bad_alloc();
}

void* operator new[](size_t size) throw (std::bad_alloc)
{
    return operator new (size);
}

void* operator new (size_t size, const std::nothrow_t&) throw ()
{
    if (size == 0) size = 1;
    if (void* ptr = scalable_malloc(size))
        return ptr;
    return NULL;
}

void* operator new[](size_t size, const std::nothrow_t&) throw ()
{
    return operator new (size, std::nothrow);
}

void operator delete (void* ptr) throw ()
{
    if (ptr != 0) scalable_free(ptr);
}

void operator delete[](void* ptr) throw ()
{
    operator delete (ptr);
}

void operator delete (void* ptr, const std::nothrow_t&) throw ()
{
    if (ptr != 0) scalable_free(ptr);
}

void operator delete[](void* ptr, const std::nothrow_t&) throw ()
{
    operator delete (ptr, std::nothrow);
}



int main(int argc, char** argv)
{
    unsigned long long  ARRAYSIZE = 20ULL * 1024ULL * 1024ULL * 1024ULL;
    tbb::task_scheduler_init tbb_init;
    unsigned __int16 *myBuff = new unsigned __int16[ARRAYSIZE];
    if (0 != __cilkrts_set_param("nworkers", "10")){
        std::cout << "Error" << std::endl;
    }
    const clock_t begintime = clock();
    cilk_for(long long j = 0; j < ARRAYSIZE; ++j){
        myBuff[j] = 0;
    }
    std::cout << "finish:  " << float(clock() - begintime) / CLOCKS_PER_SEC << std::endl;

    std::cin.get();
    delete [] myBuff;
    return 0;
}

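(The code above is adapted from the Intel TBB book by James Reinders, O'REILLY.) But the results are almost identical to the previous attempt. I set the TBB_VERSION environment variable to check whether I am really using scalable_malloc; the information I got is in this picture (nworkers = 32):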

https://www.dropbox.com/s/y1vril3f19mkf66/TBB_Info.png?dl=0

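I would like to know what is wrong with my code. I expect a memory write speed of at least about 40 GB/s.
How should I use the scalable allocator correctly?
Can somebody please present a simple verified example of using the scalable allocator from INTEL TBB?

Environment: Intel Xeon CPU E5-2690 0 @ 2.90 GHz (2 processors), 224 GB RAM (2 * 7 * 16 GB) DDR3 1600 MHz, Windows Server 2008 R2 Datacenter, Microsoft Visual Studio 2013 and Intel C++ compiler 2017.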

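【Discussion】:

You say the performance is unsatisfactory. What makes you think you should be able to write at 40 GB/s or more?
Based on the system configuration. Indeed, after initialization the memory write speed is about 50 GB/s.
You seem to use it correctly, except that you never free the allocated memory. But your question is actually a mess, because you suddenly switch from correct allocator usage to memory write speed expectations. Even worse, you try to measure it by running a pointless array-filling loop that will surely be eliminated by the compiler in release mode.
@gnts B can stand for bytes or bits. For example, if you measure 5 GB/s (bytes) on a system specified at 50 Gb/s (bits), then you are at 80% of the theoretical capacity. I am not sure which units Intel uses in its marketing.
If you have the optimizer enabled, the compiler will discard the loop in your first example. Enable the optimizer!

Tags: c++ memory-management tbb scalable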

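【Solution 1】:

What happens

From Wikipedia: "DDR3-xxx denotes data transfer rate, and describes DDR chips, whereas PC3-xxxx denotes theoretical bandwidth (with the last two digits truncated), and is used to describe assembled DIMMs. Bandwidth is calculated by taking transfers per second and multiplying by eight. This is because DDR3 memory modules transfer data on a bus that is 64 data bits wide, and since a byte comprises 8 bits, this equates to 8 bytes of data per transfer."

So a single DDR3-1600 module can reach at most 1600 * 8 = 12800 MB/s. With 4 channels (per processor) in your system, you should be able to reach:

12800 * 4 = 51200 MB/s, i.e. 51.2 GB/s, which is what the CPU specifications state.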
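
The arithmetic above can be written as a small helper. This is only a sketch: the name peak_bandwidth_mb_per_s is made up for illustration, and it assumes a 64-bit data bus (8 bytes per transfer) as in the quoted passage. It gives a theoretical ceiling, not a measured rate.

```cpp
#include <cstdint>

// Theoretical peak bandwidth in MB/s for DDR memory on a 64-bit bus:
// transfer rate (in mega-transfers per second) times 8 bytes per transfer,
// times the number of channels. The function name is hypothetical.
constexpr std::uint64_t peak_bandwidth_mb_per_s(std::uint64_t mega_transfers_per_s,
                                                std::uint64_t channels)
{
    return mega_transfers_per_s * 8u * channels;
}

// The two figures from the text: one DDR3-1600 channel, then four of them.
static_assert(peak_bandwidth_mb_per_s(1600, 1) == 12800, "one DDR3-1600 channel");
static_assert(peak_bandwidth_mb_per_s(1600, 4) == 51200, "four DDR3-1600 channels");
```

Passing a lower transfer rate or a different channel count gives the corresponding ceiling for other configurations.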

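And

You have two CPUs and 8 channels: you should be able to reach twice that when working in parallel. However, your system is a NUMA system, and in that case memory placement matters...

But

You can place more than one memory bank per channel. When more modules are placed in a channel, the available timings are reduced: for example, PC-1600 may behave like PC-1333 or less. This is usually reported in the motherboard specifications. Example here.

You have seven modules, so your channels are not populated equally... Your bandwidth is limited by the slowest channel. It is recommended to populate all channels identically.

If you are downclocked to 1333, you can expect about 1333 * 8 ≈ 10666 MB/s per channel (DDR3-1333 actually runs at 1333⅓ MT/s):

42 GB/s per CPU

However

Channels are interleaved in the address space, so all channels are used when zeroing a block of memory. You would only run into performance problems when accessing memory with strided access.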
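
To make the interleaving argument concrete, here is a toy model. It is a deliberate simplification, not the real address map of any CPU: the 64-byte interleave granularity, the 4 channels and both function names are assumptions chosen for illustration. It only shows why a sequential sweep spreads over all channels while an unlucky stride can keep hitting the same one.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model: consecutive 64-byte lines rotate across the channels.
// Both constants are illustrative assumptions, not a real address map.
constexpr std::uint64_t kLineBytes = 64;
constexpr std::size_t kChannels = 4;

// Channel that a given byte address maps to in this toy model.
std::size_t channel_of(std::uint64_t address)
{
    return static_cast<std::size_t>((address / kLineBytes) % kChannels);
}

// Counts how many accesses land on each channel when touching `count`
// addresses that start at 0 and are `stride_bytes` apart.
std::vector<std::size_t> channel_histogram(std::uint64_t stride_bytes, std::size_t count)
{
    assert(stride_bytes > 0);
    std::vector<std::size_t> hits(kChannels, 0);
    for (std::size_t i = 0; i < count; ++i)
    {
        ++hits[channel_of(i * stride_bytes)];
    }
    return hits;
}
```

In this model, channel_histogram(64, 1024) spreads the accesses evenly over the four channels, while channel_histogram(256, 1024) puts every access on channel 0.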

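Memory allocation is not memory access

The TBB scalable allocator lets many threads optimize memory allocation. That is, there is no global lock during allocation, so memory allocation is not bottlenecked by the activity of other threads. That is what often happens with the OS allocator.

In your example you are not using many allocations at all, just one main thread. You are trying to get the maximum memory bandwidth. Memory access does not change when you use a different allocator.

Reading the comments, I see that you want to optimize memory access.

Optimizing memory access

Replace the zeroing loop with a single call to memset() and let the compiler optimize/inline it. /O2 should be enough.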
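
A minimal sketch of that replacement, for a buffer of 16-bit elements like the one in the question. The helper name zero_buffer is made up here. The one detail worth showing is that memset takes a size in bytes, so the element count has to be scaled by the element size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Zeroes `count` 16-bit elements with a single memset call.
// memset expects a byte count, hence the multiplication by sizeof.
// The function name is hypothetical.
void zero_buffer(std::uint16_t* data, std::size_t count)
{
    assert(data != nullptr);
    std::memset(data, 0, count * sizeof(std::uint16_t));
}
```

With the buffer from the question, the call would be zero_buffer(myBuff, ARRAYSIZE) in place of the for loop, assuming unsigned __int16 and std::uint16_t are the same 16-bit type on that compiler.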

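Rationale

The Intel compiler replaces many library calls (memset, memcpy, ...) with optimized intrinsics/inline calls. In this case, i.e. zeroing a large block of memory, inlining does not matter much, but using the optimized intrinsic is very important: it will use an optimized version with streaming instructions: SSE4.2 / AVX

In any case, a basic libc memset will outperform any hand-written loop. At least on Linux.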

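【Discussion】:

"In your example you are not using many allocations at all, just one main thread." What do you mean? I use cilk_for to initialize my array, so the memory is written by multiple threads. Another thing to mention is that when I write to my array a second time, I get the maximum memory bandwidth (about 72 GB/s). IOW, once the memory is actually backed by physical RAM (the first-touch concept), I get the maximum memory bandwidth. But the first time, as mentioned, it runs slowly!
"In your example you are not using many allocations at all, just one main thread." So what do you suggest for not using just one main thread for the allocation? Thanks.
Try the first code with the for loop replaced by memset(). And find out the exact frequency of your memory modules to determine the actual bandwidth you should get.
I tried the first code with memset() instead of the for loop. The time for 40 GB was 12.52 seconds, which means 3.3 GB/s. No noticeable speedup!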
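
The first-touch effect mentioned in the comments above can be checked in isolation with something like the following. It is only a sketch under stated assumptions: the 4096-byte page size and both function names are made up for illustration, and it assumes that a large new[] request hands back pages the OS has not yet backed with physical RAM. It writes one byte per page twice and reports both timings, so the cost of bringing the pages in shows up separately from ordinary memory writes.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <memory>
#include <utility>

// Assumed page size; 4 KiB is typical on x86-64 but not guaranteed.
constexpr std::size_t kAssumedPageBytes = 4096;

// Writes one byte per assumed page and returns the elapsed wall-clock seconds.
// The volatile pointer keeps the compiler from dropping the stores.
double timed_touch(volatile unsigned char* buffer, std::size_t bytes, unsigned char value)
{
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t offset = 0; offset < bytes; offset += kAssumedPageBytes)
    {
        buffer[offset] = value;
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Returns {first pass seconds, second pass seconds} for a freshly allocated buffer.
std::pair<double, double> first_touch_vs_second_pass(std::size_t bytes)
{
    assert(bytes > 0);
    // new[] without an initializer does not write to the buffer, so for a
    // large request the pages are usually not yet backed by physical RAM.
    std::unique_ptr<unsigned char[]> buffer(new unsigned char[bytes]);
    const double first = timed_touch(buffer.get(), bytes, 1);
    const double second = timed_touch(buffer.get(), bytes, 2);
    assert(buffer[0] == 2); // the second pass overwrote the first
    return {first, second};
}
```

If the first pass comes out much slower than the second for a buffer of a few hundred megabytes, the time is going into page faults rather than into the memory bus, which would match the behaviour described for the 40 GB buffer.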

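【Solution 2】:

I can at least tell you why you are not getting above 25 GB/s.

According to Intel, the maximum RAM bandwidth of your CPU is 51.2 GB/s. According to Wikipedia, the maximum bandwidth of DDR3-1600 is 25.6 GB/s.

This means that at least 2 RAM channels must be in use before you can expect to exceed 25 GB/s. If you want to get close to 40-50, this would have to be the case almost constantly.

To do that, you would have to know how the OS splits memory addresses across the RAM slots, and parallelize the loop so that the parallel memory accesses actually land on 2 RAM addresses that can be accessed in parallel. If the parallelized accesses hit nearby addresses at the "same" time, they are likely on the same RAM stick and use only one RAM channel, limiting the rate to the theoretical 25 GB/s. You might even need something that can split the allocation into chunks at separate addresses across several RAM slots, depending on how the RAM addresses are distributed over the slots.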

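【Discussion】:

I think four channels are used. I use scalable_allocator from INTEL TBB to allocate the memory in parallel. Do you mean that is not enough? Should I do something to my OS?
Only if you can guarantee that your program will constantly access memory on at least 2 channels. That depends on how your hardware splits physical memory into memory addresses and on how your program allocates/accesses your buffers. Having multiple RAM channels does not magically double the speed of one RAM module. You only get the bonus when all channels can be accessed at the same time.
Memory channels are accessed in an interleaved fashion: they are all used when accessing a large block of memory sequentially. This is a hardware architecture detail that is transparent to OS activity.

【Solution 3】:

(continued from the comments)

Here is a performance test using built-in functions, for reference. It measures the time needed to reserve (by calling VirtualAlloc) and bring into physical RAM (by calling VirtualLock) a 40 GB memory block.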

#include <sdkddkver.h>
#include <Windows.h>

#include <intrin.h>

#include <array>
#include <iostream>
#include <memory>
#include <fcntl.h>
#include <io.h>
#include <stdio.h>

void
Handle_Error(const ::LPCWSTR psz_what)
{
    const auto error_code{::GetLastError()};
    ::std::array<::WCHAR, 512> buffer;
    const auto format_result
    (
        ::FormatMessageW
        (
            FORMAT_MESSAGE_FROM_SYSTEM
        ,   nullptr
        ,   error_code
        ,   0
        ,   buffer.data()
        ,   static_cast<::DWORD>(buffer.size())
        ,   nullptr
        )
    );
    const auto formatted{0 != format_result};
    if(!formatted)
    {
        const auto & default_message{L"no description"};
        ::memcpy(buffer.data(), default_message, sizeof(default_message));
    }
    buffer.back() = L'\0'; // just in case
    _setmode(_fileno(stdout), _O_U16TEXT);
    ::std::wcout << psz_what << ", error # " << error_code << ": " << buffer.data() << ::std::endl;
    system("pause");
    exit(-1);
}

void
Enable_Previllege(const ::LPCWSTR psz_name)
{
    ::TOKEN_PRIVILEGES tkp{};
    if(FALSE == ::LookupPrivilegeValueW(nullptr, psz_name, ::std::addressof(tkp.Privileges[0].Luid)))
    {
        Handle_Error(L"LookupPrivilegeValueW call failed");
    }
    const auto this_process_handle(::GetCurrentProcess()); // Returns pseudo handle (HANDLE)-1, no need to call CloseHandle
    ::HANDLE token_handle{};
    if(FALSE == ::OpenProcessToken(this_process_handle, TOKEN_ADJUST_PRIVILEGES | TOKEN_QUERY, ::std::addressof(token_handle)))
    {
        Handle_Error(L"OpenProcessToken call failed");
    }
    if(NULL == token_handle)
    {
        Handle_Error(L"OpenProcessToken call returned invalid token handle");
    }
    tkp.PrivilegeCount = 1;
    tkp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    if(FALSE == ::AdjustTokenPrivileges(token_handle, FALSE, ::std::addressof(tkp), 0, nullptr, nullptr))
    {
        Handle_Error(L"AdjustTokenPrivileges call failed");
    }
    if(FALSE == ::CloseHandle(token_handle))
    {
        Handle_Error(L"CloseHandle call failed");
    }
}

int main()
{
    constexpr const auto bytes_count{::SIZE_T{40} * ::SIZE_T{1024} * ::SIZE_T{1024} * ::SIZE_T{1024}};
    //  Make sure we can adjust working set size and lock memory.
    Enable_Previllege(SE_INCREASE_QUOTA_NAME);
    Enable_Previllege(SE_LOCK_MEMORY_NAME);
    //  Make sure our working set is sufficient to hold that block + some little extra.
    constexpr const ::SIZE_T working_set_bytes_count{bytes_count + ::SIZE_T{4 * 1024 * 1024}};
    if(FALSE == ::SetProcessWorkingSetSize(::GetCurrentProcess(), working_set_bytes_count, working_set_bytes_count))
    {
        Handle_Error(L"SetProcessWorkingSetSize call failed");
    }
    //  Start timer.
    ::LARGE_INTEGER start_time;
    if(FALSE == ::QueryPerformanceCounter(::std::addressof(start_time)))
    {
        Handle_Error(L"QueryPerformanceCounter call failed");
    }
    //  Run test.
    const ::SIZE_T min_large_page_bytes_count{::GetLargePageMinimum()}; // if 0 then not supported
    const ::DWORD allocation_flags
    {
        (0u != min_large_page_bytes_count)
        ?
        ::DWORD{MEM_COMMIT | MEM_RESERVE} // | MEM_LARGE_PAGES} // need to enable large pages support for current user first
        :
        ::DWORD{MEM_COMMIT | MEM_RESERVE}
    };
    if((0u != min_large_page_bytes_count) && (0u != (bytes_count % min_large_page_bytes_count)))
    {
        Handle_Error(L"bytes_count value is not suitable for large pages");
    }
    constexpr const ::DWORD protection_flags{PAGE_READWRITE};
    const auto p{::VirtualAlloc(nullptr, bytes_count, allocation_flags, protection_flags)};
    if(!p)
    {
        Handle_Error(L"VirtualAlloc call failed");
    }
    if(FALSE == ::VirtualLock(p, bytes_count))
    {
        Handle_Error(L"VirtualLock call failed");
    }
    //  Stop timer.
    ::LARGE_INTEGER finish_time;
    if(FALSE == ::QueryPerformanceCounter(::std::addressof(finish_time)))
    {
        Handle_Error(L"QueryPerformanceCounter call failed");
    }
    //  Cleanup.
    if(FALSE == ::VirtualUnlock(p, bytes_count))
    {
        Handle_Error(L"VirtualUnlock call failed");
    }
    if(FALSE == ::VirtualFree(p, 0, MEM_RELEASE))
    {
        Handle_Error(L"VirtualFree call failed");
    }
    //  Report results.
    ::LARGE_INTEGER freq;
    if(FALSE == ::QueryPerformanceFrequency(::std::addressof(freq)))
    {
        Handle_Error(L"QueryPerformanceFrequency call failed");
    }
    const auto elapsed_time_ms{((finish_time.QuadPart - start_time.QuadPart) * ::LONGLONG{1000u}) / freq.QuadPart};
    const auto rate_mbytesps{(bytes_count * ::SIZE_T{1000}) / static_cast<::SIZE_T>(elapsed_time_ms)};
    _setmode(_fileno(stdout), _O_U16TEXT);
    ::std::wcout << elapsed_time_ms << " ms " << rate_mbytesps << " MB/s " << ::std::endl;
    system("pause");
    return 0;
}

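On my system, Windows 10 Pro, Xeon E3 1245 V5 @ 3.5GHz, 64 GB DDR4 (4x16), it outputs:

8188 ms 5245441250 MB/s

This code seems to use only one core. The maximum according to the CPU specs is 34.1 GB/s. Your first code snippet takes about 11.5 seconds (VS does not omit the loop in release mode).

Enabling large pages might improve things somewhat. Also note that pages locked with VirtualLock cannot go to swap, unlike pages that are zeroed manually. Large pages cannot go to swap at all.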
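
The rate printed by the program above is really in bytes per second, because the code multiplies by 1000 and divides by the elapsed milliseconds but never divides by the size of a megabyte. A sketch of the conversion, with hypothetical helper names and 1 MiB = 1024 * 1024 bytes as an assumed unit:

```cpp
#include <cstdint>

// Bytes per second from a byte count and an elapsed time in milliseconds.
// elapsed_ms must be non-zero. Both function names are hypothetical.
constexpr std::uint64_t bytes_per_second(std::uint64_t bytes, std::uint64_t elapsed_ms)
{
    return (bytes * 1000u) / elapsed_ms;
}

// The same rate expressed in MiB/s (1 MiB = 1024 * 1024 bytes).
constexpr std::uint64_t mebibytes_per_second(std::uint64_t bytes, std::uint64_t elapsed_ms)
{
    return bytes_per_second(bytes, elapsed_ms) / (1024u * 1024u);
}

// 40 GiB in 8188 ms, the figures reported above.
constexpr std::uint64_t kFortyGiB = std::uint64_t{40} * 1024 * 1024 * 1024;
static_assert(bytes_per_second(kFortyGiB, 8188) == 5245441250ull, "value printed by the program");
static_assert(mebibytes_per_second(kFortyGiB, 8188) == 5002, "about 5 GB/s");
```

So the measured rate for reserving and locking the 40 GB block is roughly 5 GB/s.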

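【Discussion】:

I ran similar code using VirtualAllocExNuma, but when I used MEM_COMMIT | MEM_RESERVE | MEM_LARGE_PAGES instead of MEM_COMMIT | MEM_RESERVE, GetLastError returned error 1314, which means: A required privilege is not held by the client.
By the way, what does 5245441250 MB/s mean?
@GntS I just forgot to divide by the size of a megabyte, so it is actually B/s. Error 1314 most likely means that you have not enabled large pages for the current user.
How do I enable large pages?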
@GntS See this question.