[Posted]: 2017-10-02 09:59:47
[Question]:
I want to allocate about 40 GB of RAM. My first attempt was:
#include <iostream>
#include <ctime>
int main(int argc, char** argv)
{
    unsigned long long ARRAYSIZE = 20ULL * 1024ULL * 1024ULL * 1024ULL;
    unsigned __int16 *myBuff = new unsigned __int16[ARRAYSIZE]; // 3GB/s 40GB / 13.7 s
    unsigned long long i = 0;
    const clock_t begintime = clock();
    for (i = 0; i < ARRAYSIZE; ++i){
        myBuff[i] = 0;
    }
    std::cout << "finish: " << float(clock() - begintime) / CLOCKS_PER_SEC << std::endl;
    std::cin.get();
    delete [] myBuff;
    return 0;
}
The memory write speed is about 3 GB/s, which is unsatisfactory for my high-performance system.
So I tried Intel Cilk Plus as follows:
/*
nworkers = 5; 8.5 s ==> 4.7 GB/s
nworkers = 8; 8.2 s ==> 4.8 GB/s
nworkers = 10; 9 s ==> 4.5 GB/s
nworkers = 32; 15 s ==> 2.6 GB/s
*/
#include "cilk\cilk.h"
#include "cilk\cilk_api.h"
#include <iostream>
#include <ctime>
int main(int argc, char** argv)
{
    unsigned long long ARRAYSIZE = 20ULL * 1024ULL * 1024ULL * 1024ULL;
    unsigned __int16 *myBuff = new unsigned __int16[ARRAYSIZE];
    if (0 != __cilkrts_set_param("nworkers", "32")){
        std::cout << "Error" << std::endl;
    }
    const clock_t begintime = clock();
    cilk_for(long long j = 0; j < ARRAYSIZE; ++j){
        myBuff[j] = 0;
    }
    std::cout << "finish: " << float(clock() - begintime) / CLOCKS_PER_SEC << std::endl;
    std::cin.get();
    delete [] myBuff;
    return 0;
}
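The results are in the comment above the code. As can be seen, there is a speedup with nworkers = 8, but the larger nworkers gets, the slower the allocation becomes. I thought this might be due to thread locking, so I tried the scalable allocator provided by Intel TBB: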
#include "tbb\task_scheduler_init.h"
#include "tbb\blocked_range.h"
#include "tbb\parallel_for.h"
#include "tbb\scalable_allocator.h"
#include "cilk\cilk.h"
#include "cilk\cilk_api.h"
#include <iostream>
#include <ctime>
// No retry loop because we assume that scalable_malloc does
// all it takes to allocate the memory, so calling it repeatedly
// will not improve the situation at all
//
// No use of std::new_handler because it cannot be done in portable
// and thread-safe way (see sidebar)
//
// We throw std::bad_alloc() when scalable_malloc returns NULL
// (we return NULL if it is a no-throw implementation)
void* operator new (size_t size) throw (std::bad_alloc)
{
    if (size == 0) size = 1;
    if (void* ptr = scalable_malloc(size))
        return ptr;
    throw std::bad_alloc();
}
void* operator new[](size_t size) throw (std::bad_alloc)
{
    return operator new (size);
}
void* operator new (size_t size, const std::nothrow_t&) throw ()
{
    if (size == 0) size = 1;
    if (void* ptr = scalable_malloc(size))
        return ptr;
    return NULL;
}
void* operator new[](size_t size, const std::nothrow_t&) throw ()
{
    return operator new (size, std::nothrow);
}
void operator delete (void* ptr) throw ()
{
    if (ptr != 0) scalable_free(ptr);
}
void operator delete[](void* ptr) throw ()
{
    operator delete (ptr);
}
void operator delete (void* ptr, const std::nothrow_t&) throw ()
{
    if (ptr != 0) scalable_free(ptr);
}
void operator delete[](void* ptr, const std::nothrow_t&) throw ()
{
    operator delete (ptr, std::nothrow);
}
int main(int argc, char** argv)
{
    unsigned long long ARRAYSIZE = 20ULL * 1024ULL * 1024ULL * 1024ULL;
    tbb::task_scheduler_init tbb_init;
    unsigned __int16 *myBuff = new unsigned __int16[ARRAYSIZE];
    if (0 != __cilkrts_set_param("nworkers", "10")){
        std::cout << "Error" << std::endl;
    }
    const clock_t begintime = clock();
    cilk_for(long long j = 0; j < ARRAYSIZE; ++j){
        myBuff[j] = 0;
    }
    std::cout << "finish: " << float(clock() - begintime) / CLOCKS_PER_SEC << std::endl;
    std::cin.get();
    delete [] myBuff;
    return 0;
}
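(The code above is adapted from the Intel TBB book by James Reinders, O'Reilly.) But the results are almost identical to the previous attempt. I set the TBB_VERSION environment variable to check whether I am really using scalable_malloc; the information I got is in this picture (nworkers = 32):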
https://www.dropbox.com/s/y1vril3f19mkf66/TBB_Info.png?dl=0
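I would like to know what is wrong with my code. I expected a memory write speed of at least about 40 GB/s.
How should I use the scalable allocator correctly?
Can someone provide a simple, verified example that uses the scalable allocator from Intel TBB?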
Environment: Intel Xeon CPU E5-2690 0 @ 2.90 GHz (2 processors), 224 GB RAM (2 * 7 * 16 GB) DDR3 1600 MHz, Windows Server 2008 R2 Datacenter, Microsoft Visual Studio 2013, and Intel C++ Compiler 2017.
[Comments]:
-
You say the performance is unsatisfactory. What makes you think you should be able to write at a rate of at least 40 GB/s?
-
Based on the system configuration. Of course, after initialization the memory write speed is about 50 GB/s.
-
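You seem to be using it correctly, except that you never free the allocated memory. But your question is actually a mess, because you suddenly switch from correct allocator usage to memory write speed expectations. Worse, you try to measure it by running a pointless array-filling loop, which will surely be eliminated by the compiler in release mode.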
-
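@gnts B can stand for bytes or bits. For example, if you measure 5 GB on a system specified at 50 Gb, you are at 80% of the theoretical capacity. I can't tell which unit Intel uses in its marketing.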
-
If you have the optimizer enabled, the compiler will throw away the loop in your first example. Enable the optimizer!
Tags: c++ memory-management tbb scalable