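【Posted】: 2021-09-28 10:50:07
【Question】:
I am experimenting with bit-field operations and benchmarking them, following the information in this post. The code I'm using is essentially the same and is shown below.
I compiled the code with: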
❯ g++ bench.cpp -std=c++20 -march=native -O3 -o g++bench.out
❯ clang++ bench.cpp -std=c++20 -march=native -O3 -o clang++bench.out
Results:
❯ ./g++bench.out
operations on struct in memory
bitfields: 0.00443397
570425344
separate ints: 0.00320708
570425344
explicit and/or/shift: 0.0721971
570425344
operations on struct larger than memory
bitfields: 0.202714
570425344
separate ints: 0.127191
570425344
explicit and/or/shift: 0.102186
570425344
❯ ./clang++bench.out
operations on struct in memory
bitfields: 0.00304556
570425344
separate ints: 0.00291514
570425344
explicit and/or/shift: 0.00276303
570425344
operations on struct larger than memory
bitfields: 0.00350051
570425344
separate ints: 0.116294
570425344
explicit and/or/shift: 0.0909704
570425344
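What struck me most is that clang's bit-field code on the large vector is nearly 30 times faster than the clang builds that use separate ints or explicit and/or/shift, and 58 times faster than the g++ build of the bit-field version.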
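Since the operations on the struct in memory all run in about the same time with clang, I suspect the operations themselves are not specially optimized, and that clang is instead doing some clever memory fetching or loop unrolling.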
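Can anyone explain why the clang bit-field code is so fast in this case (or whether it is simply a bug in the benchmark)?
I would also like to know whether the benchmark code can be adjusted so that g++ gets the same speedup.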
#include <stdlib.h>  // exit, EXIT_FAILURE
#include <time.h>
#include <iostream>
#include <vector>
struct A
{
    void a(unsigned n) { a_ = n; }
    void b(unsigned n) { b_ = n; }
    void c(unsigned n) { c_ = n; }
    void d(unsigned n) { d_ = n; }
    unsigned a() { return a_; }
    unsigned b() { return b_; }
    unsigned c() { return c_; }
    unsigned d() { return d_; }
    unsigned a_:1,
             b_:5,
             c_:2,
             d_:8;
};
struct B
{
    void a(unsigned n) { a_ = n; }
    void b(unsigned n) { b_ = n; }
    void c(unsigned n) { c_ = n; }
    void d(unsigned n) { d_ = n; }
    unsigned a() { return a_; }
    unsigned b() { return b_; }
    unsigned c() { return c_; }
    unsigned d() { return d_; }
    unsigned a_, b_, c_, d_;
};
struct C
{
    void a(unsigned n) { x_ &= ~0x01; x_ |= n; }
    void b(unsigned n) { x_ &= ~0x3E; x_ |= n << 1; }
    void c(unsigned n) { x_ &= ~0xC0; x_ |= n << 6; }
    void d(unsigned n) { x_ &= ~0xFF00; x_ |= n << 8; }
    unsigned a() const { return x_ & 0x01; }
    unsigned b() const { return (x_ & 0x3E) >> 1; }
    unsigned c() const { return (x_ & 0xC0) >> 6; }
    unsigned d() const { return (x_ & 0xFF00) >> 8; }
    unsigned x_;
};
struct Timer
{
    Timer() { get(&start_tp); }
    double elapsed() const {
        struct timespec end_tp;
        get(&end_tp);
        return (end_tp.tv_sec - start_tp.tv_sec) +
               (1E-9 * end_tp.tv_nsec - 1E-9 * start_tp.tv_nsec);
    }
private:
    static void get(struct timespec* p_tp) {
        if (clock_gettime(CLOCK_REALTIME, p_tp) != 0)
        {
            std::cerr << "clock_gettime() error\n";
            exit(EXIT_FAILURE);
        }
    }
    struct timespec start_tp;
};
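template <typename T>
unsigned f()
{
    unsigned n = 0;  // unsigned, matching the return type; the sum wraps modulo 2^32
    Timer timer;
    T t{};  // value-initialize: C's setters read x_ before writing it
    for (int i = 0; i < 1024*1024*32; ++i)
    {
        t.a(i & 0x01);
        t.b(i & 0x1F);
        t.c(i & 0x03);
        t.d(i & 0xFF);
        n += t.a() + t.b() + t.c() + t.d();
    }
    std::cout << timer.elapsed() << '\n';
    return n;
}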
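template <typename T>
unsigned g()
{
    unsigned n = 0;  // unsigned, matching the return type; the sum wraps modulo 2^32
    // Note: the timer starts before the vector is constructed, so allocating
    // and zero-initializing the vector is part of the measured time.
    Timer timer;
    std::vector<T> ts(1024 * 1024 * 16);
    for (size_t i = 0, idx = 0; i < 1024*1024*32; ++i)
    {
        T& t = ts[idx];
        t.a(i & 0x01);
        t.b(i & 0x1F);
        t.c(i & 0x03);
        t.d(i & 0xFF);
        n += t.a() + t.b() + t.c() + t.d();
        idx++;
        if (idx >= ts.size()) {
            idx = 0;
        }
    }
    std::cout << timer.elapsed() << '\n';
    return n;
}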
int main()
{
    std::cout << "operations on struct in memory" << std::endl;
    std::cout << "bitfields: " << f<A>() << '\n';
    std::cout << "separate ints: " << f<B>() << '\n';
    std::cout << "explicit and/or/shift: " << f<C>() << '\n';
    std::cout << std::endl;
    std::cout << "operations on struct larger than memory" << std::endl;
    std::cout << "bitfields: " << g<A>() << '\n';
    std::cout << "separate ints: " << g<B>() << '\n';
    std::cout << "explicit and/or/shift: " << g<C>() << '\n';
    std::cout << std::endl;
}
【Comments】:
- Don't tag C for C++ questions.
Tags: c++ g++ clang++ bit-fields