排列数组元素的最快方法答案

【问题标题】：Fastest way to permute elements of an array排列数组元素的最快方法
【发布时间】：2021-01-05 20:28:59
【问题描述】：

我想创建一个函数模板来置换数组的元素（ints 或 uint8_ts，或者其他），其中模板参数确定应用于数组的置换。我主要关心的是性能，所以函数需要尽可能快。以下是我目前的尝试，包含一些性能测试代码。

#include <array>
#include <chrono>
#include <iostream>
#include <stdexcept>

template<typename T, std::size_t MaxSize = 12>
struct Vector{
    std::array<T, MaxSize> m_data{};
    std::size_t m_size{0};

    constexpr Vector() = default;
    constexpr Vector(std::initializer_list<T> list){
        for(auto t : list){
            push_back(t);
        }
    }
    constexpr void pop_back(){
        m_size--;
    }
    constexpr void push_back(T t){
        m_data[m_size] = t;
        m_size++;
    }
    constexpr std::size_t size() const{
        return m_size;
    }
    constexpr T& operator[](int n){
        if(n < 0 || n >= m_size){
            throw std::out_of_range("Vector index is out of bounds");
        }
        return m_data[n];
    }
    constexpr const T& operator[](int n) const{
        if(n < 0 || n >= m_size){
            throw std::out_of_range("Vector index is out of bounds");
        }
        return m_data[n];
    }
};

struct Foo{
    std::array<uint8_t, 8> arr {5,2,4,7,1,3,0,6};

    template<Vector<Vector<int>> cycles>
    void permute1(){
        for(int j=0; j<cycles.size(); j++){
            const Vector<int> &cycle = cycles[j];
            const int size = cycle.size();

            int temp = arr[cycle[0]];
            for(int k=0; k<size-1; k++){
                arr[cycle[k]] = arr[cycle[k+1]];
            }
            arr[cycle[size-1]] = temp;
        }
    }

    template<Vector<Vector<Vector<int>>> cycles>
    void permute2(){
        for(int i=0; i<cycles.size(); i++){
            for(int j=0; j<cycles[i].size(); j++){
                const Vector<int> &cycle = cycles[i][j];
                const int size = cycle.size();

                int temp = arr[cycle[0]];
                for(int k=0; k<size-1; k++){
                    arr[cycle[k]] = arr[cycle[k+1]];
                }
                arr[cycle[size-1]] = temp;
            }
        }
    }

    void permute3(){
        int x = arr[2];
        arr[2] = arr[7];
        arr[7] = arr[6];
        arr[6] = arr[3];
        arr[3] = x;
        x = arr[1];
        arr[1] = arr[5];
        arr[5] = x;
    }
};

int main(){
    constexpr Vector<Vector<int>> v1{{2,7,6,3},{1,5}};
    constexpr Vector<Vector<Vector<int>>> v2{{{2,7,6,3},{1,5}}};

    Foo f;

    auto t1 = std::chrono::high_resolution_clock::now();
    for(int i=0; i<1000000000; i++){
        f.permute1<v1>();
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    for(int i=0; i<1000000000; i++){
        f.permute2<v2>();
    }
    auto t3 = std::chrono::high_resolution_clock::now();
    for(int i=0; i<1000000000; i++){
        f.permute3();
    }
    auto t4 = std::chrono::high_resolution_clock::now();

    auto d1 = std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count();
    auto d2 = std::chrono::duration_cast<std::chrono::microseconds>(t3-t2).count();
    auto d3 = std::chrono::duration_cast<std::chrono::microseconds>(t4-t3).count();

    std::cout << "permute1: " << d1 << " us\n";
    std::cout << "permute2: " << d2 << " us\n";
    std::cout << "permute3: " << d3 << " us\n";

    //use f in some way so the compiler doesn't completely optimise everything away
    return f.arr[2];
}

解释：

permute1 取 Vectors 中的 Vector 对应于置换的不相交循环，并相应地置换 arr。
permute2 做同样的事情，但 Vectors 的 Vector 包含在另一个 Vector 中。我想这样做的原因是因为在我的真实代码中，我的 Foo 结构包含多个需要排列的数组，我想以不同的方式排列它们。
permute3 只是 permute1 和 permute2 所做的硬编码的具体示例。理想情况下，我希望 permute1 和 permute2 编译成与 permute3 编译成相同的汇编代码。

结果：

我正在使用CXX main.cpp -std=c++20 -O3 进行编译，其中CXX 是g++ 或clang++。我正在使用 gcc 10.1 和 clang 主干。以下是该程序的各种运行结果：

上面写的代码的结果：

function |    g++     |  clang++  
----------------------------------
permute1 | 6562221 μs |   44437 μs
permute2 | 6778576 μs | 4760381 μs
permute3 |  601333 μs |   39218 μs

删除了1, 5 循环的结果：

function |    g++     |  clang++  
----------------------------------
permute1 |  616979 μs |   28000 μs
permute2 | 1512795 μs | 2433898 μs
permute3 |  601762 μs |   23977 μs

删除了1, 5 循环并且Foo::arr 是ints 数组而不是uint8_ts 数组的结果：

function |    g++     |  clang++  
----------------------------------
permute1 |  607681 μs |   32471 μs
permute2 |  997331 μs | 2431138 μs
permute3 |  624083 μs |   24020 μs

三个函数的结果有__attribute__ ((always_inline))

function |    g++     |  clang++  
----------------------------------
permute1 | 6545779 μs |   43103 μs
permute2 | 6799692 μs |  919098 μs
permute3 |  603414 μs |   24288 μs

上面编写的代码的结果，使用配置文件引导优化编译

function |    g++     |  clang++  
----------------------------------
permute1 | 5478490 μs |   43818 μs
permute2 | 7274339 μs | 5398051 μs
permute3 |  731976 μs |   44324 μs

问题：

为什么其中一些功能之间的速度差异如此之大？我本来希望他们都差不多。特别是，我不明白为什么 permute1 和 permute2 完全不同。
如何修改permute2 使其与permute3 一样快（最好编译成相同的汇编代码）？

【问题讨论】：

FWIW，标准库已经有一个工具：std::next_permutation
@NathanOliver 但我没有计算下一个排列。我正在对数组应用固定排列，例如循环 4 个元素。
您是否尝试过强制循环一直展开？ Clang 拼写为#pragma clang loop unroll(full)。更好的是：用模板编写循环，而不是 fors。
@HTNW 我刚刚测试了#pragma clang loop unroll(full)，它没有任何区别。除非绝对必要，否则我不想使用模板循环，因为它们非常丑陋。
@Ben Odd，将其中一个贴在permute2 的外环上可以让我（在 Godbolt、Clang 主干上）加速 5 倍。但它还不够聪明，无法弄清楚最里面的循环要做什么......

标签： c++ performance optimization permutation

【解决方案1】：

你的循环是运行时的，所以你失去了constexpr，使用相当于 constexpr 循环似乎甚至可以更好地优化，即使是手写的：

template<Vector<Vector<int>> cycles>
void permute4(){
    auto loop2 = [&]<std::size_t ...Js>(std::index_sequence<Js...>, auto I)
    {
        constexpr Vector<int> cycle = cycles[I];

        int temp = arr[cycle[0]];
        ((arr[cycle[Js]] = arr[cycle[Js+1]]), ...);
        arr[cycle[sizeof...(Js)]] = temp;
    };

    [&]<std::size_t ...Is>(std::index_sequence<Is...>)
    {
        (loop2(std::make_index_sequence<cycles[Is].size() - 1>(),
               std::integral_constant<std::size_t, Is>{}),
        ...);
    }(std::make_index_sequence<cycles.size()>());
}

Demo（确保行为相同）

Benchmark

【讨论】：

这真是令人印象深刻，无需保留我的答案。