优化 Eratosthenes 筛，在布尔数组中添加轮子答案

【问题标题】：Optimizing Sieve of Eratosthenes, Adding Wheel to Bool Array优化 Eratosthenes 筛，在布尔数组中添加轮子
【发布时间】：2018-08-03 09:45:54
【问题描述】：

我知道有一些关于如何实现*的帖子，但我真的很难用我目前的筛子方法来实现一个。

我在 C 中创建了自己的位数组，实现如下：

#define setBit1(Array, i) (Array[i/INT_BITS] |= (1 << i % INT_BITS))
#define getBit(Array, i) ((Array[i/INT_BITS] & (1 << i % INT_BITS)) ? 1 : 0)
#define setBit0(Array, i) (Array[i/INT_BITS] &= ~(1 << i % INT_BITS))

int * createBitArray(unsigned long long size) {

    // find bytes required, round down to nearest whole byte
    unsigned long long bytesRequired = size / BITS_PERBYTE;

    // round up to highest multiple of 4 or 8 bytes (one int) 
    bytesRequired = (sizeof(int) * (bytesRequired / sizeof(int) + 
        ((size % BITS_PERBYTE * sizeof(int) == 0) ? 0 : 1)));

    // allocate array of "bits", round number of ints required up 
    return (int *)malloc((bytesRequired)); 

}

我使用 clock() 在 C 语言中做了一些测试，我发现对于超过 1,000,000 的大数组大小，位数组，即使使用它的位操作，至少比它快 200%一个整数数组。它也使用了 1/32 的内存。

#define indexToNum(n) (2*n + 1) 
#define numToIndex(n) ((n - 1) / 2)
typedef unsigned long long LONG; 

// populates prime array through Sieve of Eratosthenes, taking custom 
// odd keyed bit array, and the raw array length, as arguments
void findPrimes(int * primes, LONG arrLength) {

    long sqrtArrLength = (long)((sqrt((2 * arrLength) + 1) - 1) / 2); 
    long maxMult = 0; 
    long integerFromIndex = 0; 

    for (int i = 1; i <= sqrtArrLength; i++) {

         if (!getBit(primes, i)) {
             integerFromIndex = indexToNum(i); 
             maxMult = (indexToNum(arrLength)) / integerFromIndex; 

             for (int j = integerFromIndex; j <= maxMult; j+= 2) {
                 setBit1(primes, numToIndex((integerFromIndex*j)));     
            }

        }   
    }
}

我一直在用索引 i 填充位数组，该索引表示通过 (2i + 1) 获得的数字。这样做的好处是减少了迭代偶数所花费的任何时间，并再次将数组的必要内存减少了一半。 2 是手动添加到素数之后。这是由于在索引和数字之间转换所花费的时间，但在我的测试中，对于 1,000 多个素数，这种方法更快。

我不知道如何进一步优化；我减少了数组大小，我只测试 sqrt(n)，我开始从 p * p 向上“筛选”素数，我已经消除了所有偶数，我仍然需要大约 60 秒C 中的前 100,000,000 个素数。

据我所知，“wheel”方法要求将数字的实际整数存储在索引中。我真的坚持用我当前的位数组来实现它。

【问题讨论】：

抱歉，*是什么？
Tomas Oliveira y Silva（我可能拼错了）写出了最快的筛子，并在http://sweet.ua.pt/tos/software/prime_sieve.html 上描述了它。
你需要在 malloc() 之后用零填充该数组
数组在 malloc 后隐式初始化为零。一个一个地打印所有位证实了这一点。这是意料之外的……？我正在使用 GCC C99。
*.com/questions/8029584/…

标签： arrays algorithm optimization primes sieve-of-eratosthenes

【解决方案1】：

当我修复您的实现并在我的 Macbook Pro 上运行它时，标记所有复合材料需要 17 秒

不过，您还可以尝试其他一些方法。使用*可能会将您的时间缩短一半。

如果仔细实现，欧拉筛是线性时间的，但它需要一个 int 数组而不是位数组。

阿特金筛需要线性时间，非常实用：https://en.wikipedia.org/wiki/Sieve_of_Atkin

最后是我自己的（这只是意味着我没有在其他任何地方看到它，但我也没有真正看过）超级有趣的筛子也需要线性时间，并在 6.5 秒内找到所有素数

#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <memory.h>
#include <math.h>

#define SETBIT(mem,num) ( ((uint8_t *)mem)[(num)>>4] |= ((uint8_t)1)<<(((num)>>1)&7) )

int main(int argc, char *argv[])
{
    //we'll find all primes <= this
    uint32_t MAXTEST = (1L<<31)-1;
    //which will be less than this many
    size_t MAXPRIMES = 110000000;

    //We'll find this many primes with the sieve of Eratosthenes.
    //After that, we'll switch to a linear time algorithm
    size_t NUMPREMARK = 48;

    //Allocate a bit array for all odd numbers <= MAXTEST
    size_t NBYTES = (MAXTEST>>4)+1;
    uint8_t *mem = malloc(NBYTES);
    memset(mem, 0, NBYTES);

    //We'll store the primes in here
    unsigned *primes = (unsigned *)malloc(MAXPRIMES*sizeof(unsigned));
    size_t nprimes = 0;


    clock_t start_t = clock();

    //first premark with sieve or Eratosthenes
    primes[nprimes++]=2;
    for (uint32_t test=3; nprimes<NUMPREMARK; test+=2)
    {
        if ( mem[test>>4] & ((uint8_t)1<<((test>>1)&7)) )
        {
            continue;
        }
        primes[nprimes++]=test;
        uint32_t inc=test<<1;
        for(uint32_t prod=test*test; prod<=MAXTEST; prod+=inc)
        {
            SETBIT(mem,prod);
        }
    }


    //Iterate through all products of the remaining primes and mark them off, 
    //in linear time. Products are visited in lexical order of their  
    //prime factorizations, with factors in descending order.

    //stacks containing the current prime indexes and partial products for 
    //prefixes of the current product
    size_t stksize=0;
    size_t indexes[64];
    uint32_t products[64];

    for (uint32_t test=primes[NUMPREMARK-1]+2; test<=MAXTEST; test+=2)
    {
        if ( mem[test>>4] & ((uint8_t)1<<((test>>1)&7)) )
        {
            continue;
        }

        //found a prime!  iterate through all products that start with this one
        //They can only contain the same or smaller primes

        //current product
        uint32_t curprod = (uint32_t)test;
        indexes[0] = nprimes;
        products[0] = curprod;
        stksize = 1;

        //remember the found prime (first time through, nprimes == NUMPREMARK)
        primes[nprimes++] = curprod;

        //when we extend the product, we add the first non-premarked prime
        uint32_t extensionPrime = primes[NUMPREMARK];
        //which we can only do if the current product is at most this big
        uint32_t extensionMax = MAXTEST/primes[NUMPREMARK];

        while (curprod <= extensionMax)
        {
            //extend the product with the smallest non-premarked prime
            indexes[stksize]=NUMPREMARK;
            products[stksize++]=(curprod*=extensionPrime);
            SETBIT(mem, curprod);
        }

        for (;;)
        {
            //Can't extend current product.
            //Pop the stacks until we get to a factor we can increase while keeping
            //the factors in descending order and keeping the product small enough
            if (--stksize <= 0)
            {
                //didn't find one
                break;
            }
            if (indexes[stksize]>=indexes[stksize-1])
            {
                //can't increase this factor without breaking descending order
                continue;
            }

            uint64_t testprod=products[stksize-1];
            testprod*=primes[++(indexes[stksize])];
            if (testprod>MAXTEST)
            {
                //can't increase this factor without exceeding our array size
                continue;
            }
            //yay! - we can increment here to the next composite product
            curprod=(uint32_t)testprod;
            products[stksize++] = curprod;
            SETBIT(mem, curprod);

            while (curprod <= extensionMax)
            {
                //extend the product with the smallest non-premarked prime
                indexes[stksize]=NUMPREMARK;
                products[stksize++]=(curprod*=extensionPrime);
                SETBIT(mem, curprod);
            }
        }
    }
    clock_t end_t = clock();
    printf("Found %ld primes\n", nprimes);
    free(mem);
    free(primes);
    printf("Time: %f\n", (double)(end_t - start_t) / CLOCKS_PER_SEC);
}

请注意，我的筛子从比您的优化更好的筛子或 Eratosthenes 开始。主要区别在于我们只为位掩码数组中的奇数分配位。该部分的速度差异并不显着。

【讨论】：

“阿特金……很实用”你是从个人经验说的吗？我已经在链接的 WP 文章中搜索了“实用”这个词，它似乎在说完全相反的内容，具有很强的信念和非常详细的论据。 :) 我也怀疑“欧拉的筛子......需要一个 int 数组而不是一个位数组”。
@WillNess by 'practical'，我的意思是实现不必非常复杂（尽管你可以做很多复杂的优化），你可以为筛子使用位掩码，它是线性时间，并且它支持分段。 W.r.t.欧拉筛，为了实现线性时间，您必须能够在恒定时间内从列表中删除复合材料，并在线性时间内迭代剩余的复合材料。这需要一个不仅仅是位掩码的数据结构。 int 数组是我所知道的最紧凑的“简单”基础。
对于欧拉筛，我们只需要直到上限 sqrt 的素数。所以筛子本身可以是位数组，一个额外的 sqrt 大小的 int 数组会产生最小的影响。至于支持分割，我不这么认为。对于每个下一步，我们都需要上一步的完全互质数。我很想在这里被显示错了。至于阿特金斯，它确实看起来非常复杂，而且有人声称，似乎是有根据的，即使那样它也会输给 2、3、5、...11（或...17）轮式 SoE。
啊，我想我确实是错误的重新位数组。所有互质数都是整数，如果在位数组中保留为非归零条目，访问它们会引入额外的复杂性因素，是的。（另外，我们必须使用重复乘法，这也对速度不利）。分割问题仍然存在。
我对支持分割的评论是针对阿特金的筛子。您说得对，优化可以使 SoE 更快。当你开始交易优化时，你可以来回走动，我认为现在最快的实际实现是 Eratosthenes 筛子：primesieve.org O(N) 和 O(N log log N) 之间的差异不足以主导优化

【解决方案2】：

由于位操作开销，它总是会更慢。

但您可以尝试优化它。

setBit1(Array, i) 可以通过使用所有预移位位的常量数组来改进（我称之为ONE_BIT）

新：
#define setBit1(Array, i) (Array[i/INT_BITS] |= ONE_BIT[ i % INT_BITS])

setBit0(Array, i)也一样
新：
#define setBit0(Array, i) (Array[i/INT_BITS] &= ALL_BUT_ONE_BIT[ i % INT_BITS])

INT_BITS 也很可能是 2 的幂，因此您可以替换
i % INT_BITS
by
i & (INT_BITS-1) // 因为你应该将INT_BITS-1 存储在一个常量中并使用它

这是否可以以及如何加速您的代码，必须通过分析每次更改来检查。

【讨论】：

编译器不会优化模运算符吗？ *.com/questions/20393373/… 我采纳了您关于创建常量数组的建议，我很高兴地说，对于 10000 亿个素数，它以时钟时间衡量的速度略微提高了 9.7%。然而，对于 1,000,000 个素数，它并没有更快。为此，在每个条件下分别收集 10,000,000 和 1,000,000 个素数的 30 和 150 个数据点，然后取平均值。我觉得这有点太认真了。
@Good Rice：取决于你的编译器。 INT_BITS 也需要是一个常量，否则编译器无法优化它。此外，这种 % 优化仅对 unsigned 值有效。正如我所提到的，尝试它并分析结果以检查是否有任何改进。