使第一个质数达到极限的最快方法答案

【问题标题】：Fastest way to get the first prime numbers up to a limit使第一个质数达到极限的最快方法
【发布时间】：2020-08-05 17:36:51
【问题描述】：

就 CPU 而言，使第一个质数达到极限的最快方法是什么？

【问题讨论】：

这能回答你的问题吗？ Fastest way to list all primes below N
@PresidentJamesK.Polk 不，因为这个问题不仅限于 Python
答案也不限于python，因为它们很容易翻译成Java和其他语言。
@PresidentJamesK.Polk נראה לי שאתה תעדיף לקרוא בשפה שאתה מבין。 בליתרגומים
好吧，好吧，我放弃，你赢了:)

标签： numbers primes prime-factoring

【解决方案1】：

在我的机器上计算前 1,000,000,000 个数字需要 4.8 秒。比从文件中读取还要快！

    public static void main(String[] args) {
        long startingTime = System.nanoTime();
        
        int limit = 1_000_000_000;
        BitSet bitSet = findPrimes(limit);

        System.out.println(2);
        System.out.println(3);
        System.out.println(5);
        System.out.println(7);
        limit = limit / 2 + 1;
        for (int i = 6; i < limit; i++) {
            if (!bitSet.get(i)) {
                int p = i * 2 - 1;
                //System.out.println(p);
            }
        }

        System.out.println("Done in " + (System.nanoTime() - startingTime) / 1_000_000 + " milliseconds");
    }

    public static BitSet findPrimes(int limit) {
        BitSet bitSet = new BitSet(limit / 2 + 1);
        int size = (int) Math.sqrt(limit);
        size += size % 2;
        for (int prime = 3; prime < size; prime += 2) {
            if (bitSet.get(prime / 2 + 1)) {
                continue;
            }
            for (int i = prime / 2 + 1; i < bitSet.size(); i += prime) {
                bitSet.set(i);
            }
        }

        return bitSet;
    }

计算素值最有效的方法是使用Sieve of Eratosthenes，它最早是在几千年前的古希腊发现的。

然而，问题在于它需要大量内存。当我声明一个 1,000,000,000 的布尔数组时，我的 Java 应用程序简直崩溃了。

所以我做了两个内存优化来让它工作

由于 2 之后的所有素数都是奇数，我通过将索引映射到 int p = i * 2 - 1; 将数组大小减半
然后，我没有使用布尔数组，而是使用了BitSet，它可以对长数组进行位操作。

通过使用 Eratosthenes 筛进行的这两项优化，您可以在 4.8 秒内计算 btSet。至于打印出来，那就另当别论了；）

【讨论】：

我只是好奇，bitset到底是做什么的？我希望在 python 中实现它，因为它很棒，我只需要更多关于数学运算 bitset 正在做什么的信息。提前致谢
根本不需要太多内存。使用分段筛。每个段的大小为 sqrt(N)。
不是 1B [素数] 数字，而是最高 1B 的素数。低于 51M 的素数。
@WillNess 我很想看到你实现它。我很难相信即使理论上它应该具有相同的性能。
理论上它必须更快，因为缓存一致性，当段大小适合 L1（或 L2 甚至，不知道）。 stackoverflow.com/a/10249801/849891，stackoverflow.com/search?q=user%3A849891+offset+sieve，stackoverflow.com/search?tab=votes&q=user%3a849891%20sieve（顺便说一句，C++ 中的简单非分段筛：ideone.com/fapob）

【解决方案2】：

有多种方法可以快速计算素数——“最快”最终将取决于许多因素（硬件、算法、数据格式等的组合）。这已经在这里得到了回答：Fastest way to list all primes below N。然而，这是针对 Python 的，当我们考虑 C（或其他低级语言）时，这个问题变得更加有趣

你可以看到：

渐近地，阿特金筛具有更好的计算复杂度（O(N) 或更低* - 见注释），但在实践中，埃拉托色尼筛 (O(N log log N)) 非常好，并且它具有极易实施的优势：

/* Calculate 'isprime[i]' to tell whether 'i' is prime, for all 0 <= i < N */
void primes(int N, bool* isprime) {
    int i, j;
    /* first, set odd numbers to prime, evens to not-prime */
    for (i = 0; i < N; ++i) isprime[i] = i % 2 == 1;
    
    /* special cases for i<3 */
    isprime[0] = isprime[1] = false;

    /* Compute square root via newton's algorithm 
     * (this can be replaced by 'i * i <= N' in the foor loop)
     */
    int sqrtN = (N >> 10) + 1024;
    for (i = 0; i < 32; ++i) sqrtN = (N / sqrtN + sqrtN) / 2;

    /* iterate through all odd numbers */
    isprime[2] = true;
    for (i = 3; i <= sqrtN /* or: i*i <= N */; i += 2) {
        /* check if this odd number is still prime (i.e. if it has not been marked off) */
        if (isprime[i]) {

            /* mark off multiples of the prime */
            j = 2 * i;
            isprime[j] = false;
            for (j += i; j < N; j += 2 * i) {
                isprime[j] = false;
            }
        }
    }
}

该代码非常清晰简洁，但需要大约 6 秒来计算质数 N * sizeof(bool) == N == 1000000000 字节的内存。不太好

我们可以使用 bit-fiddling 和 uint64_t 类型（使用 #include <stdint.h>）来生成以相同复杂度运行的代码，但希望更快并且使用更少的内存。我们可以立即将内存减少到 N/8（每个布尔值 1 位而不是 1 字节）。此外，我们只能计算奇数并减少更多内存 - 2 倍，使我们的最终使用量达到N/16 字节。

代码如下：


/* Calculate that the (i//2)%64'th bit of 'isprime[(i//2)/64]' to tell whether 'i' is prime, for all 0 <= i < N,
 *   with 'i % 2 == 0'
 * 
 * Since the array 'isprime' can be seen as a long bistring:
 * 
 * isprime:
 * [0]       [1]          ...
 * +---------+---------+
 * |01234....|01234....|  ...
 * +---------+---------+
 * 
 * Where each block is a 64 bit integer. Therefore, we can map the odd natural numbers to this with the following formula:
 * i -> index=((i / 2) / 64), bit=((i / 2) % 64)
 * 
 * So, '1' is located at 'primes[0][0]' (where the second index represnts the bit),
 *     '3' is at 'primes[0][1]', 5 at 'primes[0][2]', and finally, 129 is at 'primes[1][0]'
 * 
 * And so forth.
 * 
 * Then, we use that value as a boolean to indicate whether the 'i' that maps to it is prime
 * 
 */
void primes_bs(int N, uint64_t* isprime) {
    uint64_t i, j, b/* block */, c/* bit */, v, ii;
    /* length of the array is roundup(N / 128) */
    int len = (N + 127) / 128;
    /* set all to isprime */
    for (i = 0; i < len; ++i) isprime[i] = ~0;
    
    /* set i==1 to not prime */
    isprime[0] &= ~1ULL;

    /* Compute square root via newton's algorithm */
    int sqrtN = (N >> 10) + 1024;
    for (i = 0; i < 32; ++i) sqrtN = (N / sqrtN + sqrtN) / 2;

    /* Iterate through every word/chunk and handle its bits */
    uint64_t chunk;
    for (b = 0; b <= (sqrtN + 127) / 128; ++b) {
        chunk = isprime[b];
        c = 0;
        while (chunk && c < 64) {
            if (chunk & 1ULL) {
                /* hot bit, so is prime */
                i = 128 * b + 2 * c + 1;
                /* iterate 3i, 5i, 7i, etc... 
                 * BUT, j is the index, so is basically 3i/2, 5i/2, 7i/2,
                 *   that way we don't need as many divisions per iteration,
                 *   and we can use 'j += i' (since the index only needs 'i'
                 *   added to it, whereas the value needs '2i')
                 */
                for (j = 3 * i / 2; j < N / 2; j += i) {
                    isprime[j / 64] &= ~(1ULL << (j % 64));
                }
            }

            /* chunk will not modify itself */
            c++;
            chunk = isprime[b] >> c;
        }
    }

    /* set extra bits to 0 -- unused */
    if (N % 128 != 0) isprime[len - 1] &= (1ULL << ((N / 2) % 64)) - 1;
}

这段代码在我的机器上大约 2.7 秒内计算出所有小于 1000000000 (1_000_000_000) 的数字的素数，这是一个很大的改进！而且您使用的内存要少得多（是其他方法使用的内存的 1/16。

但是，如果您在其他代码中使用它，界面会比较棘手——您可以使用函数来提供帮助：

/* Calculate if 'x' is prime from a dense bitset */
bool calc_is_prime(uint64_t* isprime, uint64_t x) {
  /**/ if (x == 2) return true; /* must be included, since 'isprime' only encodes odd numbers */
  else if (x % 2 == 0) return false;
  return (isprime[(x / 2) / 64] >> ((x / 2) % 64))) & 1ULL;
}

您可以更进一步，将这种映射索引的方法推广到仅奇数值 (https://en.wikipedia.org/wiki/Wheel_factorization)，但生成的代码可能会更慢——因为内部循环和操作不再是 2 的幂，您要么必须查找表格，要么必须使用一些非常聪明的技巧。您想要使用轮分解的唯一情况是如果您的内存非常有限，在这种情况下，primorial(3) == 30 的轮分解将允许您为每 7 位存储 30 个数字（而不是每 1 位存储 2 个数字）。或者，比如primorial(5) == 2310，您可以为每 341 个存储 2310 个数字

然而，大轮子分解将再次将一些指令从位移和二的幂运算更改为表查找，对于大轮子，这实际上可能最终会慢得多

注意事项：

从技术上讲，存在 O(N / (log log N)) 的阿特金筛网版本，但您不再使用相同的密集数组（如果是，复杂度必须至少为 O(N)）

【讨论】：