Parallel sieve of Eratosthenes produces wrong output based on number of threads

【Question title】: Parallel sieve of Eratosthenes produces wrong output based on number of threads
【Posted】: 2021-10-29 08:00:22
【Question description】:

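This is the approach I use for parallelization:

p = process ID, N = total number of processes, n = input size
Each process is assigned n/N numbers, ranging from index p*n/N to (p+1)*n/N-1.
When process 0 finds a prime, all of its multiples are marked in parallel.
Finally, each process counts the primes in its assigned range and adds that to a global count.

The parallelization uses C++/OpenMP. Before calling the function, I set the maximum number of threads with omp_set_num_threads().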

// p is thread ID, N is total number of threads, n is the array size
#define LOWER_BOUND(p, N, n) p == 0 ? 2 : (p * n / N)
#define UPPER_BOUND(p, N, n) p == (N - 1) ? (n - 1) : (p + 1) * (n / N) - 1

size_t parallel_soe(std::vector<std::atomic<bool>>& A) {
    size_t n = A.size();
    size_t sqrt_n = sqrt(n);
    for (size_t i = 2; i <= sqrt_n; i++) {
        if (A[i] == true) continue;
        #pragma omp parallel
        {
            uint16_t p = omp_get_thread_num();
            uint16_t N = omp_get_num_threads();
            size_t lower_bound = LOWER_BOUND(p, N, n);
            size_t upper_bound = UPPER_BOUND(p, N, n);
            size_t smallest_multiple = std::max(i * i, lower_bound);
            size_t remainder = smallest_multiple % i;
            if (remainder) smallest_multiple += i - remainder;
            for (size_t j = smallest_multiple; j <= upper_bound; j += i)
                A[j] = true;
        }
    }
    // Count number of primes
    size_t prime_count = 0;
    #pragma omp parallel
    {
        uint16_t p = omp_get_thread_num();
        uint16_t N = omp_get_num_threads();
        size_t lower_bound = LOWER_BOUND(p, N, n);
        size_t upper_bound = UPPER_BOUND(p, N, n);
        size_t count = 0;
        for (size_t i = lower_bound; i <= upper_bound; i++)
            if (A[i] == false)
                count++;
        #pragma omp atomic
        prime_count += count;
    }
    return prime_count;
}

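The function returns the correct value when the maximum number of threads is set to 2, 4, 8, 10, or 16, but not for 6, 12, or 14. I am running it on a 4-core Intel i5.

Here is my output log: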

Finding primes under: 10000
================================
[2-parallel]  Found 1229 primes in 542 microseconds
[4-parallel]  Found 1229 primes in 3173 microseconds
[6-parallel]  Found 1228 primes in 2353 microseconds
[8-parallel]  Found 1229 primes in 3600 microseconds
[10-parallel] Found 1229 primes in 3227 microseconds
[12-parallel] Found 1227 primes in 2778 microseconds
[14-parallel] Found 1226 primes in 2248 microseconds
[16-parallel] Found 1229 primes in 2320 microseconds

Finding primes under: 100000
================================
[2-parallel]  Found 9592 primes in 5186 microseconds
[4-parallel]  Found 9592 primes in 10351 microseconds
[6-parallel]  Found 9591 primes in 9859 microseconds
[8-parallel]  Found 9592 primes in 8500 microseconds
[10-parallel] Found 9592 primes in 12294 microseconds
[12-parallel] Found 9591 primes in 8300 microseconds
[14-parallel] Found 9582 primes in 9252 microseconds
[16-parallel] Found 9592 primes in 8557 microseconds

Finding primes under: 1000000
================================
[2-parallel]  Found 78498 primes in 36091 microseconds
[4-parallel]  Found 78498 primes in 43570 microseconds
[6-parallel]  Found 78498 primes in 48176 microseconds
[8-parallel]  Found 78498 primes in 44201 microseconds
[10-parallel] Found 78498 primes in 43645 microseconds
[12-parallel] Found 78498 primes in 49175 microseconds
[14-parallel] Found 78494 primes in 47411 microseconds
[16-parallel] Found 78498 primes in 53602 microseconds
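
As a cross-check for the counts in the log above, a minimal single-threaded sieve can serve as a reference. This is only a sketch and not the code that produced the log; the name serial_soe and the use of std::vector<char> are assumptions made for illustration, and it counts primes strictly below n.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial reference: counts primes strictly below n
// (indices 2..n-1), matching the "primes under n" wording of the log.
std::size_t serial_soe(std::size_t n) {
    if (n < 3) return 0;                // there are no primes below 2
    std::vector<char> composite(n, 0);  // composite[k] != 0 means k is not prime
    for (std::size_t i = 2; i * i < n; ++i) {
        if (composite[i]) continue;     // i was already marked, so it is not prime
        for (std::size_t j = i * i; j < n; j += i)
            composite[j] = 1;           // mark every multiple of the prime i
    }
    std::size_t count = 0;
    for (std::size_t k = 2; k < n; ++k)
        if (!composite[k]) ++count;
    return count;
}

// Optional self-check against the well-known prime counts.
inline void check_known_counts() {
    assert(serial_soe(10000) == 1229);
    assert(serial_soe(100000) == 9592);
    assert(serial_soe(1000000) == 78498);
}
```

Comparing the parallel result against serial_soe(A.size()) for each thread count makes a wrong answer show up as a mismatch.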

【Discussion】:

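You have two separate #pragma omp parallel regions. You could test which one is at fault.
The sieve does not lend itself to easy parallelization. Your code is complex, evidently error-prone, and hard to maintain, and the more threads you throw at it, the worse it performs. Above all, even your best result appears to be 10 times slower than a naive single-threaded implementation. Marking non-primes involves too little work to justify the overhead of all the thread synchronization.
You need a prime number of threads
@paddy I only showed output for smaller inputs (up to 10^6), so of course the overhead is not justified. Which optimizations would you recommend?
It won't fix the problem in your code, but various coding standards discourage putting function-like behavior in macros (see the references section of this SonarSource rule). If you write LOWER_BOUND and UPPER_BOUND as functions, the compiler will likely inline both, since each can be implemented as a single statement.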
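
To illustrate the last comment, the two macros could be written as inline functions along these lines. This is a sketch under assumptions: the names lower_bound_of and upper_bound_of are hypothetical, n >= N is assumed so that n / N is at least 1, and both bounds deliberately use the same block width n / N so that adjacent ranges meet.

```cpp
#include <cassert>
#include <cstddef>

// First index owned by thread p out of N, for an array of size n.
// Thread 0 starts at 2 because 0 and 1 are not prime.
inline std::size_t lower_bound_of(std::size_t p, std::size_t N, std::size_t n) {
    assert(N > 0 && p < N && n >= N);
    return p == 0 ? 2 : p * (n / N);
}

// Last index (inclusive) owned by thread p; the last thread also takes
// the remainder left over when N does not divide n.
inline std::size_t upper_bound_of(std::size_t p, std::size_t N, std::size_t n) {
    assert(N > 0 && p < N && n >= N);
    return p == N - 1 ? n - 1 : (p + 1) * (n / N) - 1;
}
```

Unlike the macros, these evaluate each argument once and are not affected by operator precedence at the call site, so an expression such as lower_bound_of(p + 1, N, n) behaves as written.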

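Tags: c++ openmp sieve-of-eratosthenes

【Solution 1】:

I figured it out. It has nothing to do with parallel execution; I simply made an arithmetic mistake when computing the bounds.

Here is the correct formula for the lower bound: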

#define LOWER_BOUND(p, N, n) p == 0 ? 2 : p * (n / N)

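Note the parentheses around n/N. With integer division, p * n / N and p * (n / N) differ whenever N does not divide n, so the old lower bound did not line up with the upper bound (p + 1) * (n / N) - 1 of the previous thread, and some indices were never visited.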
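
The effect of the parentheses can be checked with a small helper that lists the indices no thread covers. This is a sketch for illustration only; uncovered_indices is a hypothetical name, and it reproduces the bound formulas from the question rather than the OpenMP code itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the indices in [2, n-1] that fall in no thread's range.
// corrected == false uses the original lower bound p * n / N,
// corrected == true uses the fixed lower bound p * (n / N).
std::vector<std::size_t> uncovered_indices(std::size_t N, std::size_t n, bool corrected) {
    assert(N > 0 && n >= N);
    std::vector<char> covered(n, 0);
    for (std::size_t p = 0; p < N; ++p) {
        std::size_t lo = p == 0 ? 2 : (corrected ? p * (n / N) : p * n / N);
        std::size_t hi = p == N - 1 ? n - 1 : (p + 1) * (n / N) - 1;
        for (std::size_t k = lo; k <= hi; ++k) covered[k] = 1;
    }
    std::vector<std::size_t> gaps;
    for (std::size_t k = 2; k < n; ++k)
        if (!covered[k]) gaps.push_back(k);
    return gaps;
}
```

For n = 10000 and N = 6, the original formula leaves 3332, 4998, 4999, 6664, 6665, 8330, 8331, and 8332 uncovered. Of these, only 4999 is prime, which is consistent with the reported count of 1228 instead of 1229. With the corrected formula, or with a thread count that divides n (such as 4), the list is empty.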

【Discussion】: