【Question title】: Parallelization: bad results with threads, good results with processes. Why?
【Posted】: 2013-05-04 22:37:57
【Question】:

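I ran into a problem where parallelizing a C program with threads does not really improve speed, while parallelizing it with processes does. I really don't understand why, so maybe someone can explain it. Here are two programs, both of which compute square roots about 10,000,000 times. First, the one with threads: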

//clang  threads.c -Wall -O3 -o with_threads

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <pthread.h>

#define ENTRIES 10485760
#define THREADS 8

int threads_no[THREADS];
int current = 0;

void* squareroot(void* offset) {
  int foo = ENTRIES / current;
  float *a = malloc(sizeof(float)*foo);

  for (int i = 0; i < ENTRIES / current; i++)
    a[i] = i + 1;

  clock_t s0 = clock();

  int i = 0;
  while (i < ENTRIES / current) {
    a[i] = sqrtf(a[i]);
    ++i;
  }
  printf("Thread %d spent %f calculating %d entries\n", *(int*)offset, ((double)(clock() - s0) / CLOCKS_PER_SEC), i);
  return NULL;
}

int main() {

  for (int t = 0; t < THREADS; t++)
    threads_no[t] = t;

  while (++current <= THREADS) {
    printf("With %d threads...\n", current);

    pthread_t threads[current];

    for (int t = 0; t < current; t++)
      pthread_create(&threads[t], NULL, squareroot, &threads_no[t]);

    for (int t = 0; t < current; t++)
      pthread_join(threads[t], NULL);
  }
  return 0;
}

...and the corresponding code with processes:

//clang  procs.c -Wall -O3 -o with_procs

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>  // declares wait()
#include <time.h>
#include <math.h>

#define ENTRIES 10485760
#define PROCS 8

int procs[PROCS];
int current = 0;

void* squareroot(void* offset) {
  int foo = ENTRIES / current;
  float *a = malloc(sizeof(float)*foo);

  for (int i = 0; i < ENTRIES / current; i++)
    a[i] = i + 1;

  clock_t s0 = clock();

  int i = 0;
  while (i < ENTRIES / current) {
    a[i] = sqrtf(a[i]);
    ++i;
  }
  printf("Process %d spent %f calculating %d entries\n", *(int*)offset, ((double)(clock() - s0) / CLOCKS_PER_SEC), i);
  return NULL;
}

int main() {

  for (int t = 0; t < PROCS; t++)
    procs[t] = t;

  printf("Single:\n");
  current = 1;
  squareroot(&procs[0]);
  printf("Parallel:\n");
  current = 0;

  while (++current <= PROCS) {
    printf("Wiht %d procs...\n", current);

    for (int i = 0, pid = 0; i < current; i++) {
      pid = fork();
      if (pid < 0) {
        printf("Error");
        exit(1);
      } else if (pid == 0) {
        squareroot(&procs[i]);
        exit(0); 
      }
    }
    for (int i = 0; i < current; i++)
      wait(NULL);
  }
  return 0;
}

On my machine (MacBook Air, Core i5 1.7 GHz), the results with threads are:

With 1 threads...
Thread 0 spent 0.030546 calculating 10485760 entries
With 2 threads...
Thread 1 spent 0.032468 calculating 5242880 entries
Thread 0 spent 0.037332 calculating 5242880 entries
With 3 threads...
Thread 0 spent 0.015804 calculating 3495253 entries
Thread 1 spent 0.026870 calculating 3495253 entries
Thread 2 spent 0.029845 calculating 3495253 entries
With 4 threads...
Thread 3 spent 0.037240 calculating 2621440 entries
Thread 0 spent 0.052195 calculating 2621440 entries
Thread 1 spent 0.056285 calculating 2621440 entries
Thread 2 spent 0.054233 calculating 2621440 entries
With 5 threads...
Thread 1 spent 0.026005 calculating 2097152 entries
Thread 3 spent 0.031361 calculating 2097152 entries
Thread 4 spent 0.041360 calculating 2097152 entries
Thread 2 spent 0.054898 calculating 2097152 entries
Thread 0 spent 0.034579 calculating 2097152 entries
With 6 threads...
Thread 2 spent 0.026277 calculating 1747626 entries
Thread 4 spent 0.029041 calculating 1747626 entries
Thread 1 spent 0.028271 calculating 1747626 entries
Thread 3 spent 0.018770 calculating 1747626 entries
Thread 5 spent 0.043817 calculating 1747626 entries
Thread 0 spent 0.019002 calculating 1747626 entries
With 7 threads...
Thread 0 spent 0.022857 calculating 1497965 entries
Thread 3 spent 0.050611 calculating 1497965 entries
Thread 5 spent 0.015109 calculating 1497965 entries
Thread 4 spent 0.028377 calculating 1497965 entries
Thread 1 spent 0.043619 calculating 1497965 entries
Thread 2 spent 0.071591 calculating 1497965 entries
Thread 6 spent 0.022199 calculating 1497965 entries
With 8 threads...
Thread 2 spent 0.039933 calculating 1310720 entries
Thread 5 spent 0.021614 calculating 1310720 entries
Thread 7 spent 0.062763 calculating 1310720 entries
Thread 3 spent 0.041014 calculating 1310720 entries
Thread 0 spent 0.033286 calculating 1310720 entries
Thread 6 spent 0.044050 calculating 1310720 entries
Thread 4 spent 0.082030 calculating 1310720 entries
Thread 1 spent 0.016579 calculating 1310720 entries

And with processes:

Single:
Process 0 spent 0.030531 calculating 10485760 entries
Parallel:
Wiht 1 procs...
Process 0 spent 0.030548 calculating 10485760 entries
Wiht 2 procs...
Process 0 spent 0.015946 calculating 5242880 entries
Process 1 spent 0.015995 calculating 5242880 entries
Wiht 3 procs...
Process 1 spent 0.012040 calculating 3495253 entries
Process 0 spent 0.014993 calculating 3495253 entries
Process 2 spent 0.016536 calculating 3495253 entries
Wiht 4 procs...
Process 1 spent 0.009256 calculating 2621440 entries
Process 2 spent 0.011725 calculating 2621440 entries
Process 0 spent 0.008604 calculating 2621440 entries
Process 3 spent 0.011057 calculating 2621440 entries
Wiht 5 procs...
Process 0 spent 0.007498 calculating 2097152 entries
Process 1 spent 0.008804 calculating 2097152 entries
Process 4 spent 0.008814 calculating 2097152 entries
Process 3 spent 0.010208 calculating 2097152 entries
Process 2 spent 0.009060 calculating 2097152 entries
Wiht 6 procs...
Process 1 spent 0.005633 calculating 1747626 entries
Process 2 spent 0.005553 calculating 1747626 entries
Process 0 spent 0.005950 calculating 1747626 entries
Process 4 spent 0.005977 calculating 1747626 entries
Process 3 spent 0.009157 calculating 1747626 entries
Process 5 spent 0.009563 calculating 1747626 entries
Wiht 7 procs...
Process 4 spent 0.005060 calculating 1497965 entries
Process 0 spent 0.005710 calculating 1497965 entries
Process 1 spent 0.004703 calculating 1497965 entries
Process 3 spent 0.005091 calculating 1497965 entries
Process 6 spent 0.007243 calculating 1497965 entries
Process 5 spent 0.004760 calculating 1497965 entries
Process 2 spent 0.005729 calculating 1497965 entries
Wiht 8 procs...
Process 0 spent 0.005995 calculating 1310720 entries
Process 1 spent 0.004285 calculating 1310720 entries
Process 2 spent 0.006809 calculating 1310720 entries
Process 7 spent 0.005404 calculating 1310720 entries
Process 3 spent 0.005978 calculating 1310720 entries
Process 5 spent 0.004108 calculating 1310720 entries
Process 6 spent 0.005336 calculating 1310720 entries
Process 4 spent 0.005409 calculating 1310720 entries

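With threads, there is always at least one thread that takes about as long as the single run, so there is no improvement. The processes seem much better balanced. I did not use any synchronization primitives for the threads, since they are not needed. Can someone explain why the two behave so differently? I googled for quite a while without any luck.

Thanks in advance.

Update: After taking the comments into account and measuring time with gettimeofday(2), the threaded implementation actually seems to behave correctly. For reference: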

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <pthread.h>
#include <sys/time.h>

#define ENTRIES 10485760
#define THREADS 8

int threads_no[THREADS];
int current = 0;

void* squareroot(void* offset) {
  int foo = ENTRIES / current;
  float *a = malloc(sizeof(float)*foo);

  for (int i = 0; i < ENTRIES / current; i++)
    a[i] = i + 1;

  clock_t s0 = clock();

  int i = 0;
  while (i < ENTRIES / current) {
    a[i] = sqrtf(a[i]);
    ++i;
  }
  // printf("Thread %d spent %f calculating %d entries\n", *(int*)offset, ((double)(clock() - s0) / CLOCKS_PER_SEC), i);
  return NULL;
}

int main() {

  for (int t = 0; t < THREADS; t++)
    threads_no[t] = t;

  struct timeval t1, t2;
  double elapsedTime;

  // start timer


  while (++current <= THREADS) {
    printf("With %d threads... ", current);
    gettimeofday(&t1, NULL);
    pthread_t threads[current];

    for (int t = 0; t < current; t++)
      pthread_create(&threads[t], NULL, squareroot, &threads_no[t]);

    for (int t = 0; t < current; t++)
      pthread_join(threads[t], NULL);
    gettimeofday(&t2, NULL);
    elapsedTime = (t2.tv_sec - t1.tv_sec) * 1000.0;      // sec to ms
    elapsedTime += (t2.tv_usec - t1.tv_usec) / 1000.0;   // us to ms
    printf("%f\n", elapsedTime);
  }
  return 0;
}

Best, Martin

【Comments】:

    Tags: c multithreading performance parallel-processing


    【Solution 1】:

    clock measures process time, not thread time. It is useless for measuring the performance of an individual thread.
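
To illustrate the point above, here is a minimal sketch (not the asker's code) that times the same threaded work with both clock() and gettimeofday(). The names `spin`, `time_workers`, and `struct timing` are made up for this example, and the busy loop is only a stand-in for the square-root work. It assumes a POSIX system with pthreads.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/time.h>
#include <time.h>

#define MAX_WORKERS 16

/* Hypothetical CPU-bound work: spin on a volatile counter so the
   compiler cannot remove the loop. */
static void *spin(void *arg) {
  unsigned long n = *(unsigned long *)arg;
  volatile unsigned long x = 0;
  for (unsigned long i = 0; i < n; i++)
    x += i;
  return NULL;
}

struct timing {
  double cpu_s;   /* from clock(): CPU time of the whole process */
  double wall_s;  /* from gettimeofday(): elapsed real time */
};

/* Split `total` iterations evenly over `nthreads` workers and time
   the whole run with both clocks. */
static struct timing time_workers(int nthreads, unsigned long total) {
  assert(nthreads >= 1 && nthreads <= MAX_WORKERS);
  pthread_t threads[MAX_WORKERS];
  unsigned long share = total / (unsigned long)nthreads;

  struct timeval t1, t2;
  gettimeofday(&t1, NULL);
  clock_t c0 = clock();

  for (int t = 0; t < nthreads; t++) {
    int rc = pthread_create(&threads[t], NULL, spin, &share);
    assert(rc == 0);
    (void)rc;
  }
  for (int t = 0; t < nthreads; t++)
    pthread_join(threads[t], NULL);

  clock_t c1 = clock();
  gettimeofday(&t2, NULL);

  struct timing r;
  r.cpu_s = (double)(c1 - c0) / CLOCKS_PER_SEC;
  r.wall_s = (double)(t2.tv_sec - t1.tv_sec) +
             (double)(t2.tv_usec - t1.tv_usec) / 1e6;
  return r;
}
```

On a multi-core machine one would expect `cpu_s` to stay roughly constant as `nthreads` grows (the total work is the same), while `wall_s` shrinks. That is consistent with the per-thread clock() numbers in the question looking as if nothing got faster.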

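    【Discussion】:

    • Thanks, that is the key insight. What would be a reliable way to measure performance, regardless of whether I split the work across threads or across processes?
    • I know of one that works on Linux (man clock_gettime) but not on Mac OS X.
    • I have access to a Linux system, so I will give it a try. Thanks!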
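
A sketch of what the clock_gettime suggestion could look like on Linux, using the `CLOCK_THREAD_CPUTIME_ID` clock so that each thread reads only its own CPU time. The helper names (`thread_cpu_seconds`, `timed_spin`, `run_jobs`, `struct job`) are hypothetical, and the busy loop again stands in for the real work.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime and CLOCK_THREAD_CPUTIME_ID */
#include <assert.h>
#include <pthread.h>
#include <time.h>

#define MAX_JOBS 16

struct job {
  unsigned long iterations;  /* input: how much work this thread does */
  double cpu_s;              /* output: CPU time consumed by this thread */
};

/* CPU time consumed so far by the calling thread only. */
static double thread_cpu_seconds(void) {
  struct timespec ts;
  int rc = clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
  assert(rc == 0);
  (void)rc;
  return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

static void *timed_spin(void *arg) {
  struct job *j = arg;
  double start = thread_cpu_seconds();
  volatile unsigned long x = 0;  /* volatile keeps the loop from being optimized out */
  for (unsigned long i = 0; i < j->iterations; i++)
    x += i;
  j->cpu_s = thread_cpu_seconds() - start;
  return NULL;
}

/* Start one thread per job, wait for all of them, and leave each
   thread's own CPU time in jobs[t].cpu_s. */
static void run_jobs(struct job *jobs, int n) {
  assert(n >= 1 && n <= MAX_JOBS);
  pthread_t threads[MAX_JOBS];
  for (int t = 0; t < n; t++) {
    int rc = pthread_create(&threads[t], NULL, timed_spin, &jobs[t]);
    assert(rc == 0);
    (void)rc;
  }
  for (int t = 0; t < n; t++)
    pthread_join(threads[t], NULL);
}
```

Unlike clock(), the value each thread stores here should not include time spent by its siblings, so the per-thread figures can be compared directly.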
    【Solution 2】:

    I think this might be related to the clock() call. On my system (without -O3 and with 8 times the data) I got the following:

    With 1 threads...
    Thread 0 spent 2.390000 calculating 83886080 entries
    With 2 threads...
    Thread 0 spent 2.390000 calculating 41943040 entries
    Thread 1 spent 2.380000 calculating 41943040 entries
    With 3 threads...
    Thread 0 spent 2.380000 calculating 27962026 entries
    Thread 1 spent 2.370000 calculating 27962026 entries
    Thread 2 spent 2.370000 calculating 27962026 entries
    With 4 threads...
    Thread 0 spent 2.370000 calculating 20971520 entries
    Thread 2 spent 2.380000 calculating 20971520 entries
    Thread 3 spent 2.260000 calculating 20971520 entries
    ...
    With 7 threads...
    Thread 1 spent 2.370000 calculating 11983725 entries
    Thread 4 spent 2.340000 calculating 11983725 entries
    Thread 0 spent 2.340000 calculating 11983725 entries
    Thread 6 spent 2.340000 calculating 11983725 entries
    ....
    With 8 threads...
    Thread 1 spent 2.320000 calculating 10485760 entries
    Thread 0 spent 2.330000 calculating 10485760 entries
    Thread 5 spent 2.350000 calculating 10485760 entries
    ....
    Thread 3 spent 2.060000 calculating 10485760 entries
    

    Now, looking at the clock() man page, it says:

    On several other implementations, the value returned by clock() also includes
    the times of any children whose status has been collected via wait(2) (or 
    another wait-type call).
    Linux does not include the times of waited-for children in the value returned by
    clock(). The times(2) function, which explicitly returns (separate) information 
    about the caller and its children, may be preferable.
    

    So maybe it is a timing-related issue?

    P.S. In my tests, the speedup was very noticeable.
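
The man page excerpt above points to times(2) as a way to separate the caller's CPU time from that of its children. A minimal sketch of that idea for the fork-based variant follows; `burn`, `fork_and_measure`, and `struct cpu_split` are names invented for this example, and the values returned are cumulative since the calling process started.

```c
#include <assert.h>
#include <sys/times.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

struct cpu_split {
  double self_s;      /* user + system CPU time of the calling process */
  double children_s;  /* user + system CPU time of waited-for children */
};

/* Hypothetical CPU-bound work for each child. */
static void burn(unsigned long n) {
  volatile unsigned long x = 0;
  for (unsigned long i = 0; i < n; i++)
    x += i;
}

/* Fork nprocs children that each run burn(iterations), wait for all
   of them, then ask times() for the parent/children split. */
static struct cpu_split fork_and_measure(int nprocs, unsigned long iterations) {
  for (int i = 0; i < nprocs; i++) {
    pid_t pid = fork();
    assert(pid >= 0);
    if (pid == 0) {
      burn(iterations);
      _exit(0);  /* child: leave without flushing the parent's stdio buffers */
    }
  }
  for (int i = 0; i < nprocs; i++)
    wait(NULL);

  struct tms t;
  clock_t rc = times(&t);
  assert(rc != (clock_t)-1);
  (void)rc;

  double ticks = (double)sysconf(_SC_CLK_TCK);  /* clock ticks per second */
  struct cpu_split r;
  r.self_s = (double)(t.tms_utime + t.tms_stime) / ticks;
  r.children_s = (double)(t.tms_cutime + t.tms_cstime) / ticks;
  return r;
}
```

Because the children's time is reported separately, this avoids the ambiguity the man page describes about whether clock() folds in waited-for children.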

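    【Discussion】:

    • Thanks for running the benchmark! This matches n.m.'s answer. I will look into how to properly measure time across threads.
    【Solution 3】:

    Did you really intend for "current" to be a global? You are mutating it while other threads are using it.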
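
One way to sidestep the question of the shared global entirely is to hand each thread its own parameters through the pthread_create argument. The sketch below is an assumption about how that could look, not a change the asker made; `struct chunk`, `fill`, and `run_chunks` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct chunk {
  int id;         /* input: thread index */
  int entries;    /* input: number of elements this thread handles */
  int processed;  /* output: number of elements actually handled */
};

/* Each thread reads its workload from its own struct chunk instead
   of dividing by a global counter. */
static void *fill(void *arg) {
  struct chunk *c = arg;
  c->processed = 0;
  float *a = malloc(sizeof(float) * (size_t)c->entries);
  if (a == NULL)
    return NULL;
  for (int i = 0; i < c->entries; i++)
    a[i] = (float)(i + 1);
  c->processed = c->entries;
  free(a);
  return NULL;
}

/* Split `total` entries over n threads; chunks must have room for n. */
static void run_chunks(struct chunk *chunks, int n, int total) {
  pthread_t *threads = malloc(sizeof(pthread_t) * (size_t)n);
  assert(threads != NULL);
  for (int t = 0; t < n; t++) {
    chunks[t].id = t;
    chunks[t].entries = total / n;
    int rc = pthread_create(&threads[t], NULL, fill, &chunks[t]);
    assert(rc == 0);
    (void)rc;
  }
  for (int t = 0; t < n; t++)
    pthread_join(threads[t], NULL);
  free(threads);
}
```

With this layout the loop counter in main can change freely, since no worker ever reads it.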

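    【Discussion】:

    • This is just code for debugging the different behavior of threads and processes. However, current is only modified before the threads are spawned and after they have been joined, so it should have no effect.
    • Well, yes, given some reasonable assumptions about the barrier semantics of the pthread functions. I think I misread it because I was reading on my phone earlier.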