Detached pthread data race from helgrind: answers

【Question title】: Detached pthread data race from helgrind
【Posted】: 2019-09-10 19:37:01
【Question description】:

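I have a larger multithreaded piece of software (proprietary, so I cannot share it) for which helgrind reports a data race (see the report below). I have devised some tests that demonstrate the race.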

The race reported for the actual software in question:

==7746== Possible data race during write of size 1 at 0xAC83697 by thread #4
==7746== Locks held: 2, at addresses 0x583BCD8 0x5846F58
==7746==    at 0x4C3A3CC: mempcpy (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==7746==    by 0x401375F: _dl_allocate_tls_init (dl-tls.c:515)
==7746==    by 0x5053CED: get_cached_stack (allocatestack.c:254)
==7746==    by 0x5053CED: allocate_stack (allocatestack.c:501)
==7746==    by 0x5053CED: pthread_create@@GLIBC_2.2.5 (pthread_create.c:539)
==7746==    by 0x4C34BB7: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==7746==    by 0x40BFA6: <redacted symbol names from private project>
==7746==    by 0x4C34DB6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==7746==    by 0x50536B9: start_thread (pthread_create.c:333)
==7746== 
==7746== This conflicts with a previous write of size 1 by thread #10
==7746== Locks held: none
==7746==    at 0x5053622: start_thread (pthread_create.c:265)
==7746==  Address 0xac83697 is in a rw- anonymous segment
==7746==

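This data race shows up when the software shuts down a series of threads and then starts some new threads in the same thread pool. Unfortunately I cannot provide any of that code, but I believe I have been able to reproduce the problem with a couple of examples.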

I found 3 other questions related to this problem:

Why does this recursive pthread_create call result in data race?

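The answer to the question above is to set up and allocate the stack manually. I don't think that is a viable answer; if it is, can someone explain why?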
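For readers unfamiliar with that workaround, giving a thread a manually allocated stack might look roughly like the sketch below. This is an illustration under assumptions, not the linked answer's code: `run_on_own_stack` and `STACK_SIZE` are hypothetical names, and the 1 MiB size is an arbitrary choice. The traces in this question pass through `get_cached_stack`, so the idea is presumably to keep glibc from reusing a cached stack; whether that actually silences helgrind is not verified here.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Assumed stack size; chosen to be well above PTHREAD_STACK_MIN on typical Linux. */
#define STACK_SIZE (1024 * 1024)

/* Trivial thread body, in the spirit of the question's examples. */
static void *threadfn(void *p)
{
    return p;
}

/*
 * Hypothetical helper: run threadfn once on a caller-allocated stack.
 * Returns 0 on success, or a non-zero error code.
 */
int run_on_own_stack(void)
{
    pthread_attr_t attr;
    pthread_t t;
    void *stack = NULL;
    long page = sysconf(_SC_PAGESIZE);
    int err;

    if (page <= 0)
        return -1;

    /* Allocate a page-aligned block to serve as the thread's stack. */
    err = posix_memalign(&stack, (size_t)page, STACK_SIZE);
    if (err != 0)
        return err;

    err = pthread_attr_init(&attr);
    if (err != 0) {
        free(stack);
        return err;
    }

    /* Tell pthread_create to use our block instead of allocating a stack. */
    err = pthread_attr_setstack(&attr, stack, STACK_SIZE);
    if (err == 0)
        err = pthread_create(&t, &attr, threadfn, NULL);
    pthread_attr_destroy(&attr);

    /* The stack may only be freed once the thread has finished. */
    if (err == 0)
        err = pthread_join(t, NULL);
    free(stack);
    return err;
}
```

The sketch joins the thread so that the stack can be freed safely. With a detached thread the caller would need some other way of knowing when the stack is no longer in use, which is one reason the approach is awkward for the situation described in this question.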

Data race during nested thread creation

The answer there has no effect.

Data race with detached pthread detected by valgrind

That question has no answers.

EDIT: I have added another (less complicated) example at the bottom of this post that also reproduces the problem.

I was able to rewrite the example given in the first question into a minimal reproducible example. Well, mostly.

On my machine (Ubuntu 16.04.6 LTS), the code below produces the following data race in roughly 85% of runs.

Run:

gcc -g ./test.c -o test -lpthread && valgrind --tool=helgrind ./test

==15656== Possible data race during write of size 1 at 0x5C27697 by thread #4
==15656== Locks held: none
==15656==    at 0x4C3A3CC: mempcpy (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==15656==    by 0x401375F: _dl_allocate_tls_init (dl-tls.c:515)
==15656==    by 0x4E47CED: get_cached_stack (allocatestack.c:254)
==15656==    by 0x4E47CED: allocate_stack (allocatestack.c:501)
==15656==    by 0x4E47CED: pthread_create@@GLIBC_2.2.5 (pthread_create.c:539)
==15656==    by 0x4C34BB7: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==15656==    by 0x400832: launch (test3.c:22)
==15656==    by 0x4008FC: threadfn3 (test3.c:48)
==15656==    by 0x4C34DB6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==15656==    by 0x4E476B9: start_thread (pthread_create.c:333)
==15656== 
==15656== This conflicts with a previous write of size 1 by thread #2
==15656== Locks held: none
==15656==    at 0x4E47622: start_thread (pthread_create.c:265)
==15656==  Address 0x5c27697 is in a rw- anonymous segment


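Here is the program I built to reproduce the problem. The semaphores are not required, but they seem to greatly increase the chance of the data race occurring.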

#include <semaphore.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

pthread_t t1;
pthread_t t2;
pthread_t t3;
pthread_t t4;

void *threadfn1(void *p);
void *threadfn2(void *p);
void *threadfn3(void *p);
void *threadfn4(void *p);

sem_t sem;
sem_t sem2;
sem_t sem3;

void launch(pthread_t *t, void *(*fn)(void *), void *arg)
{
    pthread_create(t,NULL,fn,arg);
    pthread_detach(*t);
}

void *threadfn1(void *p)
{
    launch(&t2, threadfn2, NULL);
    printf("1 %p\n", p);
    // notify threadfn3 we are done
    sem_post(&sem);
    return NULL;
}

void *threadfn2(void *p)
{
    launch(&t3, threadfn3, NULL);
    printf("2 %p\n", p);
    // notify threadfn4 we are done
    sem_post(&sem2);
    return NULL;
}

void *threadfn3(void *p)
{
    // wait for threadfn1 to finish
    sem_wait(&sem);
    launch(&t4, threadfn4, NULL);
    // wait for threadfn4 to finish
    sem_wait(&sem3);
    printf("3 %p\n", p);
    return NULL;
}

void *threadfn4(void *p)
{
    // wait for threadfn2 to finish
    sem_wait(&sem2);
    printf("4 %p\n", p);
    // notify threadfn3 we are done
    sem_post(&sem3);
    return NULL;
}

int main()
{
    sem_init(&sem, 0, 0);
    sem_init(&sem2, 0, 0);
    sem_init(&sem3, 0, 0);

    launch(&t1, threadfn1, NULL);
    printf("main\n");
    pthread_exit(NULL);
}

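This seems to be related to threads that finish before their parent thread, or their parent's parent, has finished... Ultimately I could not track down exactly what causes the data race.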

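It should also be noted that a second data race showed up a few times during my testing. I could not reproduce it reliably, because it appeared only occasionally and for no apparent reason. It is the same as the one listed above, except that the conflicting access shows a longer stack trace than just "start_thread". It looks exactly like the data race reported in the first question above, except that the bottom of its trace lists __libc_thread_freeres: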

==15973== Possible data race during write of size 1 at 0x5C27697 by thread #4
==15973== Locks held: none
==15973==    at 0x4C3A3CC: mempcpy (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==15973==    by 0x401375F: _dl_allocate_tls_init (dl-tls.c:515)
==15973==    by 0x4E47CED: get_cached_stack (allocatestack.c:254)
==15973==    by 0x4E47CED: allocate_stack (allocatestack.c:501)
==15973==    by 0x4E47CED: pthread_create@@GLIBC_2.2.5 (pthread_create.c:539)
==15973==    by 0x4C34BB7: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==15973==    by 0x400832: launch (test3.c:22)
==15973==    by 0x4008FC: threadfn3 (test3.c:48)
==15973==    by 0x4C34DB6: ??? (in /usr/lib/valgrind/vgpreload_helgrind-amd64-linux.so)
==15973==    by 0x4E476B9: start_thread (pthread_create.c:333)
==15973== 
==15973== This conflicts with a previous read of size 1 by thread #2
==15973== Locks held: none
==15973==    at 0x51C10B1: res_thread_freeres (in /lib/x86_64-linux-gnu/libc-2.19.so)
==15973==    by 0x51C1061: __libc_thread_freeres (in /lib/x86_64-linux-gnu/libc-2.19.so)
==15973==    by 0x4E45199: start_thread (pthread_create.c:329)
==15973==    by 0x515547C: clone (clone.S:111)

No, I cannot join the threads; that does not work for the software in which we are seeing the problem.

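UPDATE: I have been doing some more testing and managed to produce another example that triggers the problem with far less code. If you simply launch threads and detach them in a loop, the data race occurs.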

#include <pthread.h>
#include <stdio.h>

// seems we only need 3 threads to cause the problem
#define NUM_THREADS 3

pthread_t t1[NUM_THREADS] = {0};

void launch(pthread_t *t, void *(*fn)(void *), void *arg)
{
    pthread_create(t,NULL,fn,arg);
    pthread_detach(*t);
}

void *threadfn(void *p)
{
    return NULL;
}

int main()
{
    int i = NUM_THREADS;
    while (i-- > 0) {
        launch(t1 + i, threadfn, NULL);
    }
    return 0;
}

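UPDATE 2: I found that if you launch all of the threads before detaching any of them, the race no longer seems to appear. See the following code block, which does not produce the race: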

#include <pthread.h>

#define NUM_THREADS 3

pthread_t t1[NUM_THREADS] = {0};

void launch(pthread_t *t, void *(*fn)(void *), void *arg)
{
    pthread_create(t,NULL,fn,arg);
}

void *threadfn(void *p)
{
    return NULL;
}

int main()
{
    int i;
    for (i = 0; i < NUM_THREADS; ++i) {
        launch(t1 + i, threadfn, NULL);
    }
    for (i = 0; i < NUM_THREADS; ++i) {
        pthread_detach(t1[i]);
    }
    pthread_exit(NULL);
}

If you add another pthread_create() call after any pthread_detach() call, the race reappears. This makes me think it is impossible to call pthread_detach() and then pthread_create() without causing a data race.

【Question discussion】:

Tags: c linux pthreads race-condition

【Solution 1】:

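In the end I simply restructured everything so that my threads are joined. I really don't know how detached threads can be used without causing a data race.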
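A minimal sketch of what joining instead of detaching could look like, reusing NUM_THREADS and threadfn from the question's examples. The helper name `run_joined` is hypothetical and the real software's structure is not shown in the post, so treat this as an illustration of the pattern only.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 3

/* Trivial thread body, as in the question's examples. */
static void *threadfn(void *p)
{
    return p;
}

/*
 * Hypothetical helper: start NUM_THREADS threads and join every one of them
 * instead of detaching. Returns 0 on success, or the first pthread error code.
 */
int run_joined(void)
{
    pthread_t threads[NUM_THREADS];
    int started = 0;
    int err = 0;
    int i;

    for (i = 0; i < NUM_THREADS; ++i) {
        err = pthread_create(&threads[i], NULL, threadfn, NULL);
        if (err != 0)
            break;
        ++started;
    }

    /* Join every thread that was actually started, even after a failure. */
    for (i = 0; i < started; ++i) {
        int jerr = pthread_join(threads[i], NULL);
        if (err == 0)
            err = jerr;
    }
    return err;
}
```

Calling `run_joined()` from `main` and returning its result is enough to try this under `valgrind --tool=helgrind`. Because each thread is joined before the function returns, its resources are only reclaimed after the joining thread has synchronized with its exit.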

【Discussion】: