[Posted]: 2023-12-10 17:21:02
[Question]:
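I'm a hobbyist programmer experimenting with pthreads to see how much a multi-threaded program can speed up a rather long computation I'm working on. The computation runs through a std::list object: it pops the first element off the list and hands it to a thread that computes something with it. The program keeps track of the active threads and makes sure a fixed number of them is always running. Once the list is empty, the program sorts the resulting data, dumps a data file, and terminates.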
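The multi-threaded version of the program currently doesn't work. It gets 20 or 40 or 200 or so elements into the list (depending on the list I give it) and then segfaults. The segfault seems to happen on specific elements of the list, meaning the crashes do not appear to be random in any way.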
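The strange thing is that if I compile with debugging symbols and run the program through gdb, it does not segfault. It runs perfectly. Slowly, of course, but it runs and does everything the way I expect.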
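After mulling over everyone's suggestions for a while, I used (among other things) valgrind's tools to try to work out what is going on. I noticed that the simplified code below (which makes no calls outside the std library and the pthread library) gives helgrind trouble, and this may be the root of my problem. So here is just the simplified code, along with helgrind's complaints.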
#include <cstdlib>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string>
#include <list>
#include <iostream>
#include <signal.h>
#include <sys/select.h>

struct thread_detail {
    pthread_t *threadID;
    unsigned long num;
};

pthread_mutex_t coutLock;

void *ThreadToSpawn(void *threadarg)
{
    struct thread_detail *my_data;
    my_data = (struct thread_detail *) threadarg;
    int taskid = my_data->num;
    struct timeval timeout;
    for (unsigned long i=0; i < 10; i++)
    {
        timeout.tv_sec = 0; timeout.tv_usec = 500000; // half-second
        select( 0, NULL, NULL, NULL, & timeout );
        pthread_mutex_lock(&coutLock);
        std::cout << taskid << " "; std::cout.flush();
        pthread_mutex_unlock(&coutLock);
    }
    pthread_exit(NULL);
}

int main (int argc, char *argv[])
{
    unsigned long comp_DONE=0;
    unsigned long comp_START=0;
    unsigned long ms_LAG=10000; // microsecond lag between polling of threads

    // set-up the mutexes
    pthread_mutex_init( &coutLock, NULL );

    if (argc != 3) { std::cout << "Program requires two arguments: (1) number of threads to use,"
        " and (2) tasks to accomplish. \n"; exit(1); }
    unsigned long NUM_THREADS(atoi( argv[1] ));
    unsigned long comp_TODO(atoi(argv[2]));
    std::cout << "Program will have " << NUM_THREADS << " threads. \n";

    std::list < thread_detail > thread_table;

    while (comp_DONE != comp_TODO) // main loop to set-up and track threads
    {
        // poll stack of computations to see if any have finished,
        // extract data and remove completed ones from stack
        std::list < thread_detail >::iterator i(thread_table.begin());
        while (i!=thread_table.end())
        {
            if (pthread_kill(*i->threadID,0)!=0) // thread is dead
            { // if there was relevant info in *i we'd extract it here
                if (pthread_join(*i->threadID, NULL)!=0) { std::cout << "Thread join error!\n"; exit(1); }
                pthread_mutex_lock(&coutLock);
                std::cout << i->num << " done. "; std::cout.flush();
                pthread_mutex_unlock(&coutLock);
                delete i->threadID;
                thread_table.erase(i++);
                comp_DONE++;
            }
            else (i++);
        }

        // if list not full, toss another on the pile
        while ( (thread_table.size() < NUM_THREADS) && (comp_TODO > comp_START) )
        {
            pthread_t *tId( new pthread_t );
            thread_detail Y; Y.threadID=tId; Y.num=comp_START;
            thread_table.push_back(Y);
            int rc( pthread_create( tId, NULL, ThreadToSpawn, (void *)(&(thread_table.back() )) ) );
            if (rc) { printf("ERROR; return code from pthread_create() is %d\n", rc); exit(-1); }
            pthread_mutex_lock(&coutLock);
            std::cout << comp_START << " start. "; std::cout.flush();
            pthread_mutex_unlock(&coutLock);
            comp_START++;
        }

        // wait a specified amount of time
        struct timeval timeout;
        timeout.tv_sec = 0; timeout.tv_usec = ms_LAG;
        select( 0, NULL, NULL, NULL, & timeout );
    } // the big while loop

    pthread_exit(NULL);
}
Helgrind output:
==2849== Helgrind, a thread error detector
==2849== Copyright (C) 2007-2009, and GNU GPL'd, by OpenWorks LLP et al.
==2849== Using Valgrind-3.6.0.SVN-Debian and LibVEX; rerun with -h for copyright info
==2849== Command: ./thread2 2 6
==2849==
Program will have 2 threads.
==2849== Thread #2 was created
==2849== at 0x64276BE: clone (clone.S:77)
==2849== by 0x555E172: pthread_create@@GLIBC_2.2.5 (createthread.c:75)
==2849== by 0x4C2D42C: pthread_create_WRK (hg_intercepts.c:230)
==2849== by 0x4C2D4CF: pthread_create@* (hg_intercepts.c:257)
==2849== by 0x401374: main (in /home/rybu/prog/regina/exercise/thread2)
==2849==
==2849== Thread #1 is the program's root thread
==2849==
==2849== Possible data race during write of size 8 at 0x7feffffe0 by thread #2
==2849== at 0x4C2D54C: mythread_wrapper (hg_intercepts.c:200)
==2849== This conflicts with a previous read of size 8 by thread #1
==2849== at 0x4C2D440: pthread_create_WRK (hg_intercepts.c:235)
==2849== by 0x4C2D4CF: pthread_create@* (hg_intercepts.c:257)
==2849== by 0x401374: main (in /home/rybu/prog/regina/exercise/thread2)
==2849==
[0 start.] [1 start.] 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 [0 done.] [1 done.] [2 start.] [3 start.] 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 [2 done.] [3 done.] [4 start.] [5 start.] 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 4 5 [4 done.] [5 done.] ==2849==
==2849== For counts of detected and suppressed errors, rerun with: -v
==2849== Use --history-level=approx or =none to gain increased speed, at
==2849== the cost of reduced accuracy of conflicting-access information
==2849== ERROR SUMMARY: 6 errors from 1 contexts (suppressed: 675 from 37)
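Presumably I'm using pthreads incorrectly, but it's not clear to me what I'm doing wrong. I'm also not sure what to make of the helgrind output. Earlier, helgrind complained because I had not called pthread_join on threads that the code knew, for other reasons, to be dead. Adding the pthread_join took care of those complaints.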
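For comparison, here is a minimal sketch of one way to detect finished threads without probing them with pthread_kill(..., 0): each worker sets a mutex-protected flag just before it returns, and the main thread joins only once the flag is set. The names task_state, worker, is_done, and run_and_wait are hypothetical and do not appear in the code above, and doubling a number stands in for the real computation. This shows an alternative polling scheme under those assumptions; it is not a diagnosis of the segfault.

```cpp
#include <cassert>
#include <pthread.h>
#include <sys/select.h>

// Hypothetical per-task record; `done` and `result` are guarded by `lock`.
struct task_state {
    pthread_t tid;
    pthread_mutex_t lock;
    bool done;
    unsigned long num;
    unsigned long result;
};

void *worker(void *arg)
{
    task_state *t = static_cast<task_state *>(arg);
    unsigned long r = t->num * 2;       // stand-in for the real computation
    pthread_mutex_lock(&t->lock);
    t->result = r;
    t->done = true;                     // publish completion under the lock
    pthread_mutex_unlock(&t->lock);
    return NULL;
}

// Read the completion flag under the same lock the worker uses.
bool is_done(task_state *t)
{
    pthread_mutex_lock(&t->lock);
    bool d = t->done;
    pthread_mutex_unlock(&t->lock);
    return d;
}

// Start one worker, poll its flag, join it, and return its result.
// Returns 0 if the thread could not be created.
unsigned long run_and_wait(unsigned long num)
{
    task_state t;
    pthread_mutex_init(&t.lock, NULL);
    t.done = false;
    t.num = num;
    t.result = 0;

    if (pthread_create(&t.tid, NULL, worker, &t) != 0) {
        pthread_mutex_destroy(&t.lock);
        return 0;
    }
    while (!is_done(&t)) {
        struct timeval timeout;
        timeout.tv_sec = 0; timeout.tv_usec = 1000;   // 1 ms between polls
        select(0, NULL, NULL, NULL, &timeout);
    }
    int rc = pthread_join(t.tid, NULL); // the thread has finished or is about to
    assert(rc == 0);
    (void)rc;
    pthread_mutex_destroy(&t.lock);
    return t.result;                    // safe to read after the join
}
```

In the polling loop of the program above, the same idea would mean adding a flag and a lock to thread_detail and testing the flag in place of the pthread_kill call.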
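After reading various pthread tutorials online, I've found that creating and destroying this many threads, as the code above does, is probably pointless. It would likely be more efficient to have N threads running concurrently, use mutexes and shared memory to pass data between the "BOSS" thread and the "WORKER" threads, and kill the WORKER threads only once, at the end of the program. So that's something I'll eventually have to adjust, but is there anything obviously wrong with the code above?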
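The BOSS/WORKER arrangement described here could be sketched roughly as follows: N workers are started once, each repeatedly pops an item from a shared std::list under a mutex, and each exits when the list is empty. The names work_queue, pool_worker, and run_pool are hypothetical, and squaring a number stands in for the real computation. This is a sketch under those assumptions, not a full thread pool.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <pthread.h>
#include <vector>

// Hypothetical shared state; both containers are guarded by `lock`.
struct work_queue {
    pthread_mutex_t lock;
    std::list<unsigned long> todo;
    std::vector<unsigned long> results;
};

void *pool_worker(void *arg)
{
    work_queue *q = static_cast<work_queue *>(arg);
    for (;;) {
        pthread_mutex_lock(&q->lock);
        if (q->todo.empty()) {              // nothing left: this worker exits
            pthread_mutex_unlock(&q->lock);
            return NULL;
        }
        unsigned long item = q->todo.front();
        q->todo.pop_front();
        pthread_mutex_unlock(&q->lock);

        unsigned long r = item * item;      // compute outside the lock

        pthread_mutex_lock(&q->lock);
        q->results.push_back(r);
        pthread_mutex_unlock(&q->lock);
    }
}

// Fill the queue with 0..num_tasks-1, run the workers, return the sum of results.
unsigned long run_pool(unsigned num_threads, unsigned long num_tasks)
{
    work_queue q;
    pthread_mutex_init(&q.lock, NULL);
    for (unsigned long i = 0; i < num_tasks; ++i) q.todo.push_back(i);

    std::vector<pthread_t> tids;
    for (unsigned i = 0; i < num_threads; ++i) {
        pthread_t tid;
        if (pthread_create(&tid, NULL, pool_worker, &q) == 0)
            tids.push_back(tid);
    }
    for (std::size_t i = 0; i < tids.size(); ++i)
        pthread_join(tids[i], NULL);        // each worker is joined exactly once

    // If at least one worker ran, the queue has been drained.
    assert(tids.empty() || q.todo.empty());
    pthread_mutex_destroy(&q.lock);

    unsigned long sum = 0;
    for (std::size_t i = 0; i < q.results.size(); ++i) sum += q.results[i];
    return sum;
}
```

Because the queue is filled before any worker starts, "list is empty" is a sufficient exit condition here. A pool that receives work while it is running would also need a condition variable or similar signal so that idle workers can wait for new items.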
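Edit: I'm noticing some keywords more and more often. The term for what I'm trying to create is apparently a thread pool. There are also various proposals for standard implementations of this; for example, in the boost library there are boost::threadpool, boost::task, and boost::thread. Some of these appear to be only proposals. I've come across threads here where people mention that you can combine ASIO and boost::thread to accomplish what I'm looking for. Similarly, there is a message queue class.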
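Hmm, so it seems I'm just scratching at a topic that a lot of people are thinking about these days, but one that still seems to be in its infancy, much like OOP was around 1989.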
[Comments]:
-
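Too much code / not enough code: try to reduce it to the smallest sample that still exhibits the problem; you may well find the bug in the process.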
-
Try fixing all the races that helgrind identifies, then try again.
-
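Most of the races you are seeing in that test program right now occur because you access stdio (std::cout) from several threads at once; put some locks around the stdout calls to clear them up.
-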
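Thanks. That eliminated everything except the first helgrind complaint. I see that one quite often, even in other people's code that seems to be well regarded, such as Daniel Robbins's code: ibm.com/developerworks/linux/library/l-posix3/…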