Answers: best way to use multithreading to process the lines of a number of files

【Question title】: Best way of multithreading to process the lines of a number of files
【Posted】: 2017-08-07 11:12:42
【Question】:

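I have a number of separate files. I want to process every line of each file (sequentially and independently), and I want it to be fast.

So I wrote code that reads a big chunk of a file into an in-memory buffer, after which multiple threads compete to read lines from the buffer and process them. The pseudocode is as follows: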

do {
  do {
    fread(buffer, 500MB, 1, file);
    // create threads
    // let the threads compete to read from buffer and PROCESS independently
    // end of threads
  } while( EOF not reached )
  file = nextfile;
} while( there is another file to read )

Or this:

void mt_ReadAndProcess(){
  lock();
  fread(buffer,50MB,1,file);
  if(EOF reached)
    file = nextfile;
  unlock();
  process();
}
main(){
  // create multi threads
  // call mt_ReadAndProcess() with multi threads
}
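
A minimal C sketch of this second variant, assuming POSIX threads. It is not the asker's code: it reads one line at a time with fgets() under the lock rather than a 50MB fread(), which avoids chunks that end in the middle of a line. The names process_file_mt, shared_input and process_line are hypothetical, and process_line is only a stand-in for the expensive work.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define MAX_LINE_LEN 1024   /* assumed upper bound on line length */
#define MAX_WORKERS  64     /* arbitrary cap for this sketch */

/* One input stream shared by all workers, guarded by one mutex. */
struct shared_input {
    FILE *fp;
    pthread_mutex_t lock;
    long lines_read;        /* updated only while holding the lock */
};

/* Hypothetical stand-in for the expensive per-line processing. */
static size_t process_line(const char *line)
{
    return strlen(line);
}

/* Copy one line into a private buffer under the lock, then release the
   lock before processing so that the processing runs in parallel. */
static void *worker(void *arg)
{
    struct shared_input *in = arg;
    char line[MAX_LINE_LEN];

    for (;;) {
        pthread_mutex_lock(&in->lock);
        char *got = fgets(line, sizeof line, in->fp);
        if (got)
            in->lines_read++;
        pthread_mutex_unlock(&in->lock);

        if (!got)
            break;                  /* EOF or read error */
        (void) process_line(line);
    }
    return NULL;
}

/* Runs nthreads workers over fp; returns the number of fgets() reads,
   which equals the line count when no line exceeds MAX_LINE_LEN - 1. */
long process_file_mt(FILE *fp, int nthreads)
{
    struct shared_input in = { .fp = fp, .lines_read = 0 };
    pthread_t tids[MAX_WORKERS];
    int started = 0;

    assert(fp != NULL && nthreads > 0 && nthreads <= MAX_WORKERS);
    pthread_mutex_init(&in.lock, NULL);

    for (int i = 0; i < nthreads; i++) {
        if (pthread_create(&tids[started], NULL, worker, &in) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    pthread_mutex_destroy(&in.lock);
    return in.lines_read;
}
```

The lock is held only for the read, so this pays off only when process_line costs much more than fgets(); with cheap processing the threads would mostly wait on the mutex.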

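process() is an expensive operation (in terms of time).

Is there a better way to do this? A faster way to read the files, or a better way to process them with multiple threads?

Thanks all,

Amir.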

【Discussion】:

Tags: multithreading io fread

【Solution 1】:

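Why do you want the threads to "compete to read from the buffer"? The data can easily be partitioned as it is read, by the thread doing the reading. Contending for data from a buffer gains nothing, and it likely wastes both CPU and wall-clock time.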
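
One way to partition an in-memory chunk without contention can be sketched as below. This is an illustration under assumptions, not code from the answer: the helper name partition_lines and its cuts output array are hypothetical. It cuts the buffer into slices that each end right after a newline, so every worker gets whole lines and no locking is needed while processing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits buf[0..len) into nparts contiguous slices, each ending just
   after a '\n' (or at len). Writes nparts + 1 offsets into cuts, so
   slice i is [cuts[i], cuts[i + 1]). A slice may be empty when the
   buffer holds fewer lines than nparts. */
void partition_lines(const char *buf, size_t len, size_t nparts, size_t *cuts)
{
    assert(buf != NULL && cuts != NULL && nparts > 0);

    cuts[0] = 0;
    for (size_t i = 1; i < nparts; i++) {
        /* Aim for an even split, but never move behind the previous cut. */
        size_t target = (len / nparts) * i;
        if (target < cuts[i - 1])
            target = cuts[i - 1];

        /* Advance the cut to just past the next newline, if there is one. */
        const char *nl = memchr(buf + target, '\n', len - target);
        cuts[i] = nl ? (size_t) (nl - buf) + 1 : len;
    }
    cuts[nparts] = len;
}
```

Worker i would then process only the bytes in [cuts[i], cuts[i + 1]), so the reading thread decides the split once and the workers never touch shared state.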

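Since you are processing line by line, just read lines from the file and pass the buffers by pointer to the worker threads.

Assuming you are running on a POSIX-compliant system, something like this: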

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_LINE_LEN 1024
#define NUM_THREADS 8

// linePipe holds pointers to lines sent to
// worker threads
static int linePipe[ 2 ];

// bufferPipe holds pointers to buffers returned
// from worker threads and used to read data
static int bufferPipe[ 2 ];

// thread function that actually does the work
void *threadFunc( void *arg )
{
    const char *linePtr;

    for ( ;; )
    {
        // get a pointer to a line from the read end of the line pipe
        read( linePipe[ 0 ], &linePtr, sizeof( linePtr ) );

        // end loop on NULL linePtr value
        if ( !linePtr )
        {
            break;
        }

        // process line

        // return the buffer through the write end of the buffer pipe
        write( bufferPipe[ 1 ], &linePtr, sizeof( linePtr ) );
    }

    return( NULL );
}

int main( int argc, char **argv )
{
    // pipe() stores the read end in [ 0 ] and the write end in [ 1 ]
    pipe( linePipe );
    pipe( bufferPipe );

    // create buffers and load them into the buffer pipe for reading
    for ( int ii = 0; ii < ( 2 * NUM_THREADS ); ii++ )
    {
        char *buffer = malloc( MAX_LINE_LEN );
        write( bufferPipe[ 1 ], &buffer, sizeof( buffer ) );
    }

    pthread_t tids[ NUM_THREADS ];
    for ( int ii = 0; ii < NUM_THREADS; ii++ )
    {
        pthread_create( &( tids[ ii ] ), NULL, threadFunc, NULL );
    }

    FILE *fp = ...

    for ( ;; )
    {
        char *linePtr;

        // get the pointer to a buffer from the buffer pipe
        read( bufferPipe[ 0 ], &linePtr, sizeof( linePtr ) );

        // read a line from the current file into the buffer
        char *result = fgets( linePtr, MAX_LINE_LEN, fp );

        if ( result )
        {
            // send the line to the worker threads
            write( linePipe[ 1 ], &linePtr, sizeof( linePtr ) );
        }
        else
        {
            // nothing was read into this buffer, so put it back in the pool
            write( bufferPipe[ 1 ], &linePtr, sizeof( linePtr ) );

            // either end loop, or open another file
            fclose( fp );
            fp = fopen( ... );
            if ( !fp )
            {
                break;
            }
        }
    }

    // clean up and exit

    // send NULL to cause worker threads to stop
    char *nullPtr = NULL;
    for ( int ii = 0; ii < NUM_THREADS; ii++ )
    {
        write( linePipe[ 1 ], &nullPtr, sizeof( nullPtr ) );
    }

    // wait for worker threads to stop
    for ( int ii = 0; ii < NUM_THREADS; ii++ )
    {
        pthread_join( tids[ ii ], NULL );
    }

    return( 0 );
}
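
The buffer handoff in this design depends on two POSIX facts: pipe() puts the read end in element 0 and the write end in element 1, and a write of at most PIPE_BUF bytes (a pointer is far smaller) is atomic, so pointers from different threads are never interleaved. A small sketch of the buffer pool on its own, with error checks added, might look like the following. The names pool_init, pool_get and pool_put are hypothetical and do not appear in the answer.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Returns a buffer to the pool by writing its address to the write end. */
int pool_put(int pool[2], char *buf)
{
    assert(buf != NULL);
    return write(pool[1], &buf, sizeof buf) == (ssize_t) sizeof buf ? 0 : -1;
}

/* Takes a buffer from the read end; blocks while the pool is empty.
   Returns NULL if the read fails or the pipe is closed. */
char *pool_get(int pool[2])
{
    char *buf = NULL;
    if (read(pool[0], &buf, sizeof buf) != (ssize_t) sizeof buf)
        return NULL;
    return buf;
}

/* Creates the pipe and preloads it with count buffers of size bytes.
   Returns 0 on success and -1 on failure. */
int pool_init(int pool[2], int count, size_t size)
{
    if (pipe(pool) != 0)
        return -1;
    for (int i = 0; i < count; i++) {
        char *buf = malloc(size);
        if (buf == NULL)
            return -1;
        if (pool_put(pool, buf) != 0) {
            free(buf);
            return -1;
        }
    }
    return 0;
}
```

Because pool_get blocks on an empty pipe, the reader thread slows down on its own when the workers fall behind, which bounds memory use to the preloaded buffers.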

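【Comments】:

You are right. It is better to let the threads do the reading themselves. I had the same idea in my second example: each thread reads a chunk of the file into its own buffer. In that case, can you tell me whether there is a speed problem? Or is there a better idea?
"As you can see in the post below, using fread() to read a big chunk (or block) of a file at once is faster than reading that chunk line by line!" Really? Do you think you will be able to write code that is as fast and as reliable as what the developers of your operating system's libraries wrote? Do you really think you can write better, faster code for splitting a text file into separate lines? Do you know how fread() actually reads data? How does a call to fread() translate into one or more actual read() system calls?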
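
For context on the read() question: on a fully buffered stream, fgets() takes its data from a user-space buffer that the C library refills with larger read() calls, so reading line by line does not mean one system call per line. The sketch below, an illustration rather than anything posted in the thread, uses the standard setvbuf() call to request a bigger stdio buffer before counting lines. The function name count_lines_buffered is hypothetical, and whether a larger buffer helps at all is something to measure on the target system.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Opens path, asks stdio for a bufsize-byte fully buffered stream, and
   counts newline-terminated lines using fgets(). Returns -1 on failure. */
long count_lines_buffered(const char *path, size_t bufsize)
{
    char line[1024];
    long count = 0;

    assert(path != NULL && bufsize > 0);

    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;

    /* setvbuf() has to be called before any other operation on the stream. */
    if (setvbuf(fp, NULL, _IOFBF, bufsize) != 0) {
        fclose(fp);
        return -1;
    }

    /* Each fgets() is served from the stdio buffer; only a refill of that
       buffer goes to the operating system. */
    while (fgets(line, sizeof line, fp) != NULL) {
        if (strchr(line, '\n') != NULL)
            count++;
    }

    fclose(fp);
    return count;
}
```

Timing this with a few different bufsize values against a single large fread() would give a direct comparison of the two approaches on a given machine.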
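So you can write some simple code to test it: read the whole file at once, and then read it line by line!!
Please look at the top answer (Adam's) to this post and give me your comments, thanks: stackoverflow.com/questions/24851291/…
Two lines of your code cause errors, could you check them? char *buffer = malloc( MAX_LINE_LEN ); and write( linePipe, &linePtr, sizeof( linePtr ) ); both give "invalid conversion".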
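
The thread does not say which compiler was used, but "invalid conversion" is the wording g++ uses, so the code was probably compiled as C++. Under that assumption, the first error comes from malloc() returning void *, which C++ will not convert implicitly, and the second from passing the whole linePipe array where write() expects a single int file descriptor. A hedged sketch of both statements in a form that compiles as C and as C++ follows. The helper names alloc_line_buffer and queue_line are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

#define MAX_LINE_LEN 1024

/* Allocates one line buffer. The explicit cast is required in C++ and
   is harmless in C. */
char *alloc_line_buffer(void)
{
    return (char *) malloc(MAX_LINE_LEN);
}

/* Queues a line pointer for the workers. write() takes one descriptor,
   so the write end linePipe[1] is passed, not the array itself. */
int queue_line(int linePipe[2], char *linePtr)
{
    assert(linePtr != NULL);
    ssize_t n = write(linePipe[1], &linePtr, sizeof linePtr);
    return n == (ssize_t) sizeof linePtr ? 0 : -1;
}
```

If the program is meant to be C, compiling it with a C compiler (for example gcc rather than g++) removes the first error without the cast; the second is an error in either language.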