How long does thread creation and termination take under Windows? (Answers)

[Question title]: How long does thread creation and termination take under Windows?
[Posted]: 2013-08-18 21:55:15
[Question]:

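I've split a complex array-processing task into a number of threads to take advantage of multi-core processing, and I'm seeing great benefits. Currently, at the start of the task I create the threads and then wait for them to terminate as they complete their work. I typically create about four times as many threads as there are cores, because each thread is liable to take a different amount of time, and having extra threads ensures all cores are kept occupied most of the time. I was wondering whether there would be much of a performance advantage in creating the threads when the program starts up, keeping them idle until required, and using them when I start processing. Put more simply: how long does it take to start and end a new thread, over and above the processing done inside the thread?

I currently start the threads using: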

CWinThread *pMyThread = AfxBeginThread(CMyThreadFunc,&MyData,THREAD_PRIORITY_NORMAL);

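Typically I would use 32 threads across 8 cores on a 64-bit architecture. The process in question currently takes ...

The related question here is helpful, but a bit vague for what I'm after. Any feedback is appreciated.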

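[Comments on the question]:

Display refresh sounds like a frequent event, so pooling the threads and keeping them idle for reuse would certainly be beneficial. Thread creation overhead may not be too heavy, but there is still the synchronization overhead, the virtual memory footprint, and so on.
Not an answer, but if your task is CPU bound, running 32 threads across 8 cores is not going to be a good solution. You would probably do better with a number closer to the actual number of hardware threads available. I'd suggest creating a thread pool up front; whatever it costs to create the threads, you'll probably want to reuse them :)
@DavidRodríguez-dribeas, I experimented with different numbers of threads and found that having more threads than the maximum concurrency works best in this case. The reason is that the total run time is determined by when the last thread finishes, and each thread has a different amount of work that is hard to estimate in advance. If one thread has significantly more work than the others, then with only one thread per core it slows the whole process down, because the cores that have finished their work sit idle.
@ShaneMacLaughlin: +1 for measuring and understanding the behavior.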
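
One way to get the load balancing discussed in these comments without creating more threads than cores is to let each worker claim small chunks of the array from a shared counter, so a thread that finishes early simply picks up more work. The following is only a minimal sketch in portable C++ (std::thread rather than AfxBeginThread); the names parallel_for_chunks, chunkSize and processItem are made up for illustration and do not come from the question.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Process itemCount items with workerCount threads. Instead of handing each
// thread a fixed slice up front, every worker repeatedly claims the next
// chunk from a shared atomic counter until no items are left.
// (All names here are hypothetical, chosen for this sketch.)
void parallel_for_chunks(std::size_t itemCount,
                         std::size_t chunkSize,
                         unsigned workerCount,
                         const std::function<void(std::size_t)>& processItem)
{
    if (chunkSize == 0) chunkSize = 1;
    if (workerCount == 0) workerCount = 1;

    std::atomic<std::size_t> next{0};  // index of the first unclaimed item

    auto worker = [&]() {
        for (;;) {
            // Atomically claim the range [begin, begin + chunkSize).
            const std::size_t begin = next.fetch_add(chunkSize);
            if (begin >= itemCount) break;  // nothing left to claim
            const std::size_t end = std::min(begin + chunkSize, itemCount);
            for (std::size_t i = begin; i < end; ++i) processItem(i);
        }
    };

    std::vector<std::thread> workers;
    workers.reserve(workerCount);
    for (unsigned w = 0; w < workerCount; ++w) workers.emplace_back(worker);
    for (auto& t : workers) t.join();
}
```

A call such as parallel_for_chunks(data.size(), 64, std::thread::hardware_concurrency(), fn) would then use one thread per hardware thread. Smaller chunks balance uneven work better but add more contention on the counter, so the chunk size is something to measure rather than guess.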

Tags: c++ multithreading performance

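[Answer 1]:

I wrote this quite a while ago, when I had the same basic question (along with another one that will be obvious). I've updated it to show not only how long it takes to create threads, but also how long it takes for the threads to start executing: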

#include <windows.h>
#include <iostream>
#include <time.h>
#include <vector>

const int num_threads = 32;

const int switches_per_thread = 100000;

DWORD __stdcall ThreadProc(void *start) {
    QueryPerformanceCounter((LARGE_INTEGER *) start);
    for (int i=0;i<switches_per_thread; i++)
        Sleep(0);
    return 0;
}

int main(void) {
    HANDLE threads[num_threads];
    DWORD junk;

    std::vector<LARGE_INTEGER> start_times(num_threads);

    LARGE_INTEGER l;
    QueryPerformanceCounter(&l);

    clock_t create_start = clock();
    for (int i=0;i<num_threads; i++)
        threads[i] = CreateThread(NULL, 
                            0, 
                            ThreadProc, 
                            (void *)&start_times[i], 
                            0, 
                            &junk);
    clock_t create_end = clock();

    clock_t wait_start = clock();
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    clock_t wait_end = clock();

    double create_millis = 1000.0 * (create_end - create_start) / CLOCKS_PER_SEC / num_threads;
    std::cout << "Milliseconds to create thread: " << create_millis << "\n";
    double wait_clocks = (wait_end - wait_start);
    double switches = switches_per_thread*num_threads;
    double us_per_switch = wait_clocks/CLOCKS_PER_SEC*1000000/switches;
    std::cout << "Microseconds per thread switch: " << us_per_switch << "\n";

    LARGE_INTEGER f;
    QueryPerformanceFrequency(&f);

    for (auto s : start_times) 
        std::cout << 1000.0 * (s.QuadPart - l.QuadPart) / f.QuadPart <<" ms\n";

    return 0;
}

Sample results:

Milliseconds to create thread: 0.015625
Microseconds per thread switch: 0.0479687

The first few thread start times look like this:

0.0632517 ms
0.117348 ms
0.143703 ms
0.18282 ms
0.209174 ms
0.232478 ms
0.263826 ms
0.315149 ms
0.324026 ms
0.331516 ms
0.3956 ms
0.408639 ms
0.4214 ms

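Note that although these happen to be monotonically increasing, that is not guaranteed (though there is definitely a trend in that general direction).

When I first wrote this, the units I used made more sense: on a 33 MHz 486, the results weren't tiny fractions like these. :-) I suppose that someday, when I'm feeling ambitious, I should rewrite this to use std::async to create the threads and std::chrono to do the timing, but...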
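
For readers who want the std::chrono flavor of that rewrite, here is a rough, portable sketch; it is not the answer author's code. It deliberately uses std::thread instead of std::async, on the assumption that some std::async implementations may run tasks on pooled threads, which would hide the creation cost. The names LaunchTiming and measure_thread_start are invented for this example. It records, per thread, how long the creation call itself took and how long it took until the first line of the thread function ran.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;

// Hypothetical result type for this sketch.
struct LaunchTiming {
    double launch_call_ms;    // time spent inside the thread-creation call
    double start_latency_ms;  // from just before creation to the thread's first line
};

std::vector<LaunchTiming> measure_thread_start(std::size_t threadCount)
{
    std::vector<Clock::time_point> before(threadCount);
    std::vector<Clock::time_point> after(threadCount);
    std::vector<Clock::time_point> firstLine(threadCount);

    std::vector<std::thread> threads;
    threads.reserve(threadCount);

    for (std::size_t i = 0; i < threadCount; ++i) {
        before[i] = Clock::now();
        // Each thread writes only its own slot, so no locking is needed.
        threads.emplace_back([&firstLine, i] { firstLine[i] = Clock::now(); });
        after[i] = Clock::now();
    }

    // join() makes every thread's timestamp visible to this thread.
    for (auto& t : threads) t.join();

    using ms = std::chrono::duration<double, std::milli>;
    std::vector<LaunchTiming> result;
    result.reserve(threadCount);
    for (std::size_t i = 0; i < threadCount; ++i) {
        result.push_back({ ms(after[i] - before[i]).count(),
                           ms(firstLine[i] - before[i]).count() });
    }
    return result;
}
```

Calling measure_thread_start(32) and printing both fields for each entry gives numbers comparable in spirit to the ones above, though the absolute values will depend on the machine, the OS and the standard library.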

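[Comments]:

Jerry Coffin, your code is badly wrong. You are not measuring time correctly. Thread creation is an asynchronous operation. When you take "create_end", you are measuring the time taken to request 64 threads, not the time taken to actually create them. The same goes for the switch-time measurement. Your code is completely wrong and will only confuse others. Please fix it or delete it.
Jerry, "CreateThread" is an asynchronous method. The loop that creates the 64 threads (64 calls to CreateThread) can finish even before the first thread has been created. Let me give you an example. How long does it take to get a green card for the USA? Probably months or years. But it takes only a few minutes to send a letter to US immigration requesting a green card. After you send the letter, the legal deliberation over granting you a green card happens asynchronously. Your code measures the time it takes to send the letter.
Correct solution: the first line in ThreadProc() should be clock(). Each thread writes a timestamp into an array of 64 cells when it actually starts. Once the threads have run, the main thread reads this array and subtracts its "create_start". You see, a valid measurement is the interval between two timestamps: one taken before the call to CreateThread() and one taken on the first line of ThreadProc(). That way you know exactly how long it takes to start a thread.
P.S. Your argument about suspended threads is irrelevant. Your code never measures what it intends to. Also, the switch measurement is incorrect, because you are actually counting the time it takes to kill all 64 threads.
@DanielHsH, do you have a reference for a compiler that removes Sleep(0)? That seems like an unusual optimization to me. Given that Sleep(0) gives up the remainder of the thread's time slice, removing it would obviously affect the performance of a multithreaded program.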

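[Answer 2]:

Some suggestions:

1. If you have lots of work items to process (or not that many, but you have to repeat the whole process from time to time), make sure you use some kind of thread pool. That way you won't have to recreate threads all the time, and your original question stops mattering: the threads are created only once. I use the QueueUserWorkItem API directly (since my application doesn't use MFC), and even that isn't too painful. In MFC you may have higher-level facilities for taking advantage of thread pooling. (http://support.microsoft.com/kb/197728)
2. Try to pick the optimal amount of work for one work item. Of course this depends on what your software does: is it supposed to be real time, or is it number crunching in the background? If it isn't real time, then too little work per work item can hurt performance, because the overhead of distributing work across threads becomes a larger share of the total.
3. Since hardware configurations can differ widely, if your end users can have all sorts of machines you can run some calibration routines asynchronously during software startup, so that you can estimate how long certain operations take. The calibration results can then serve as input for choosing better work-size settings for the real calculations later.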

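[Comments]:

[Answer 3]:

I was curious about the modern Windows scheduler, so I wrote another test app. I made my best attempt at measuring thread stop time by optionally spinning up an observing thread.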

// Tested on Windows 10 v1903 with E5-1660 v3 @ 3.00GHz, 8 Core(s), 16 Logical Processor(s)
// Times are (min, average, max) in milliseconds.

threads: 100, iterations: 1, testStop: true
Start(0.1083, 5.3665, 13.7103) - Stop(0.0341, 1.5122, 11.0660)

threads: 32, iterations: 3, testStop: true
Start(0.1349, 1.6423, 3.5561) - Stop(0.0396, 0.2877, 3.5195)
Start(0.1093, 1.4992, 3.3982) - Stop(0.0351, 0.2734, 2.0384)
Start(0.1159, 1.5345, 3.5754) - Stop(0.0378, 0.4938, 3.2216)

threads: 4, iterations: 3, testStop: true
Start(0.2066, 0.3553, 0.4598) - Stop(0.0410, 0.1534, 0.4630)
Start(0.2769, 0.3740, 0.4994) - Stop(0.0414, 0.1028, 0.2581)
Start(0.2342, 0.3602, 0.5650) - Stop(0.0497, 0.2199, 0.3620)

threads: 4, iterations: 3, testStop: false
Start(0.1698, 0.2492, 0.3713)
Start(0.1473, 0.2477, 0.4103)
Start(0.1756, 0.2909, 0.4295)

threads: 1, iterations: 10, testStop: false
Start(0.1910, 0.1910, 0.1910)
Start(0.1685, 0.1685, 0.1685)
Start(0.1564, 0.1564, 0.1564)
Start(0.1504, 0.1504, 0.1504)
Start(0.1389, 0.1389, 0.1389)
Start(0.1234, 0.1234, 0.1234)
Start(0.1550, 0.1550, 0.1550)
Start(0.2800, 0.2800, 0.2800)
Start(0.1587, 0.1587, 0.1587)
Start(0.1877, 0.1877, 0.1877)

Source:

#include <windows.h>
#include <iostream>
#include <vector>
#include <chrono>
#include <iomanip>

using namespace std::chrono;

struct Test
{
    HANDLE Thread = { 0 };
    time_point<steady_clock> Creation;
    time_point<steady_clock> Started;
    time_point<steady_clock> Stopped;
};

DWORD __stdcall ThreadProc(void* lpParameter) {
    auto test = (Test*)lpParameter;
    test->Started = steady_clock::now();
    return 0;
}

DWORD __stdcall TestThreadsEnded(void* lpParameter) {
    auto& tests = *(std::vector<Test>*)lpParameter;

    std::size_t finished = 0;
    while (finished < tests.size())
    {
        for (auto& test : tests)
        {
            if (test.Thread != NULL && WaitForSingleObject(test.Thread, 0) == WAIT_OBJECT_0)
            {
                test.Stopped = steady_clock::now();
                test.Thread = NULL;
                finished++;
            }
        }
    }

    return 0;
}

duration<double, std::milli> diff(time_point<steady_clock> start, time_point<steady_clock> stop)
{
    return stop - start;
}

struct Stats
{
    double min;
    double average;
    double max;
};

Stats stats(const std::vector<double>& durations)
{
    Stats stats = { 1000, 0, 0 };

    for (auto& duration : durations)
    {
        stats.min = duration < stats.min ? duration : stats.min;
        stats.max = duration > stats.max ? duration : stats.max;
        stats.average += duration;
    }

    stats.average /= durations.size();

    return stats;
}

void TestScheduler(const int threadCount, const int iterations, const bool testStop)
{
    std::cout << "\nthreads: " << threadCount << ", iterations: " << iterations << ", testStop: " << (testStop ? "true" : "false") << "\n";

    for (auto i = 0; i < iterations; i++)
    {
        std::vector<Test> tests(threadCount);
        HANDLE testThreadsEnded = NULL;

        if (testStop)
        {
            testThreadsEnded = CreateThread(NULL, 0, TestThreadsEnded, (void*)& tests, 0, NULL);
        }

        for (auto& test : tests)
        {
            test.Creation = steady_clock::now();
            test.Thread = CreateThread(NULL, 0, ThreadProc, (void*)& test, 0, NULL);
        }

        if (testStop)
        {
            WaitForSingleObject(testThreadsEnded, INFINITE);
        }
        else
        {
            std::vector<HANDLE> threads;
            for (auto& test : tests) threads.push_back(test.Thread);
            WaitForMultipleObjects((DWORD)threads.size(), threads.data(), TRUE, INFINITE);
        }

        std::vector<double> startDurations;
        std::vector<double> stopDurations;
        for (auto& test : tests)
        {
            startDurations.push_back(diff(test.Creation, test.Started).count());
            stopDurations.push_back(diff(test.Started, test.Stopped).count());
        }

        auto startStats = stats(startDurations);
        auto stopStats = stats(stopDurations);

        std::cout << std::fixed << std::setprecision(4);
        std::cout << "Start(" << startStats.min << ", " << startStats.average << ", " << startStats.max << ")";
        if (testStop)
        {
            std::cout << " - ";
            std::cout << "Stop(" << stopStats.min << ", " << stopStats.average << ", " << stopStats.max << ")";
        }
        std::cout << "\n";
    }
}

int main(void)
{
    TestScheduler(100, 1, true);
    TestScheduler(32, 3, true);
    TestScheduler(4, 3, true);
    TestScheduler(4, 3, false);
    TestScheduler(1, 10, false);
    return 0;
}
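
As a cross-check on the stop-time numbers, a different and simpler technique can be sketched in portable C++: run one thread at a time and measure the gap between the last statement of the thread function and the moment join() returns in the creating thread. This is not the observer-thread approach used above, and measure_stop_latency_ms is a name invented for this sketch. The 2 ms sleep is an assumption meant to ensure the creator is already blocked inside join() when the thread finishes, so the result covers thread teardown plus the wake-up of the waiting thread.

```cpp
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

using SteadyClock = std::chrono::steady_clock;

// Returns, for each iteration, the time in milliseconds between the last
// statement of a thread function and join() returning in the creator.
// (Hypothetical helper for illustration; threads run one at a time.)
std::vector<double> measure_stop_latency_ms(std::size_t iterations)
{
    std::vector<double> latencies;
    latencies.reserve(iterations);

    for (std::size_t i = 0; i < iterations; ++i) {
        SteadyClock::time_point lastLine;

        std::thread t([&lastLine] {
            // Give the creator time to block inside join() first.
            std::this_thread::sleep_for(std::chrono::milliseconds(2));
            lastLine = SteadyClock::now();  // last statement of the thread
        });

        t.join();  // also makes lastLine visible to this thread
        const SteadyClock::time_point joined = SteadyClock::now();

        latencies.push_back(
            std::chrono::duration<double, std::milli>(joined - lastLine).count());
    }
    return latencies;
}
```

Because the threads here never run concurrently, the figures are best compared with the single-thread rows rather than with the 32 or 100 thread runs, where scheduling contention dominates.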

[Comments]: