【Question title】: Performance problem with global variables when using parallel code
【Posted】: 2021-03-03 17:27:10
【Question description】:

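I have some code that runs many tasks concurrently and has a performance problem. For this question I created a simplified version that shows the same problem.

In this simplified code, two tasks are run concurrently with Parallel.ForEach. Each task iterates over a long for loop and modifies an integer variable on every iteration. If both tasks modify a local integer variable, or if one modifies a local variable and the other a global variable, the parallel code takes only about half as long as the serial code (about 4.5 seconds serial versus about 2.5 seconds parallel). But if both tasks modify different global integer variables on every iteration, or if one task modifies a global variable while the other reads it, the parallel code is slower (about 5.0 seconds serial versus about 7.5 seconds parallel). The two tasks modify different variables (of an atomic data type, even), so I would not expect any kind of race condition, yet something fishy is clearly going on.

I would like to know what is going on, and whether the solution is to change the algorithm (in this simple code, the algorithm is the for loop that modifies the variable) so that the global variables are modified less often, or whether there is a trick or something I have overlooked that solves the problem without changing the algorithm.

The code is as follows: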

using System.Diagnostics;
using System.Threading.Tasks;
using System;

public class Program
{
    static void Main()
    {
        Program prog = new Program();
    }

    int intField1;
    int intField2;

    public Program()
    {
        this.intField1 = 0;
        this.intField2 = 0;
        Stopwatch watch = new Stopwatch();

        //Here we evaluate a task, 
        //normal serial Evaluation
        Console.WriteLine("serial evaluation");
        watch.Start();
        for (int j = 0; j < 2; j++)
        {
            this.TaskThatTakesFewSeconds(j);
        }
        Console.WriteLine("Elapsed milliseconds: " + watch.ElapsedMilliseconds);
        watch.Stop();

        this.intField1 = 0;
        this.intField2 = 0;

        watch = new Stopwatch();
        Console.WriteLine("parallel evaluation");
        watch.Start();
        //parallel Evaluation
        int[] loops = new int[2] { 0, 1 };
        Parallel.ForEach(loops, x =>
            this.TaskThatTakesFewSeconds(x)
        );
        Console.WriteLine("Elapsed milliseconds: " + watch.ElapsedMilliseconds);
        watch.Stop();
    }

    public void TaskThatTakesFewSeconds(int k)
    {
        int localVariable = 0;
        if (k == 0)
        {
            for (ulong j = 0; j < 1000000000; j++)
            {
                //leave one of the next two lines commented
                //localVariable++;
                this.intField1++;
            }
        }
        else
        {
            for (ulong j = 0; j < 1000000000; j++)
            {
                //leave one of the next two lines commented
                //localVariable++;
                this.intField2++;
            }
        }
    }
}
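
The question raises the idea of changing the loop so that the fields are written less often. A minimal sketch of that variant follows, assuming the loop body really is just a counter: each task counts in a local variable and stores the result in its field once, after the loop. The class name `LocalAccumulation`, the method `CountInLocal`, and the `iterations` parameter are made up for this illustration and are not part of the original code. No timings are claimed here; measure on your own machine.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public class LocalAccumulation
{
    // Same two fields as in the question's Program class (made public for the demo).
    public int IntField1;
    public int IntField2;

    // Hypothetical variant of TaskThatTakesFewSeconds: the loop only touches
    // a local variable, and the field is written a single time at the end.
    public void CountInLocal(int k, ulong iterations)
    {
        int local = 0;
        for (ulong j = 0; j < iterations; j++)
        {
            local++;
        }

        if (k == 0)
            IntField1 = local;
        else
            IntField2 = local;
    }

    // Runs both tasks with Parallel.ForEach, as in the question, and prints
    // the elapsed time together with the final field values.
    public static void Demo()
    {
        var prog = new LocalAccumulation();
        var watch = Stopwatch.StartNew();
        Parallel.ForEach(new[] { 0, 1 }, k => prog.CountInLocal(k, 1_000_000_000UL));
        watch.Stop();
        Console.WriteLine("Elapsed milliseconds: " + watch.ElapsedMilliseconds);
        Console.WriteLine("intField1 = " + prog.IntField1 + ", intField2 = " + prog.IntField2);
    }
}
```

Calling `LocalAccumulation.Demo()` from `Main` runs the parallel version. Whether this restructuring is possible in the real application depends on whether each task can keep its intermediate state local.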

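【Question discussion】:

  • Either way, ++ on a global variable from two threads is a race condition. In your real code, is either of the two variables declared volatile, or are you using Interlocked?
  • Your parallel code does not work correctly because you access shared state in an unsafe way, so the performance problem is irrelevant. Performance is only worth thinking about once the code actually works.
  • @Charlieface In my application code, the variables I use are mostly of type double or Vector<double> (the latter from the Math.Net library). These are not atomic data types, so they cannot be declared volatile and Interlocked cannot be used either. But it is good to know that changes to two different global variables can also trigger a race condition; I did not know that (although I suspected it after playing around with the code).
  • @Servy In my real application (which basically evaluates many relatively small linear algebra computations), every task modifies different variables. Regardless of the computation, the result of the serial evaluation always matches the parallel result 100%; only the performance is worse, as I said. If each task only modifies different variables, how could the parallelized code fail to work correctly? In this simple code example, the results of the serial and parallel runs are always the same.
  • @lennartgro Then this is not a representative example of your real problem. Your question here is specifically about mutating shared state, so if your real example does not do that, that is a problem. As for the results, the nature of race conditions means they do not break consistently and reliably; they merely have the potential to behave in a number of different ways depending on the order in which the various operations actually end up running. That it worked when you ran it once does not mean there is no race condition bug.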
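
To illustrate the Interlocked suggestion from the comments, here is a small sketch that assumes both workers increment one and the same counter, which is not what the question's code does (there each task has its own field). The class `SharedCounterDemo` and its methods are hypothetical names for this illustration. With plain `++` the two threads can overwrite each other's updates; `Interlocked.Increment` performs the read-modify-write atomically.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class SharedCounterDemo
{
    // Two workers increment the SAME variable with ++.
    // ++ is a read, an add and a write, so updates can be lost and the
    // result may be smaller than 2 * perWorker.
    public static int UnsafeCount(int perWorker)
    {
        int counter = 0;
        Parallel.For(0, 2, _ =>
        {
            for (int i = 0; i < perWorker; i++)
                counter++;
        });
        return counter;
    }

    // Same work, but every increment is atomic, so no update is lost.
    public static int InterlockedCount(int perWorker)
    {
        int counter = 0;
        Parallel.For(0, 2, _ =>
        {
            for (int i = 0; i < perWorker; i++)
                Interlocked.Increment(ref counter);
        });
        return counter;
    }
}
```

`SharedCounterDemo.InterlockedCount(1000000)` returns 2000000 on every run, while `UnsafeCount` may return less, and a single correct-looking run proves nothing. The atomic version is also expected to be noticeably slower per increment, so it fixes correctness, not the performance problem asked about.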

Tags: c# performance task-parallel-library race-condition parallel.foreach


【Solution 1】:

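I would strongly advise against using Stopwatch for performance evaluation.
I encourage you to use Benchmark.NET (BenchmarkDotNet) instead.

  • It is an easy-to-use yet powerful micro-benchmarking tool.

Let me show you how the test environment should be set up.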

ITest

This interface defines the common surface of every test case

public interface ITest
{
    void Execute();
}

Computation

This class contains the common logic

public class Computation
{
    private int intField1;
    private int intField2;
    public void TaskThatTakesFewSeconds(int k)
    {
        if (k == 0)
        {
            for (ulong j = 0; j < 1000000000; j++)
            {
                intField1++;
            }
        }
        else
        {
            for (ulong j = 0; j < 1000000000; j++)
            {
                intField2++;
            }
        }
    }
}

SequentialTest

This class contains the implementation variant that executes the two operations sequentially

public class SequentialTest: ITest
{
    private readonly Computation _comp;
    public SequentialTest()
    {
        this._comp = new Computation();
    }
    public void Execute()
    {
        for (int j = 0; j < 2; j++)
        {
            this._comp.TaskThatTakesFewSeconds(j);
        }
    }
}

ParallelForeachTest and ParallelInvokeTest

These classes contain different implementation variants.
In both classes the operations are executed concurrently

// Parallel lives in System.Threading.Tasks
using System.Threading.Tasks;

public class ParallelForeachTest: ITest
{
    private readonly Computation _comp;

    public ParallelForeachTest()
    {
        _comp = new Computation();
    }

    public void Execute()
    {
        var loops = new [] { 0, 1 };
        Parallel.ForEach(loops, this._comp.TaskThatTakesFewSeconds);
    }
}
public class ParallelInvokeTest: ITest
{
    private readonly Computation _comp;

    public ParallelInvokeTest()
    {
        _comp = new Computation();
    }

    public void Execute()
    {
        Parallel.Invoke(
            () => this._comp.TaskThatTakesFewSeconds(0), 
            () => this._comp.TaskThatTakesFewSeconds(1));
    }
}

TestCase

This class is responsible for setting up the experiment

using BenchmarkDotNet.Attributes;

[HtmlExporter]
[MemoryDiagnoser]
[SimpleJob(BenchmarkDotNet.Engines.RunStrategy.ColdStart, targetCount: 5)]
public class TestCase
{
    [Benchmark(Baseline = true)]
    public void RunBaseLine() => RunExperiment<SequentialTest>();

    [Benchmark]
    public void RunParallelForEach() => RunExperiment<ParallelForeachTest>();

    [Benchmark]
    public void RunParallelInvoke() => RunExperiment<ParallelInvokeTest>();

    internal void RunExperiment<T>() where T : ITest, new()
    {
        new T().Execute();
    }
}
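  • MemoryDiagnoser: the benchmark also monitors memory usage.
  • SimpleJob: here we define the iterations.
    • RunStrategy.ColdStart: no pilot or warmup iterations are run before the measured ones.
    • 5 iterations are measured during the experiment.
  • Benchmark(Baseline = true): the sequential variant is used as the baseline.
    • All other implementations are reported relative to it (Ratio).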

Program

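The entry point of the console application, where we launch the experiment

using System;
using BenchmarkDotNet.Running;

class Program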
{
    static void Main(string[] args)
    {
        BenchmarkRunner.Run<TestCase>();
        Console.ReadLine();
    }
}

Note: make sure you run this experiment with the application built in Release mode.

Results on my laptop:

TL;DR

|             Method |    Mean |    Error |   StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------- |--------:|---------:|---------:|------:|------:|------:|------:|----------:|
|        RunBaseLine | 2.711 s | 0.0307 s | 0.0080 s |  1.00 |     - |     - |     - |     432 B |
| RunParallelForEach | 1.944 s | 0.1432 s | 0.0372 s |  0.72 |     - |     - |     - |    2696 B |
|  RunParallelInvoke | 1.975 s | 0.1283 s | 0.0333 s |  0.73 |     - |     - |     - |     856 B |

Full output

// Validating benchmarks:
// ***** BenchmarkRunner: Start   *****
// ***** Found 3 benchmark(s) in total *****
// ***** Building 1 exe(s) in Parallel: Start   *****
// start dotnet restore  /p:UseSharedCompilation=false /p:BuildInParallel=false /m:1 in C:\...\MySimpleBenchmark\bin\Release\netcoreapp3.1\7dd97576-8d82-459f-8018-efbdb1d641bc
// command took 1.35s and exited with 0
// start dotnet build -c Release  --no-restore /p:UseSharedCompilation=false /p:BuildInParallel=false /m:1 in C:\...\MySimpleBenchmark\bin\Release\netcoreapp3.1\7dd97576-8d82-459f-8018-efbdb1d641bc
// command took 2.23s and exited with 0
// ***** Done, took 00:00:03 (3.7 sec)   *****
// Found 3 benchmarks:
//   TestCase.RunBaseLine: Job-RWBPOP(IterationCount=5, RunStrategy=ColdStart)
//   TestCase.RunParallelForEach: Job-RWBPOP(IterationCount=5, RunStrategy=ColdStart)
//   TestCase.RunParallelInvoke: Job-RWBPOP(IterationCount=5, RunStrategy=ColdStart)

// **************************
// Benchmark: TestCase.RunBaseLine: Job-RWBPOP(IterationCount=5, RunStrategy=ColdStart)
// *** Execute ***
// Launch: 1 / 1
// Execute: dotnet "7dd97576-8d82-459f-8018-efbdb1d641bc.dll" --benchmarkName "MySimpleBenchmark.TestCase.RunBaseLine" --job "IterationCount=5, RunStrategy=ColdStart" --benchmarkId 0 in C:\...\MySimpleBenchmark\bin\Release\netcoreapp3.1\7dd97576-8d82-459f-8018-efbdb1d641bc\bin\Release\netcoreapp3.1
// BeforeAnythingElse

// Benchmark Process Environment Information:
// Runtime=.NET Core 3.1.12 (CoreCLR 4.700.21.6504, CoreFX 4.700.21.6905), X64 RyuJIT
// GC=Concurrent Workstation
// Job: Job-MCAVLE(IterationCount=5, RunStrategy=ColdStart)

// BeforeActualRun
WorkloadActual   1: 1 op, 2703012600.00 ns, 2.7030 s/op
WorkloadActual   2: 1 op, 2722115500.00 ns, 2.7221 s/op
WorkloadActual   3: 1 op, 2714919500.00 ns, 2.7149 s/op
WorkloadActual   4: 1 op, 2704378800.00 ns, 2.7044 s/op
WorkloadActual   5: 1 op, 2708101600.00 ns, 2.7081 s/op

// AfterActualRun
WorkloadResult   1: 1 op, 2703012600.00 ns, 2.7030 s/op
WorkloadResult   2: 1 op, 2722115500.00 ns, 2.7221 s/op
WorkloadResult   3: 1 op, 2714919500.00 ns, 2.7149 s/op
WorkloadResult   4: 1 op, 2704378800.00 ns, 2.7044 s/op
WorkloadResult   5: 1 op, 2708101600.00 ns, 2.7081 s/op
GC:  0 0 0 432 1
Threading:  2 0 1

// AfterAll
// Benchmark Process 30236 has exited with code 0

Mean = 2.711 s, StdErr = 0.004 s (0.13%), N = 5, StdDev = 0.008 s
Min = 2.703 s, Q1 = 2.704 s, Median = 2.708 s, Q3 = 2.715 s, Max = 2.722 s
IQR = 0.011 s, LowerFence = 2.689 s, UpperFence = 2.731 s
ConfidenceInterval = [2.680 s; 2.741 s] (CI 99.9%), Margin = 0.031 s (1.13% of Mean)
Skewness = 0.39, Kurtosis = 1.15, MValue = 2

// **************************
// Benchmark: TestCase.RunParallelForEach: Job-RWBPOP(IterationCount=5, RunStrategy=ColdStart)
// *** Execute ***
// Launch: 1 / 1
// Execute: dotnet "7dd97576-8d82-459f-8018-efbdb1d641bc.dll" --benchmarkName "MySimpleBenchmark.TestCase.RunParallelForEach" --job "IterationCount=5, RunStrategy=ColdStart" --benchmarkId 1 in C:\...\MySimpleBenchmark\bin\Release\netcoreapp3.1\7dd97576-8d82-459f-8018-efbdb1d641bc\bin\Release\netcoreapp3.1
// BeforeAnythingElse

// Benchmark Process Environment Information:
// Runtime=.NET Core 3.1.12 (CoreCLR 4.700.21.6504, CoreFX 4.700.21.6905), X64 RyuJIT
// GC=Concurrent Workstation
// Job: Job-OWEIXV(IterationCount=5, RunStrategy=ColdStart)

// BeforeActualRun
WorkloadActual   1: 1 op, 1885435300.00 ns, 1.8854 s/op
WorkloadActual   2: 1 op, 1951180900.00 ns, 1.9512 s/op
WorkloadActual   3: 1 op, 1989053900.00 ns, 1.9891 s/op
WorkloadActual   4: 1 op, 1944026900.00 ns, 1.9440 s/op
WorkloadActual   5: 1 op, 1948992000.00 ns, 1.9490 s/op

// AfterActualRun
WorkloadResult   1: 1 op, 1885435300.00 ns, 1.8854 s/op
WorkloadResult   2: 1 op, 1951180900.00 ns, 1.9512 s/op
WorkloadResult   3: 1 op, 1989053900.00 ns, 1.9891 s/op
WorkloadResult   4: 1 op, 1944026900.00 ns, 1.9440 s/op
WorkloadResult   5: 1 op, 1948992000.00 ns, 1.9490 s/op
GC:  0 0 0 2696 1
Threading:  6 0 1

// AfterAll
// Benchmark Process 21660 has exited with code 0

Mean = 1.944 s, StdErr = 0.017 s (0.86%), N = 5, StdDev = 0.037 s
Min = 1.885 s, Q1 = 1.944 s, Median = 1.949 s, Q3 = 1.951 s, Max = 1.989 s
IQR = 0.007 s, LowerFence = 1.933 s, UpperFence = 1.962 s
ConfidenceInterval = [1.800 s; 2.087 s] (CI 99.9%), Margin = 0.143 s (7.37% of Mean)
Skewness = -0.41, Kurtosis = 1.65, MValue = 2

// **************************
// Benchmark: TestCase.RunParallelInvoke: Job-RWBPOP(IterationCount=5, RunStrategy=ColdStart)
// *** Execute ***
// Launch: 1 / 1
// Execute: dotnet "7dd97576-8d82-459f-8018-efbdb1d641bc.dll" --benchmarkName "MySimpleBenchmark.TestCase.RunParallelInvoke" --job "IterationCount=5, RunStrategy=ColdStart" --benchmarkId 2 in C:\...\MySimpleBenchmark\bin\Release\netcoreapp3.1\7dd97576-8d82-459f-8018-efbdb1d641bc\bin\Release\netcoreapp3.1
// BeforeAnythingElse

// Benchmark Process Environment Information:
// Runtime=.NET Core 3.1.12 (CoreCLR 4.700.21.6504, CoreFX 4.700.21.6905), X64 RyuJIT
// GC=Concurrent Workstation
// Job: Job-XWLZBM(IterationCount=5, RunStrategy=ColdStart)

// BeforeActualRun
WorkloadActual   1: 1 op, 1976197800.00 ns, 1.9762 s/op
WorkloadActual   2: 1 op, 1972453200.00 ns, 1.9725 s/op
WorkloadActual   3: 1 op, 1967741600.00 ns, 1.9677 s/op
WorkloadActual   4: 1 op, 2026004200.00 ns, 2.0260 s/op
WorkloadActual   5: 1 op, 1932835200.00 ns, 1.9328 s/op

// AfterActualRun
WorkloadResult   1: 1 op, 1976197800.00 ns, 1.9762 s/op
WorkloadResult   2: 1 op, 1972453200.00 ns, 1.9725 s/op
WorkloadResult   3: 1 op, 1967741600.00 ns, 1.9677 s/op
WorkloadResult   4: 1 op, 2026004200.00 ns, 2.0260 s/op
WorkloadResult   5: 1 op, 1932835200.00 ns, 1.9328 s/op
GC:  0 0 0 856 1
Threading:  3 0 1

// AfterAll
// Benchmark Process 11348 has exited with code 0

Mean = 1.975 s, StdErr = 0.015 s (0.75%), N = 5, StdDev = 0.033 s
Min = 1.933 s, Q1 = 1.968 s, Median = 1.972 s, Q3 = 1.976 s, Max = 2.026 s
IQR = 0.008 s, LowerFence = 1.955 s, UpperFence = 1.989 s
ConfidenceInterval = [1.847 s; 2.103 s] (CI 99.9%), Margin = 0.128 s (6.50% of Mean)
Skewness = 0.31, Kurtosis = 1.61, MValue = 2

// ***** BenchmarkRunner: Finish  *****

// * Export *
  BenchmarkDotNet.Artifacts\results\MySimpleBenchmark.TestCase-report.csv
  BenchmarkDotNet.Artifacts\results\MySimpleBenchmark.TestCase-report-github.md
  BenchmarkDotNet.Artifacts\results\MySimpleBenchmark.TestCase-report.html

// * Detailed results *
TestCase.RunBaseLine: Job-RWBPOP(IterationCount=5, RunStrategy=ColdStart)
Runtime = .NET Core 3.1.12 (CoreCLR 4.700.21.6504, CoreFX 4.700.21.6905), X64 RyuJIT; GC = Concurrent Workstation
Mean = 2.711 s, StdErr = 0.004 s (0.13%), N = 5, StdDev = 0.008 s
Min = 2.703 s, Q1 = 2.704 s, Median = 2.708 s, Q3 = 2.715 s, Max = 2.722 s
IQR = 0.011 s, LowerFence = 2.689 s, UpperFence = 2.731 s
ConfidenceInterval = [2.680 s; 2.741 s] (CI 99.9%), Margin = 0.031 s (1.13% of Mean)
Skewness = 0.39, Kurtosis = 1.15, MValue = 2
-------------------- Histogram --------------------
[2.697 s ; 2.728 s) | @@@@@
---------------------------------------------------

TestCase.RunParallelForEach: Job-RWBPOP(IterationCount=5, RunStrategy=ColdStart)
Runtime = .NET Core 3.1.12 (CoreCLR 4.700.21.6504, CoreFX 4.700.21.6905), X64 RyuJIT; GC = Concurrent Workstation
Mean = 1.944 s, StdErr = 0.017 s (0.86%), N = 5, StdDev = 0.037 s
Min = 1.885 s, Q1 = 1.944 s, Median = 1.949 s, Q3 = 1.951 s, Max = 1.989 s
IQR = 0.007 s, LowerFence = 1.933 s, UpperFence = 1.962 s
ConfidenceInterval = [1.800 s; 2.087 s] (CI 99.9%), Margin = 0.143 s (7.37% of Mean)
Skewness = -0.41, Kurtosis = 1.65, MValue = 2
-------------------- Histogram --------------------
[1.857 s ; 1.914 s) | @
[1.914 s ; 1.995 s) | @@@@
---------------------------------------------------

TestCase.RunParallelInvoke: Job-RWBPOP(IterationCount=5, RunStrategy=ColdStart)
Runtime = .NET Core 3.1.12 (CoreCLR 4.700.21.6504, CoreFX 4.700.21.6905), X64 RyuJIT; GC = Concurrent Workstation
Mean = 1.975 s, StdErr = 0.015 s (0.75%), N = 5, StdDev = 0.033 s
Min = 1.933 s, Q1 = 1.968 s, Median = 1.972 s, Q3 = 1.976 s, Max = 2.026 s
IQR = 0.008 s, LowerFence = 1.955 s, UpperFence = 1.989 s
ConfidenceInterval = [1.847 s; 2.103 s] (CI 99.9%), Margin = 0.128 s (6.50% of Mean)
Skewness = 0.31, Kurtosis = 1.61, MValue = 2
-------------------- Histogram --------------------
[1.929 s ; 2.000 s) | @@@@
[2.000 s ; 2.052 s) | @
---------------------------------------------------

// * Summary *

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.18363.1379 (1909/November2018Update/19H2)
Intel Core i7-8665U CPU 1.90GHz (Coffee Lake), 1 CPU, 8 logical and 4 physical cores
.NET Core SDK=5.0.103
  [Host]     : .NET Core 3.1.12 (CoreCLR 4.700.21.6504, CoreFX 4.700.21.6905), X64 RyuJIT
  Job-RWBPOP : .NET Core 3.1.12 (CoreCLR 4.700.21.6504, CoreFX 4.700.21.6905), X64 RyuJIT

IterationCount=5  RunStrategy=ColdStart

|             Method |    Mean |    Error |   StdDev | Ratio | Gen 0 | Gen 1 | Gen 2 | Allocated |
|------------------- |--------:|---------:|---------:|------:|------:|------:|------:|----------:|
|        RunBaseLine | 2.711 s | 0.0307 s | 0.0080 s |  1.00 |     - |     - |     - |     432 B |
| RunParallelForEach | 1.944 s | 0.1432 s | 0.0372 s |  0.72 |     - |     - |     - |    2696 B |
|  RunParallelInvoke | 1.975 s | 0.1283 s | 0.0333 s |  0.73 |     - |     - |     - |     856 B |

// * Hints *
Outliers
  TestCase.RunParallelForEach: IterationCount=5, RunStrategy=ColdStart -> 2 outliers were detected (1.89 s, 1.99 s)
  TestCase.RunParallelInvoke: IterationCount=5, RunStrategy=ColdStart  -> 2 outliers were detected (1.93 s, 2.03 s)

// * Legends *
  Mean      : Arithmetic mean of all measurements
  Error     : Half of 99.9% confidence interval
  StdDev    : Standard deviation of all measurements
  Ratio     : Mean of the ratio distribution ([Current]/[Baseline])
  Gen 0     : GC Generation 0 collects per 1000 operations
  Gen 1     : GC Generation 1 collects per 1000 operations
  Gen 2     : GC Generation 2 collects per 1000 operations
  Allocated : Allocated memory per single operation (managed only, inclusive, 1KB = 1024B)
  1 s       : 1 Second (1 sec)

// * Diagnostic Output - MemoryDiagnoser *


// ***** BenchmarkRunner: End *****
// ** Remained 0 benchmark(s) to run **
Run time: 00:00:40 (40.77 sec), executed benchmarks: 3

Global total time: 00:00:44 (44.48 sec), executed benchmarks: 3
// * Artifacts cleanup *

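【Discussion】:

  • No, for this case Stopwatch is absolutely sufficient. When the observed performance difference is an order of magnitude or more, you do not need Benchmark.NET. Using Benchmark.NET here is like using surgical tools to hammer a nail into a wall.
  • @TheodorZoulias The OP posted a minimal, reproducible example in which he used Stopwatch for the performance measurement. The technique I presented also works for more complex experiments. In my opinion, this approach gives you a better picture of the performance gain or loss than relying on Stopwatch.
  • You gain better insight either by making more observations or by making more accurate ones. Benchmark.NET hinders the first approach, because it makes every observation very slow and tedious. Having to wait 10 minutes for ONE extremely accurate measurement, twiddling your thumbs while Benchmark.NET does its calculations, is not to everyone's taste. In this specific case, IMHO, it is clearly not worth it.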
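
For readers who side with the "more observations" argument, a Stopwatch-based measurement can still be repeated cheaply. The following is only a sketch of that idea; `QuickTimer`, `Measure` and `Report` are hypothetical names and not part of any library or of the posts above.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

public static class QuickTimer
{
    // Runs the action `runs` times and returns the elapsed time of each run
    // in milliseconds. The first run includes JIT compilation cost.
    public static double[] Measure(Action action, int runs)
    {
        var samples = new double[runs];
        for (int i = 0; i < runs; i++)
        {
            var watch = Stopwatch.StartNew();
            action();
            watch.Stop();
            samples[i] = watch.Elapsed.TotalMilliseconds;
        }
        return samples;
    }

    // Prints minimum, median and maximum of the samples
    // (for an even number of samples the upper median is used).
    public static void Report(string name, double[] samples)
    {
        var sorted = samples.OrderBy(x => x).ToArray();
        double median = sorted[sorted.Length / 2];
        Console.WriteLine(name + ": min " + sorted[0].ToString("F1") + " ms, median "
            + median.ToString("F1") + " ms, max " + sorted[sorted.Length - 1].ToString("F1") + " ms");
    }
}
```

With the classes from the answer this could be used as `QuickTimer.Report("sequential", QuickTimer.Measure(() => new SequentialTest().Execute(), 5));`, again built in Release mode. It gives no statistics beyond min, median and max, which is the trade-off the comments argue about.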