【Question title】: Julia Threads.@threads slower than single thread performance
【Posted】: 2021-01-04 01:42:03
【Question description】:

I am trying to solve the heat equation in 1D numerically, ∂u/∂t = ν ∂²u/∂x².

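I am using finite differences, but I ran into some trouble with the @threads macro in Julia. Below are two versions of the same code: the first is single-threaded, the second uses @threads (they are identical apart from the @threads annotation).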
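For reference, the update that both functions apply can be read off from the inner loop body. Written out (with the superscript n as an assumed notation for the time step, which the post itself does not use):

```latex
u_i^{n+1} = u_i^{n} + \frac{\nu\,\Delta t}{\Delta x^{2}}
            \left( u_{i-1}^{n} - 2\,u_i^{n} + u_{i+1}^{n} \right),
\qquad i = 2, \dots, N_x - 1
```

With the parameters used in the code (ν = 0.5, Δt = 1e-6, Δx = 1e-3) the coefficient ν Δt / Δx² equals 0.5.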

function heatSecLoop(;T::Float64)

    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-6
    Δx = 1e-3

    Nt = ceil(Int, T/Δt )
    Nx = ceil(Int,L/Δx + 2)
    u = zeros(Nx)    
    u[round(Int,Nx/2)] = 1
    
    println("starting loop")
    for t=1:Nt-1
        u_old = copy(u)
        for i=2:Nx-1
            u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1]-2u_old[i] + u_old[i+1])
        end

        if t % round(Int,Nt/10) == 0
            println("time = " * string(round(t*Δt,digits=4)) )
        end
    end
    println("done")
    return u
end

function heatParLoop(;T::Float64)

    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-6
    Δx = 1e-3

    Nt = ceil(Int, T/Δt )
    Nx = ceil(Int,L/Δx + 2)
    u = zeros(Nx)    
    u[round(Int,Nx/2)] = 1
    
    println("starting loop")
    for t=1:Nt-1
        u_old = copy(u)
        Threads.@threads for i=2:Nx-1
            u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1]-2u_old[i] + u_old[i+1])
        end

        if t % round(Int,Nt/10) == 0
            println("time = " * string(round(t*Δt,digits=4)) )
        end
    end
    println("done")
    return u
end

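The problem is that the sequential version is faster than the multithreaded one. Here are the timings (after one run to compile):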

julia> Threads.nthreads()
2

julia> @time heatParLoop(T=1.0)
start
starting loop
time = 0.1
time = 0.2
time = 0.3
time = 0.4
time = 0.5
time = 0.6
time = 0.7
time = 0.8
time = 0.9
done
  5.417182 seconds (12.14 M allocations: 9.125 GiB, 6.59% gc time)

julia> @time heatSecLoop(T=1.0)
start
starting loop
time = 0.1
time = 0.2
time = 0.3
time = 0.4
time = 0.5
time = 0.6
time = 0.7
time = 0.8
time = 0.9
done
  3.892801 seconds (1.00 M allocations: 7.629 GiB, 8.06% gc time)

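Of course, the heat equation is just an example of a more complicated problem. I also tried other libraries such as SharedArrays together with Distributed, but the results were even worse.

Any help is appreciated.

【Question comments】:

  • Please see here for a partial solution

Tags: multithreading parallel-processing julia pde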


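【Solution 1】:

This still appears to hold, most likely due to a combination of:

  1. The overhead of Threads.@threads.
  2. Perhaps to a lesser extent, the fact that garbage collection in Julia was single-threaded at the time of writing, while the original version here generates quite a lot of garbage.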
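The first point can be checked with a rough measurement. The sketch below is not part of the original answer: the helper names `sweep_serial!`, `sweep_threaded!` and `time_per_sweep` are made up for illustration, and `r = 0.5` stands in for ν * Δt / Δx^2 with the question's parameters. It times a single sweep over a grid of the question's size (Nx = 1002) with and without Threads.@threads, and reports how much one `copy(u)` allocates. The numbers depend on the machine and on Threads.nthreads(), so none are quoted here.

```julia
# One explicit finite-difference sweep, serial.
function sweep_serial!(u, u_old, r)
    for i in 2:length(u)-1
        u[i] = u_old[i] + r * (u_old[i-1] - 2u_old[i] + u_old[i+1])
    end
    return u
end

# The same sweep, but scheduled through Threads.@threads.
function sweep_threaded!(u, u_old, r)
    Threads.@threads for i in 2:length(u)-1
        u[i] = u_old[i] + r * (u_old[i-1] - 2u_old[i] + u_old[i+1])
    end
    return u
end

# Average wall-clock time of one sweep in seconds (hypothetical helper).
function time_per_sweep(f!, u, u_old, r; reps = 10_000)
    f!(u, u_old, r)                  # warm-up call so compilation is not timed
    t = @elapsed for _ in 1:reps
        f!(u, u_old, r)
    end
    return t / reps
end

Nx = 1002                            # grid size used in the question
r = 0.5                              # assumed stand-in for ν * Δt / Δx^2
u_old = zeros(Nx); u_old[Nx ÷ 2] = 1.0

a = sweep_serial!(zeros(Nx), u_old, r)
b = sweep_threaded!(zeros(Nx), u_old, r)
println("results agree:      ", a == b)
println("serial per sweep:   ", time_per_sweep(sweep_serial!, zeros(Nx), u_old, r), " s")
println("threaded per sweep: ", time_per_sweep(sweep_threaded!, zeros(Nx), u_old, r), " s")
println("bytes allocated by one copy(u): ", @allocated(copy(u_old)))
```

If the threaded time per sweep is not clearly below the serial one, the cost of scheduling the threaded loop outweighs the work done on roughly a thousand grid points, and that cost is paid once per time step.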

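In addition, following the suggestions in the linked discussion thread, it may be worth noting that LoopVectorization.jl now provides a threaded version (@tturbo) of its @avx (now @turbo) macro. It uses lightweight threads from Polyester.jl and, although the threading overhead is still not negligible, it manages to eke out slightly better performance: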

function heatSecLoop(;T::Float64)

    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-6
    Δx = 1e-3

    Nt = ceil(Int, T/Δt )
    Nx = ceil(Int,L/Δx + 2)
    u = zeros(Nx)    
    u[round(Int,Nx/2)] = 1
    u_old = copy(u)  # copy rather than similar, so the boundary entries are initialized in both buffers

    println("starting loop")
    for t=1:Nt-1
        u_old, u = u, u_old
        for i=2:Nx-1
            u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1]-2u_old[i] + u_old[i+1])
        end

        if t % round(Int,Nt/10) == 0
            println("time = " * string(round(t*Δt,digits=4)) )
        end
    end
    println("done")
    return u
end

using LoopVectorization  # provides @turbo and @tturbo

function heatVecLoop(;T::Float64)
    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-6
    Δx = 1e-3

    Nt = ceil(Int, T/Δt )
    Nx = ceil(Int,L/Δx + 2)
    u = zeros(Nx)
    u[round(Int,Nx/2)] = 1
    u_old = copy(u)  # copy rather than similar, so the boundary entries are initialized in both buffers

    println("starting loop")
    for t=1:Nt-1
       u_old, u = u, u_old
       @turbo for i=2:Nx-1
           u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1]-2u_old[i] + u_old[i+1])
       end

       if t % round(Int,Nt/10) == 0
           println("time = " * string(round(t*Δt,digits=4)) )
       end
    end
    println("done")
    return u
end

function heatTVecLoop(;T::Float64)
    println("start")
    L = 1
    ν = 0.5
    Δt = 1e-6
    Δx = 1e-3

    Nt = ceil(Int, T/Δt )
    Nx = ceil(Int,L/Δx + 2)
    u = zeros(Nx)
    u[round(Int,Nx/2)] = 1
    u_old = copy(u)  # copy rather than similar, so the boundary entries are initialized in both buffers

    println("starting loop")
    for t=1:Nt-1
       u_old, u = u, u_old
       @tturbo for i=2:Nx-1
           u[i] = u_old[i] + ν * Δt/(Δx^2)*(u_old[i-1]-2u_old[i] + u_old[i+1])
       end

       if t % round(Int,Nt/10) == 0
           println("time = " * string(round(t*Δt,digits=4)) )
       end
    end
    println("done")
    return u
end

julia> @time heatSecLoop(T=1.0)
start
starting loop
time = 0.1
time = 0.2
time = 0.3
time = 0.4
time = 0.5
time = 0.6
time = 0.7
time = 0.8
time = 0.9
done
  1.786011 seconds (114 allocations: 22.094 KiB)

julia> @time heatVecLoop(T=1.0)
start
starting loop
time = 0.1
time = 0.2
time = 0.3
time = 0.4
time = 0.5
time = 0.6
time = 0.7
time = 0.8
time = 0.9
done
  0.314305 seconds (114 allocations: 22.094 KiB)

julia> @time heatTVecLoop(T=1.0)
start
starting loop
time = 0.1
time = 0.2
time = 0.3
time = 0.4
time = 0.5
time = 0.6
time = 0.7
time = 0.8
time = 0.9
done
  0.300656 seconds (114 allocations: 22.094 KiB)

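The performance of the single-threaded @turbo-vectorized version also seems to have improved significantly since this question was first asked, and the multithreaded @tturbo version will likely pull further ahead for larger problem sizes.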
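Part of the gain in the timings above comes from a change the code makes without comment: instead of calling copy(u) on every time step, the two arrays are allocated once and swapped. The sketch below isolates that change using only Base Julia. The function names `heat_copy` and `heat_swap` and the parameters `Nx`, `nsteps` and `r` are made up for illustration, and `r = 0.25` is an arbitrary stable coefficient rather than the value from the question.

```julia
# Reference version: allocates a fresh copy of u on every time step.
function heat_copy(Nx, nsteps, r)
    u = zeros(Nx); u[Nx ÷ 2] = 1.0
    for _ in 1:nsteps
        u_old = copy(u)
        for i in 2:Nx-1
            u[i] = u_old[i] + r * (u_old[i-1] - 2u_old[i] + u_old[i+1])
        end
    end
    return u
end

# Ping-pong version: two buffers allocated once and swapped each step.
function heat_swap(Nx, nsteps, r)
    u = zeros(Nx); u[Nx ÷ 2] = 1.0
    u_old = copy(u)              # initialized copy, so both buffers carry the boundary values
    for _ in 1:nsteps
        u_old, u = u, u_old      # swap the bindings, no data is copied
        for i in 2:Nx-1
            u[i] = u_old[i] + r * (u_old[i-1] - 2u_old[i] + u_old[i+1])
        end
    end
    return u
end

Nx, nsteps, r = 1002, 1_000, 0.25
heat_copy(Nx, 1, r); heat_swap(Nx, 1, r)     # compile both before measuring

println("same result: ", heat_copy(Nx, nsteps, r) == heat_swap(Nx, nsteps, r))
println("copy version allocates ", @allocated(heat_copy(Nx, nsteps, r)), " bytes")
println("swap version allocates ", @allocated(heat_swap(Nx, nsteps, r)), " bytes")
```

Both functions perform the same arithmetic in the same order, so the results should match exactly, while the swap version only allocates its two buffers up front. That keeps the garbage collector out of the time loop regardless of whether the inner loop is serial, @turbo or @tturbo.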
