Vectorizing multiple for loops in Matlab/Python

【Question title】: Vectorizing multiple for loops in Matlab/Python
【Posted】: 2019-09-20 21:19:54
【Question description】:

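I am trying to code a mathematical model that involves computing a particular quantity on a numerical grid thousands of times, with some of the model parameters changing between runs. At present this is too slow, and I am looking for advice on vectorizing the most intensive part of the model.

For readability I currently have a basic loop-based implementation, but I would now like to vectorize the whole code segment below as far as possible. A minimal example of the segment is: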

% Setup grid to evaluate and results vector
T_max = 10000;
eval_points = linspace(0, T_max, 1000);
results = zeros(size(eval_points));
% Function that is used in computation
Z_func = @(x, omega) (1./(omega.*sqrt(2*pi))).*exp( -(x.^2)./(2.*omega.*omega) );
% Random data for now, known in full problem
historic_weights = rand(1,100);
historic_times   = rand(1,100);
% Fixed single parameter omega
omega            = 0.5;
% Time evaluation
tic()
for eval_counter = 1:size(eval_points,2)
    for historic_counter = 1:size(historic_weights,2)
        temp_result = 0;
        for k = 0:1:T_max
            temp_result = temp_result + Z_func( eval_points(eval_counter) - historic_times(historic_counter) + 1440*floor(historic_times(historic_counter)/1440) - 1440*k, omega );
        end % End of looping over k
        results(eval_counter) = results(eval_counter) + historic_weights(historic_counter)*temp_result;
    end % End of looping over weights 
end % End of looping over evaluation points
toc()

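On my computer this takes just over 60 seconds to evaluate. I do not want to use the Parallel Computing Toolbox, because I already use it elsewhere and the segment shown is called on every process.

If this is not possible in Matlab, I am happy to try Python as well.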

【Question comments】:

Tags: matlab performance vectorization

【Solution 1】:

You can vectorize the two inner loops fairly easily by computing temp_result and results as matrices rather than one element at a time. For example:

for eval_counter = 1:size(eval_points,2)
    temp_result = sum(Z_func( eval_points(eval_counter) - historic_times + 1440*floor(historic_times/1440) - 1440*(0:1:T_max)', omega ));
    results(eval_counter) = results(eval_counter) + sum(historic_weights.*temp_result);
end % End of looping over evaluation points

This runs in about 9 seconds on my machine, compared with 73 seconds for the looped version.
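Since the question also allows Python, the same partial vectorization might be sketched with NumPy broadcasting as follows. This is an illustrative port, not code from the answer: the names z_func and evaluate are assumptions, and the timings quoted above do not apply to it.

```python
import numpy as np


def z_func(x, omega):
    """Gaussian density with standard deviation omega, evaluated elementwise."""
    return np.exp(-(x ** 2) / (2.0 * omega ** 2)) / (omega * np.sqrt(2.0 * np.pi))


def evaluate(eval_points, historic_times, historic_weights, omega, t_max):
    """Keep one loop over evaluation points; broadcast over k and the events."""
    # Column of k values, shape (t_max + 1, 1)
    k = np.arange(t_max + 1)[:, None]
    # Part of the argument that does not depend on the evaluation point,
    # shape (n_hist,)
    offset = -historic_times + 1440.0 * np.floor(historic_times / 1440.0)
    results = np.zeros(len(eval_points))
    for i, x in enumerate(eval_points):
        # (t_max + 1, n_hist) matrix of arguments, summed over k
        temp_result = z_func(x + offset - 1440.0 * k, omega).sum(axis=0)
        # Weighted sum over the historic events
        results[i] = np.dot(historic_weights, temp_result)
    return results
```

Hoisting offset out of the loop is a small extra saving, because that term is the same for every evaluation point.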

Now, in theory you could do this without a single loop, as follows:

eval_points = linspace(0,T_max,1000);
historic_weights = rand(100,1); % Note transposed from original
historic_times   = rand(100,1);
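eval_loop = reshape(0:T_max,1,1,[]); % size = [1,1,T_max+1] = [1,1,10001]

% Broadcasts to a 100 x 1000 x 10001 array: sum over k (dim 3), then over the weighted events (dim 1)
result = sum(historic_weights.*sum(Z_func(eval_points - historic_times + 1440*floor(historic_times/1440) - 1440*eval_loop, omega ),3),1);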

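However, this will use a large amount of memory (>8 GB), so it may not be feasible for your current case. I do not have enough memory on my current machine to test it, so I do not know how fast it would run, but in theory it should be faster because there are no for loops in the code.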
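The memory figure can be checked with a little arithmetic: the loop-free expression broadcasts to a 100 x 1000 x 10001 array of doubles. The helper names broadcast_bytes and max_eval_per_chunk below are hypothetical, and the estimate counts only one such array, ignoring any temporaries created while the expression is evaluated.

```python
def broadcast_bytes(n_hist, n_eval, n_k, itemsize=8):
    """Bytes needed for one dense n_hist x n_eval x n_k array of doubles."""
    return n_hist * n_eval * n_k * itemsize


def max_eval_per_chunk(budget_bytes, n_hist, n_k, itemsize=8):
    """Largest number of evaluation points whose broadcast fits the budget."""
    return budget_bytes // (n_hist * n_k * itemsize)


# Sizes from the question: 100 events, 1000 evaluation points, k = 0..10000
full = broadcast_bytes(100, 1000, 10001)
print(f"{full / 1e9:.2f} GB")                    # prints "8.00 GB"
print(max_eval_per_chunk(10**9, 100, 10001))     # prints 124
```

Processing the evaluation points in chunks of roughly that size would be one way to keep the broadcast form within a fixed memory budget, at the cost of reintroducing a short loop over chunks.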

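【Comments】:

The first block of code runs in about 3 seconds on my machine, which is already a huge improvement. I do have access to hundreds of GB of RAM, so I tried the second block, but on the result = line I get the error "Matrix dimensions must agree." Do you know whether something should, or should not, be transposed?
Oh, sorry. I forgot to replace (0:1:T_max)' with eval_loop. I have fixed it, and it should work now.
Did you check that the results are the same? If I run the original script against the first solution with T_max = 10 and eval_points = linspace(0, T_max, 10), the two results differ (by quite a lot). Maybe I am doing something wrong, which is why I am asking whether you can reproduce the original results.
I'll double check.
There was actually a typo in my original code: the variable temp_result was not reset to 0 before each loop over k. MrAzzaman's answer is in fact correct; it both vectorizes the code and fixes this typo. I have edited the question to reflect this.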
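A check along the lines discussed in these comments could look like the following NumPy sketch, which compares a triple loop (with temp_result reset for every evaluation point and event pair) against a fully broadcast version on a small grid. The function names and the random inputs are assumptions chosen for illustration.

```python
import numpy as np


def z_func(x, omega):
    """Gaussian density with standard deviation omega, evaluated elementwise."""
    return np.exp(-(x ** 2) / (2.0 * omega ** 2)) / (omega * np.sqrt(2.0 * np.pi))


def loop_reference(eval_points, historic_times, historic_weights, omega, t_max):
    """Triple loop, with temp_result reset before each loop over k."""
    results = np.zeros(len(eval_points))
    for i, x in enumerate(eval_points):
        for t, w in zip(historic_times, historic_weights):
            temp_result = 0.0
            for k in range(t_max + 1):
                arg = x - t + 1440.0 * np.floor(t / 1440.0) - 1440.0 * k
                temp_result += z_func(arg, omega)
            results[i] += w * temp_result
    return results


def full_broadcast(eval_points, historic_times, historic_weights, omega, t_max):
    """Loop-free version using an (n_hist, n_eval, t_max + 1) intermediate."""
    t = historic_times[:, None, None]        # (n_hist, 1, 1)
    x = eval_points[None, :, None]           # (1, n_eval, 1)
    k = np.arange(t_max + 1)[None, None, :]  # (1, 1, t_max + 1)
    arg = x - t + 1440.0 * np.floor(t / 1440.0) - 1440.0 * k
    inner = z_func(arg, omega).sum(axis=2)   # sum over k -> (n_hist, n_eval)
    return historic_weights @ inner          # weighted sum -> (n_eval,)


# Small problem, as suggested in the comments: T_max = 10 and 10 points
t_max = 10
eval_points = np.linspace(0.0, t_max, 10)
rng = np.random.default_rng(0)               # fixed seed for repeatability
historic_times = rng.random(5)
historic_weights = rng.random(5)
same = np.allclose(
    loop_reference(eval_points, historic_times, historic_weights, 0.5, t_max),
    full_broadcast(eval_points, historic_times, historic_weights, 0.5, t_max),
)
print(same)
```

Running the comparison on a small grid first keeps the loop version fast and the broadcast intermediate tiny, so a mismatch such as the missing reset shows up before any large run.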