Bucketing Algorithm: Answers

【Question Title】: Bucketing Algorithm
【Posted】: 2009-05-14 12:43:35
【Question】:

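I have some code that works, but it is a bit of a bottleneck, and I have been trying to figure out how to speed it up. It runs in a loop, and I cannot figure out how to vectorize it.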

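I have a 2D array, vals, that represents time series data. Rows are dates, columns are different series. I am trying to bucket the data by month so I can perform various operations on it (sum, mean, etc.). Here is my current code: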

allDts; %Dates/times for vals.  Size is [size(vals, 1), 1]
vals;
[Y M] = datevec(allDts);
fomDates = unique(datenum(Y, M, 1)); %first of the month dates

[Y M] = datevec(fomDates);
nextFomDates = datenum(Y, M, DateUtil.monthLength(Y, M)+1);

newVals = nan(length(fomDates), size(vals, 2)); %preallocate for speed

for k = 1:length(fomDates);

This next line is the bottleneck, because it is called so many times (once per loop iteration):

    idx = (allDts >= fomDates(k)) & (allDts < nextFomDates(k));
    bucketed = vals(idx, :);
    newVals(k, :) = nansum(bucketed);
end %for

Any ideas? Thanks in advance.

【Question Comments】:

This should use accumarray...
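
The accumarray suggestion could look roughly like the sketch below. It is an illustration, not code from the thread: the sample allDts and vals are made up, NaNs are set to zero before summing to mimic nansum, and accumarray is called once per series because it accumulates a single vector of values at a time.

```matlab
% Hypothetical sample data: 5 dates spanning 2 months, 2 series.
allDts = datenum(2009, [1 1 2 2 2]', [5 20 1 14 28]');
vals   = [1 10; 2 NaN; 3 30; NaN 40; 5 50];

[Y, M] = datevec(allDts);
% The third output of unique maps each row to its month bucket (1..nMonths).
[fomDates, ~, bucket] = unique(datenum(Y, M, 1));

clean = vals;
clean(isnan(clean)) = 0;          % treat NaN as 0, as nansum does

nMonths = numel(fomDates);
newVals = zeros(nMonths, size(vals, 2));
for c = 1:size(vals, 2)           % loop over series, not over months
    newVals(:, c) = accumarray(bucket, clean(:, c), [nMonths 1]);
end
```

The loop now runs once per column instead of once per month, and no logical mask over allDts is built inside it. Passing a function handle such as @mean as the fourth argument of accumarray would give other aggregations, at some cost in speed.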

Tags: matlab vectorization

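【Solution 1】:

That is a difficult problem to vectorize. I can suggest a way to do it using CELLFUN, but I cannot guarantee that it will be faster for your problem (you will have to time it yourself on the specific data sets you are using). As discussed in this other SO question, vectorizing does not always run faster than a for loop; which option is best can be very problem-specific. With that disclaimer, here are two solutions for you to try: a CELLFUN version, and a modification of your for-loop version that may run faster.

CELLFUN SOLUTION: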

[Y,M] = datevec(allDts);
monthStart = datenum(Y,M,1);  % Start date of each month
[monthStart,sortIndex] = sort(monthStart);  % Sort the start dates
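[uniqueStarts,uniqueIndex] = unique(monthStart,'last');  % Unique start dates and index of the last row of each month

valCell = mat2cell(vals(sortIndex,:),diff([0; uniqueIndex(:)]));  % Group rows by month
newVals = cellfun(@(x) nansum(x,1),valCell,'UniformOutput',false);  % Sum down columns, even for one-row months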

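The call to MAT2CELL groups the rows of vals that share the same start date into the cells of the cell array valCell. The variable newVals will be a cell array of length numel(uniqueStarts), where each cell contains the result of applying nansum to the corresponding cell of valCell.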
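
Since the result comes back as a cell array, it usually has to be stacked into a matrix before further use. The sketch below shows one way to do that on made-up data. It assumes the dates are already sorted, passes 'last' to unique so the index of the final row of each month is requested explicitly, and uses plain sum in place of nansum only so that it runs without the Statistics Toolbox.

```matlab
% Hypothetical sample data: 4 sorted dates across 2 months, 2 series.
allDts = datenum(2009, [3 3 3 4]', [2 9 30 15]');
vals   = [1 2; 3 4; 5 6; 7 8];

[Y, M] = datevec(allDts);
monthStart = datenum(Y, M, 1);
% 'last' returns the index of the last row belonging to each month.
[uniqueStarts, lastIndex] = unique(monthStart, 'last');

valCell = mat2cell(vals, diff([0; lastIndex(:)]));   % rows grouped by month
% sum(x, 1) keeps a one-row group from collapsing to a scalar.
sumCell = cellfun(@(x) sum(x, 1), valCell, 'UniformOutput', false);
newVals = vertcat(sumCell{:});                       % nMonths-by-nSeries matrix
```

Swapping sum(x, 1) for nansum(x, 1) restores the NaN handling of the original code.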

FOR-LOOP SOLUTION:

[Y,M] = datevec(allDts);
monthStart = datenum(Y,M,1);  % Start date of each month
[monthStart,sortIndex] = sort(monthStart);  % Sort the start dates
[uniqueStarts,uniqueIndex] = unique(monthStart,'last');  % Unique start dates and index of the last row of each month

vals = vals(sortIndex,:);  % Sort the values according to start date
nMonths = numel(uniqueStarts);
uniqueIndex = [0; uniqueIndex(:)];  % Prepend 0 so month i spans rows uniqueIndex(i)+1:uniqueIndex(i+1)
newVals = nan(nMonths,size(vals,2));  % Preallocate
for iMonth = 1:nMonths
  index = (uniqueIndex(iMonth)+1):uniqueIndex(iMonth+1);
  newVals(iMonth,:) = nansum(vals(index,:),1);  % Sum down columns, even for a one-row month
end
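
Since the answer recommends timing both variants on your own data, a rough tic/toc harness might look like this. Everything in it is an assumption made for illustration: the random benchmark data, the variable names, and the use of sum instead of nansum (so it runs without the Statistics Toolbox). Timings from a single run are noisy, so repeat the measurement before drawing conclusions.

```matlab
% Hypothetical benchmark data: 10 years of daily rows, 50 series.
allDts = (datenum(2000,1,1):datenum(2009,12,31))';
vals   = rand(numel(allDts), 50);

[Y, M] = datevec(allDts);
[~, lastIndex] = unique(datenum(Y, M, 1), 'last');   % dates are already sorted
edges   = [0; lastIndex(:)];
nMonths = numel(lastIndex);

tic;                                                 % time the for-loop variant
loopVals = nan(nMonths, size(vals, 2));
for iMonth = 1:nMonths
    loopVals(iMonth, :) = sum(vals(edges(iMonth)+1:edges(iMonth+1), :), 1);
end
tLoop = toc;

tic;                                                 % time the cellfun variant
valCell  = mat2cell(vals, diff(edges));
cellVals = cellfun(@(x) sum(x, 1), valCell, 'UniformOutput', false);
cellVals = vertcat(cellVals{:});
tCell = toc;

fprintf('loop: %.4f s, cellfun: %.4f s\n', tLoop, tCell);
```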

【Comments】:

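Thanks. That sped it up by about 50%! If I understand the code correctly, this line: valCell = mat2cell(vals,diff([0; uniqueIndex])); is the key - it breaks the values up into cells whose lengths are given by the second argument. (No sort needed - the dates and their associated values are guaranteed to be sorted
Yes, it sounds like you have got it. The second argument to MAT2CELL is a vector of sizes into which the rows of the first argument are split. For example, if the first argument is a 6-by-3 matrix (call it A) and the second argument is [1 2 3], MAT2CELL returns a 3-element cell array (call it B) equal to: B = {A(1,:); A(2:3,:); A(4:6,:)}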
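
The MAT2CELL example in the reply above can be checked directly. This is a small sketch; the contents of A are arbitrary and chosen only for illustration.

```matlab
A = reshape(1:18, 6, 3);            % any 6-by-3 matrix will do
B = mat2cell(A, [1 2 3]);           % split the rows into blocks of 1, 2, and 3

expected = {A(1,:); A(2:3,:); A(4:6,:)};
same = isequal(B, expected);        % true when the blocks match
```

The sizes in the second argument must add up to size(A, 1), otherwise MAT2CELL raises an error.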

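【Solution 2】:

If all you need to do is form the sum or mean of the rows of a matrix, where the rows are grouped according to another variable (the date), then use my consolidator function. It is designed to do exactly this: reduce data based on the values of an indicator series. (Actually, consolidator can also work on n-d data, and with a tolerance, but all you need to do is pass it the month and year information.)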

Find consolidator on the file exchange on Matlab Central

【Comments】: