Title: NumPy: Average overlapping matrices?
Posted: 2020-01-05 06:45:38
Question:

Suppose you have a matrix of shape (n, m).

Additionally, you have k smaller matrices of shape (s, m).

These k matrices may have been produced by slicing the larger matrix into shards:

stride = z
ranges = [] # will contain sub lists of start / end positions
for i in range(0, n, stride):
    if i + s > n:
        ranges.append([n-s, n]) #<-- if not evenly divisible include last ragged bit
        break
    else:
        ranges.append([i, i+s])

# k = len(ranges)

for a, b in ranges:
    submat = mat[a:b] # <--- produces submats of shape (s, m)
    # not necessarily where submats come from, just for 
    # simple example purpose, feel free to add random noise to each submat

How can these k overlapping submatrices be joined in numpy, averaging the overlapping regions?

The goal is then to take these submats and reconstruct the original mat, e.g. something like:

blank = np.zeros((n, m))
for i in range(len(submats)):
    a, b = ranges[i]
    blank[a:b] += submats[i] #<--- doesn't account for different amounts of overlapping regions
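A minimal fix for the loop above (a sketch using made-up small dimensions, not the question's actual data) is to also accumulate, per row, how many submats contributed, and divide at the end:

```python
import numpy as np

n, m = 10, 3
ranges = [[0, 6], [4, 10]]              # two overlapping row ranges (illustrative)
mat = np.arange(n * m, dtype=float).reshape(n, m)
submats = [mat[a:b] for a, b in ranges]

blank = np.zeros((n, m))
counts = np.zeros((n, 1))               # contributions per row, broadcast over columns
for (a, b), submat in zip(ranges, submats):
    blank[a:b] += submat
    counts[a:b] += 1
blank /= counts                         # average the overlapping regions
```

With unmodified slices this reproduces `mat` exactly, which doubles as a quick sanity check.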

Concrete numbers:

n = 693
m = 10
# so mat has shape (693, 10)

s = 500
stride = 50

ranges = [[0, 500], [50, 550], [100, 600], [150, 650], [193, 693]]
# notice that the range (0,50) doesn't need to be averaged

k = 5 # len(ranges)

# so we have k submats of shape (500, 10)

I am currently doing this:

def count_overlap(max_len, ranges): # from example 693, and [[0, 500], ...]
    tally = np.zeros(max_len)
    for i in range(max_len):
        for a, b in ranges:
            if a <= i and i < b:
                tally[i] += 1
    return tally


olap = count_overlap(693, ranges)
olap[:55]
# ([1., 1., ..., 1., 2., 2., 2., 2., 2.])
olap[-50:]
# ([2., 2., 2., 2., 2., 2., 2., 2., 1., 1., ..., 1., 1., 1.])

to know how much to divide each row index of the mat by.
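The O(n·k) double loop in count_overlap can be replaced by marking +1 at each start and -1 at each stop and taking a cumulative sum, which is the same idea the answers below exploit (a sketch; `count_overlap_fast` is a hypothetical name):

```python
import numpy as np

def count_overlap_fast(max_len, ranges):
    # +1 at every start, -1 at every stop; the cumulative sum then gives,
    # for each row index, the number of ranges covering it
    delta = np.zeros(max_len + 1)
    for a, b in ranges:
        delta[a] += 1
        delta[b] -= 1
    return delta.cumsum()[:max_len]

ranges = [[0, 500], [50, 550], [100, 600], [150, 650], [193, 693]]
olap = count_overlap_fast(693, ranges)
```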

Comments:

  • @Divakar In the post, yes, they are all (s, m)
  • @Divakar The shape is the same as the starting mat, (n, m)
  • @Divakar No. You want to recreate the (n, m) matrix from the k (s, m) matrices, where the k matrices overlap in the first dimension
  • How many ranges might you typically have? If it is a small number, then slicing and adding in a loop makes sense, as you already have.
  • @Divakar I updated the question to provide some concrete examples; k is typically

Tags: python numpy


Solution 1:

Although I came up with an alternative (and less elegant) approach to my own question, I am not trying to answer it myself. Rather, I am just wrapping up the relevant helper functions and solutions for others to use:

Helpers

def shard_rng(maxlen, sublen, stride):
    ranges = [] 
    for i in range(0, maxlen, stride):
        if i + sublen > maxlen:
            ranges.append([maxlen-sublen, maxlen])
            break
        else:
            ranges.append([i, i+sublen])
    return ranges
# for testing: since each submat gets +1, stitched_mat - mat should be all 1s
def split_mat(mat, ranges):
    submats = []
    for a,b in ranges:
        submats.append(mat[a:b] + 1)
    return submats
# part of solution 1
def weight_rngs(ranges):
    n = ranges[-1][-1]
    bins = map(np.bincount,np.array(ranges).T,(None,None),(n+1,n+1))
    vals = np.subtract(*bins).cumsum()
    weights = 1 / vals[:n,None]
    return weights
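To see what weight_rngs produces (an illustrative check, not part of the original answer), the example ranges from the question yield the reciprocal of the coverage count at each row: 1.0 where a single range covers a row, 0.2 where all five do. Note the helper assumes the last range ends at the full length n:

```python
import numpy as np

def weight_rngs(ranges):
    n = ranges[-1][-1]  # assumes the final range ends at the full length
    # bincount the starts and the stops separately, subtract, and cumsum:
    # this yields the number of ranges covering each row
    bins = map(np.bincount, np.array(ranges).T, (None, None), (n + 1, n + 1))
    vals = np.subtract(*bins).cumsum()
    return 1 / vals[:n, None]

ranges = [[0, 500], [50, 550], [100, 600], [150, 650], [193, 693]]
w = weight_rngs(ranges)
```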

Solutions

(provided by @Paul Panzer)
# solution 1
def stitch_mats(shape, submats, ranges):    
    stitched = np.zeros(shape)
    weights = weight_rngs(ranges)
    for submat, (start, stop) in zip(submats, ranges):
        stitched[start:stop] += weights[start:stop] * submat        
    return stitched
# solution 2
def stitch_mats2(shape, submats, ranges):
    ranges = np.array(ranges)
    ro = ranges.ravel().argsort(kind='stable')

    # put 1 for starting and -1 for ending, take cumsum
    cnts = (1-((ro&1)<<1)).cumsum()

    stitched = np.zeros(shape)  # use the shape argument, not the globals n, m
    # add slices
    for submat, (start, stop) in zip(submats,ranges):
        stitched[start:stop] += submat

    rs = ranges.ravel()[ro]
    # divide by overlap
    for start, stop, count in zip(rs[:-1],rs[1:],cnts[:-1]):
        stitched[start:stop] /= count
    return stitched
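The ro / cnts bookkeeping in stitch_mats2 is easiest to see on a tiny made-up example: ravelling the ranges puts starts at even flat indices and stops at odd ones, so after a stable argsort, 1-((ro&1)&lt;&lt;1) maps starts to +1 and stops to -1, and the cumsum is the coverage count on each segment between consecutive sorted boundaries:

```python
import numpy as np

ranges = np.array([[0, 4], [2, 6]])          # two overlapping ranges (illustrative)
ro = ranges.ravel().argsort(kind='stable')   # sort order of the boundary values
signs = 1 - ((ro & 1) << 1)                  # even flat index (start) -> +1, odd (stop) -> -1
cnts = signs.cumsum()                        # coverage count after each boundary
rs = ranges.ravel()[ro]                      # the boundary values, sorted
```

Here the segments [0, 2), [2, 4), [4, 6) are covered by 1, 2, and 1 ranges respectively, which is exactly what the second loop in stitch_mats2 divides by.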

Testing

n = 693
m = 10
s = 500 # sublen
stride = 50

mat = np.random.randint(0,10,(n,m))
ranges = shard_rng(n, s, stride)
submats = split_mat(mat, ranges)


stitched_1 = stitch_mats(mat.shape, submats, ranges)
stitched_2 = stitch_mats2(mat.shape, submats, ranges)

np.unique(stitched_1-mat-1.), np.unique(stitched_2-mat-1.)
# array([-8.8817842e-16,  0.0000000e+00,  4.4408921e-16,  8.8817842e-16]), array([0.])

Comments:

    Solution 2:

    Here is a way to compute the per-row overlap using bincount + cumsum:

    Update: added another approach that uses only slicing. I expect this to generally be faster.

    import numpy as np
    
    n = 693
    m = 10
    # so mat has shape (693, 10)
    
    s = 500
    stride = 50
    
    ranges = [[0, 500], [50, 550], [100, 600], [150, 650], [193, 693]]
    # notice that the range (0,50) doesn't need to be averaged
    
    k = 5 # len(ranges)
    
    mat = np.random.randint(0,10,(n,m))
    submats = []
    for a, b in ranges:
        submats.append(mat[a:b])
    
    
    ranges = np.asarray(ranges)
    out = np.zeros((n,m))
    # put a 1 at every start and a -1 at every stop
    # then take the cumsum this will assign to each row the
    # number of intervals it is in
    # finally, take the reciprocal
    weight = 1 / np.subtract(*map(np.bincount,ranges.T,(None,None),(n+1,n+1))).cumsum()[:n,None]
    for sm,(a,b) in zip(submats,ranges):
        out[a:b] += weight[a:b] * sm
    
    
    # method 2
    
    # sort range ends
    ro = ranges.ravel().argsort(kind='stable')
    # put 1 for starting and -1 for ending, take cumsum
    cnts = (1-((ro&1)<<1)).cumsum()
    out = np.zeros((n,m))
    # add slices
    for sm,(a,b) in zip(submats,ranges):
        out[a:b] += sm
    rs = ranges.ravel()[ro]
    # divide by overlap
    for a,b,c in zip(rs[:-1],rs[1:],cnts[:-1]):
        out[a:b] /= c
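As the comments below suggest, a quick sanity check for method 1 is that stitching unmodified slices back together should reproduce mat up to rounding; a self-contained version (seed and float cast added for reproducibility):

```python
import numpy as np

n, m = 693, 10
ranges = np.array([[0, 500], [50, 550], [100, 600], [150, 650], [193, 693]])
rng = np.random.default_rng(0)
mat = rng.integers(0, 10, (n, m)).astype(float)
submats = [mat[a:b] for a, b in ranges]

# reciprocal of the per-row coverage count, via bincount of starts/stops + cumsum
weight = 1 / np.subtract(*map(np.bincount, ranges.T, (None, None), (n + 1, n + 1))).cumsum()[:n, None]
out = np.zeros((n, m))
for sm, (a, b) in zip(submats, ranges):
    out[a:b] += weight[a:b] * sm
```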
    

    Comments:

    • I came up with an equivalent, though less elegant, solution :P Thanks for your help. Except our methods differ at just one index... -43 (you have to take (1 / overlap)). I am not too familiar with what you did, so I need to figure out which is correct
    • @SumNeuron Using the unmodified submatrix slices, the stitched version out should be identical to mat up to rounding. Maybe you can use that as a quick test.
    • Sorry, np.where(weight.flatten() != (1/olap)) yields (array([500, 550, 600, 650]),), so either I am double counting the bin edges, or the bincount method is. I do not quite understand what you mean by out "should be, up to rounding..."
    • Sorry, I think my method over-counted :P and the &lt;= b
    • @SumNeuron I meant that np.allclose(mat, out) should return True for the example. Also, I added another method which I think should be faster. Be sure to check it out.