【Posted】: 2016-09-20 12:51:15
【Question】:
In numpy / scipy (or pure Python, if you prefer), what is a good way to group the contiguous regions of a numpy array and count the length of each region?
Something like this:
x = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])
y = contiguousGroup(x)
print y
>> [[1,3], [2,2], [3,1], [0,5], [1,1], [2,1], [3,1], [1,2], [0,3]]
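I tried doing this with just a loop, but it takes longer than I would like (6 seconds) to build the list for an array with around 30 million samples and 20000 contiguous regions.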
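For reference, a plain loop of the kind described above might look roughly like the sketch below. This is only an illustration of the expected behaviour, not the exact code that was timed; the name `contiguous_group_loop` is made up for this example.

```python
import numpy as np

def contiguous_group_loop(x):
    """Return [value, run_length] pairs for runs of equal consecutive values.

    Hypothetical baseline, written as a straightforward Python loop.
    """
    groups = []
    if len(x) == 0:
        return groups
    current = x[0]
    count = 1
    for value in x[1:]:
        if value == current:
            count += 1  # still inside the same run
        else:
            groups.append([int(current), count])  # close the finished run
            current = value
            count = 1
    groups.append([int(current), count])  # flush the final run
    return groups

x = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])
result = contiguous_group_loop(x)
print(result)
```

Every element is visited once in interpreted Python, which is presumably why this style takes seconds on arrays with tens of millions of samples.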
Edit:
Now for some speed comparisons (just using time.clock() and a few hundred iterations, or fewer when a single run takes seconds).
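A timing helper along these lines would produce numbers in the same format. It is a sketch, not the harness actually used here: `time_call` is a hypothetical name, and it uses `time.perf_counter()` because `time.clock()` was removed in Python 3.8.

```python
import time

def time_call(func, *args, repeats=100):
    """Return the mean wall-clock time in seconds per call over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        func(*args)
    return (time.perf_counter() - start) / repeats

# Stand-in workload; replace `sum` with the grouping function being measured.
elapsed = time_call(sum, range(1000), repeats=200)
print("Time taken = %f seconds..." % elapsed)
```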
First, my Python loop code, tested on 5 samples.
Number of elements 33718251
Number of regions 135137
Time taken = 8.644007 seconds...
Number of elements 42503100
Number of regions 6985
Time taken = 10.533305 seconds...
Number of elements 21841302
Number of regions 7619335
Time taken = 7.671015 seconds...
Number of elements 19723928
Number of regions 10799
Time taken = 5.014807 seconds...
Number of elements 16619539
Number of regions 19293
Time taken = 4.207359 seconds...
Now with Divakar's vectorized solution.
Number of elements 33718251
Number of regions 135137
Time taken = 0.063470 seconds...
Number of elements 42503100
Number of regions 6985
Time taken = 0.046293 seconds...
Number of elements 21841302
Number of regions 7619335
Time taken = 1.654288 seconds...
Number of elements 19723928
Number of regions 10799
Time taken = 0.022651 seconds...
Number of elements 16619539
Number of regions 19293
Time taken = 0.021189 seconds...
The modified approach gives roughly the same times (maybe 5% slower in the worst case).
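To give an idea of what a vectorized approach to this problem can look like, here is a minimal sketch that finds run boundaries by comparing each element with its predecessor. It is not necessarily the same as Divakar's code, and `contiguous_group_vectorized` is a name invented for this example.

```python
import numpy as np

def contiguous_group_vectorized(x):
    """Return an (n_runs, 2) array of [value, run_length] rows."""
    x = np.asarray(x)
    if x.size == 0:
        return np.empty((0, 2), dtype=x.dtype)
    # A run starts at index 0 and wherever a value differs from the previous one.
    starts = np.concatenate(([0], np.flatnonzero(x[1:] != x[:-1]) + 1))
    # Each run length is the distance to the next start (or to the end of the array).
    lengths = np.diff(np.concatenate((starts, [x.size])))
    return np.column_stack((x[starts], lengths))

x = np.array([1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0])
y = contiguous_group_vectorized(x)
print(y.tolist())
```

The temporary arrays here scale with the number of regions rather than with Python-level iterations, which would be consistent with the sample containing about 7.6 million regions being the slowest case above.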
Now with Kasramvd's generator approach.
Number of elements 33718251
Number of regions 135137
Time taken = 3.834922 seconds...
Number of elements 42503100
Number of regions 6985
Time taken = 4.785480 seconds...
Number of elements 21841302
Number of regions 7619335
Time taken = 6.806867 seconds...
Number of elements 19723928
Number of regions 10799
Time taken = 2.264413 seconds...
Number of elements 16619539
Number of regions 19293
Time taken = 1.778873 seconds...
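For comparison, a generator-based version can be sketched with `itertools.groupby`. This is only an assumption about what such an approach might look like, not necessarily Kasramvd's code; `contiguous_group_gen` is a hypothetical name.

```python
from itertools import groupby

def contiguous_group_gen(x):
    """Lazily yield [value, run_length] for each run of equal consecutive values."""
    for value, run in groupby(x):
        # `run` is an iterator over the run's elements; count them without building a list.
        yield [value, sum(1 for _ in run)]

x = [1,1,1,2,2,3,0,0,0,0,0,1,2,3,1,1,0,0,0]
groups = list(contiguous_group_gen(x))
print(groups)
```

This still touches every element in Python, so it is expected to sit between the plain loop and the vectorized version in speed.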
Now his numpythonic version.
Number of elements 33718251
Number of regions 135137
Time taken = 0.286336 seconds...
Number of elements 42503100
Number of regions 6985
Time taken = 0.174769 seconds...
Memory error sample 3 (too many regions)
Number of elements 19723928
Number of regions 10799
Time taken = 0.087028 seconds...
Number of elements 16619539
Number of regions 19293
Time taken = 0.084963 seconds...
Anyway, I think the moral of the story is that numpy is very good.
【Comments】:
-
For a start, you are missing a closing bracket on line 1.
-
Could you let us know what kind of speedup (if any) you get with the suggested solutions?
-
Sure, I will compare your solution with mine and with the others.
Tags: python arrays performance numpy