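Efficiently finding range of indices for positive values in 2D numpy array: answers

【Question title】: Efficiently finding range of indices for positive values in 2D numpy array
【Published】: 2015-06-18 14:47:06
【Question description】:

I have a large numpy array (typically 500,000x1024, but it can be larger) and I am trying to run a couple of processes that depend on where the positive values in the array are. A very small example array might be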

  [[ 0., 0., 0., 0., 0.,-1.,-1., 0., 0.],
   [ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   [ 0., 1., 1., 0., 0., 1., 5., 0., 0.],
   [ 0., 1., 1., 0., 0., 0., 1., 0., 0.],
   [ 0., 3., 1., 0., 0., 2., 1., 0., 0.],
   [ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   [ 0., 1., 0., 0., 0., 1., 1., 0., 0.],
   [ 0., 0., 0., 0., 0., 0., 0., 0., 0.]]

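The first is to replace any zeros that lie between positive values less than three columns apart in each row. So if I replace those zeros with 50, my example output would be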

 [[ 0., 0., 0., 0., 0.,-1.,-1., 0., 0.],
  [ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
  [ 0., 1., 1.,50.,50., 1., 5., 0., 0.],
  [ 0., 1., 1., 0., 0., 0., 1., 0., 0.],
  [ 0., 3., 1.,50.,50., 2., 1., 0., 0.],
  [ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
  [ 0., 1., 0., 0., 0., 1., 1., 0., 0.],
  [ 0., 0., 0., 0., 0., 0., 0., 0., 0.]]

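The second thing I need to do is write out some information for each row based on where the ranges of positive values are. For example, using my altered array, I need to be able to write out one statement for the third row declaring positive values for col[1:7], and two statements for the fourth row declaring positive values in col[1:3] and col[6].

I have managed to use numpy's vectorized methods for part of the first task, but I still end up resorting to looping over columns and rows (albeit on a subset of the whole array). Otherwise I end up replacing all of the zeros in a given row instead of just those between positive values.

For the second task, though, I can't seem to find a way to do it without looping over the whole array with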

for row in arr:
  for value in row:
    ...  # per-element work

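I guess my overall question is: is there a way to use the vectorized methods in numpy to define ranges of column indices that differ for each row and depend on the values in the following columns?

Any help would be much appreciated.

【Question discussion】:

Tags: python arrays numpy multidimensional-array

【Solution 1】:

Unfortunately, Numpy can't do much processing without generating more arrays, so I fear any solution will require either some form of manual loop like the one you've been using, or creating one or more additional large arrays. You might be able to come up with a fast and memory-efficient solution using numexpr.

Here's an attempt at doing this in a way that isn't necessarily memory efficient, but at least all the looping is done by Numpy, so it should be much faster than what you've been doing, as long as it fits in memory. (Memory efficiency could probably be improved by rewriting some of this as in-place operations, but I won't worry about that.)

Here's your step 1: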

positive = x>0 # a boolean array marking the positive values in x

positive0 = positive[:,0:-3] # all but last 3 columns 
positive1 = positive[:,1:-2] # all but 1st and last 2 columns; not actually used
positive2 = positive[:,2:-1] # all but first 2 and last 1 columns
positive3 = positive[:,3:  ] # all but first 3 columns

# In the following, the suffix 1 indicates that we're viewing things from the perspective
# of entries in positive1 above.  So, e.g., has_pos_1_to_left1 will be True at
# any position where an entry in positive1 would be preceded by a positive entry in x

has_pos_1_to_left1 = positive0
has_pos_1_or_2_to_right1 = positive2 | positive3
flanked_by_positives1 = has_pos_1_to_left1 & has_pos_1_or_2_to_right1

zeros = (x == 0)       # indicates everywhere x is 0
zeros1 = zeros[:,1:-2] # all but 1st and last 2 columns

x1 = x[:,1:-2]         # all but 1st and last 2 columns

x1[zeros1 & flanked_by_positives1] = 50 # fill in zeros that were flanked - overwrites x!

# The preceding didn't address the next to last column, b/c we couldn't
# look two slots to the right of it without causing error.  Needs special treatment
# (element-wise | is required here: Python's `or` raises ValueError on arrays):
x[:,-2][ zeros[:,-2] & positive[:,-1] & (positive[:,-4] | positive[:,-3])] = 50
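
As far as I can tell, apart from the next-to-last column, the step 1 code fills a zero only when the entry immediately to its left is positive, so in a two-zero gap only the first zero is caught. The following is a minimal sketch of one way to fill both, not the answerer's method. It assumes that a "gap" means one or two exact zeros with a positive value on each side in the same row, and the name `fill_short_gaps` and its `fill` parameter are made up for illustration.

```python
import numpy as np

def fill_short_gaps(x, fill=50.0):
    """Return a copy of x with runs of one or two zeros flanked by positives filled."""
    out = x.copy()
    pos = x > 0
    zero = x == 0

    # One-zero gap, pattern P 0 P: entry j refers to columns j, j+1, j+2.
    single = pos[:, :-2] & zero[:, 1:-1] & pos[:, 2:]
    # Two-zero gap, pattern P 0 0 P: entry j refers to columns j .. j+3.
    double = pos[:, :-3] & zero[:, 1:-2] & zero[:, 2:-1] & pos[:, 3:]

    fill_mask = np.zeros_like(pos)
    fill_mask[:, 1:-1] |= single   # the lone zero
    fill_mask[:, 1:-2] |= double   # first zero of a pair
    fill_mask[:, 2:-1] |= double   # second zero of a pair

    out[fill_mask] = fill
    return out

rows = np.array([[0., 1., 1., 0., 0., 1., 5., 0., 0.],
                 [0., 1., 1., 0., 0., 0., 1., 0., 0.]])
print(fill_short_gaps(rows))
```

In the first row the two zeros between the positives are replaced; in the second row the gap is three zeros wide, so nothing changes. Like the original, this builds several temporary boolean arrays the size of the input.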

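Here's your step 2:

import numpy

filled_positives = x>0 # assuming we just filled in x
# cast to a signed integer type first: in current NumPy, diff of a boolean
# array stays boolean, so it could never equal -1
diffs = numpy.diff(filled_positives.astype(numpy.int8), axis=1)
# diffs[:, j] is 1 where column j+1 starts a positive sequence,
# -1 where column j is the last positive of a sequence, zero elsewhere

endings = numpy.where(diffs==-1) # tuple specifying coords where positive sequences end 
                                 # omits final column!!!
beginnings = numpy.where(diffs==1) # tuple specifying coords where pos seqs about to start
                                   # omits column #0!!!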

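Using these beginning and ending coordinates to extract the per-row information you said you need should be straightforward. Keep in mind, though, that this diff-based detection only catches transitions from non-positive to positive or vice versa, so it will not report positive sequences that begin in column zero or end in the final column. If you need those, you will have to look for these non-transitions separately.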
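
One possible way to pick up runs that touch the first or last column is to pad the mask with a non-positive column on each side before taking the difference. This is a sketch under that assumption rather than part of the original answer, and the helper name `positive_runs` is hypothetical. It returns half-open ranges, so `x[row, start:stop]` is one maximal run of positives.

```python
import numpy as np

def positive_runs(x):
    """Return (rows, starts, stops) describing every maximal run of positives in each row."""
    pos = x > 0
    # Pad with a zero column on both sides so edge runs still produce a transition.
    padded = np.zeros((pos.shape[0], pos.shape[1] + 2), dtype=np.int8)
    padded[:, 1:-1] = pos
    d = np.diff(padded, axis=1)          # shape: (n_rows, n_cols + 1)
    rows, starts = np.nonzero(d == 1)    # column where a run begins
    _, stops = np.nonzero(d == -1)       # one past the column where it ends
    # np.nonzero scans in row-major order, so the k-th start pairs with the k-th stop.
    return rows, starts, stops

x = np.array([[0., 1., 1., 50., 50., 1., 5., 0., 0.],
              [0., 1., 1., 0., 0., 0., 1., 0., 0.],
              [2., 0., 0., 0., 0., 0., 0., 4., 4.]])
for row, start, stop in zip(*positive_runs(x)):
    print('row:%d ; col[%d:%d]' % (row, start, stop))
```

The Python-level loop here runs once per run of positives, only to format the output; the search itself is done by NumPy. The padded `int8` copy costs roughly one extra byte per array element.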

【Discussion】:

【Solution 2】:

You can use numpy's efficient iterators, such as flatiter or nditer.

For example, for your second task:

In [1]: from numpy import array
   ...: x = array([[ 0., 0., 0., 0., 0.,-1.,-1., 0., 0.],
   ...:            [ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   ...:            [ 0., 1., 1.,50.,50., 1., 5., 0., 0.],
   ...:            [ 0., 1., 1., 0., 0., 0., 1., 0., 0.],
   ...:            [ 0., 3., 1.,50.,50., 2., 1., 0., 0.],
   ...:            [ 0., 0., 0., 0., 0., 0., 0., 0., 0.],
   ...:            [ 0., 1., 0., 0., 0., 1., 1., 0., 0.],
   ...:            [ 0., 0., 0., 0., 0., 0., 0., 0., 0.]])

In [2]: islands = []
   ...: fl = x.flat
   ...: # note: x.flat walks the array as 1-D, so a run reaching the last column
   ...: # would merge with a run starting in column 0 of the next row
   ...: while fl.index < x.size:
   ...:     coord = fl.coords            # (row, col) of the element about to be read
   ...:     if next(fl) > 0:
   ...:         length = 1
   ...:         while next(fl, 0) > 0:   # default 0 ends the run at the end of the array
   ...:             length +=1
   ...:         islands.append([coord, length])

In [3]: for (row, col), length in islands:
   ...:     print('row:%d ; col[%d:%d]' % (row, col, col+length))
row:2 ; col[1:7]
row:3 ; col[1:3]
row:3 ; col[6:7]
row:4 ; col[1:7]
row:6 ; col[1:2]
row:6 ; col[5:7]
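
A per-row variant may be worth considering if runs must never be merged across row boundaries. The sketch below is not from the answer above; it assumes one Python-level iteration per row is acceptable, and the helper name `row_islands` is hypothetical. It finds the positive columns of a row with `np.flatnonzero` and splits them wherever consecutive columns are not adjacent.

```python
import numpy as np

def row_islands(row):
    """Return [(start, stop), ...] for maximal runs of positives in a 1-D row."""
    cols = np.flatnonzero(row > 0)
    if cols.size == 0:
        return []
    # A new run begins wherever consecutive positive columns are not adjacent.
    breaks = np.flatnonzero(np.diff(cols) > 1) + 1
    return [(int(run[0]), int(run[-1]) + 1) for run in np.split(cols, breaks)]

x = np.array([[0., 1., 1., 50., 50., 1., 5., 0., 0.],
              [0., 1., 1., 0., 0., 0., 1., 0., 0.]])
for r, row in enumerate(x):
    for start, stop in row_islands(row):
        print('row:%d ; col[%d:%d]' % (r, start, stop))
```

This still loops in Python, but over rows and runs rather than over every element.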

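【Discussion】:

【Solution 3】:

For your first problem: create a variable that holds the index of the first positive number you have encountered, and use an if statement that resets that position when the next value is positive and the count (a variable tracking the distance from the first positive number) is less than 3.

For your second problem: create an array and add to it the indices of the positive values.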

 indices = []                                   # strings of the form "[col:row]"
 for row, values in enumerate(arr):
     for col, value in enumerate(values):
         if value > 0:                          # entry is positive
             indices.append("[%d:%d]" % (col, row))

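【Discussion】:

Thanks for the answer, but this still requires for loops over every column and every row, which is exactly what I am trying to avoid. My arrays are large, so this takes a long time. I was hoping there was a way to do this with built-in functions that does not require iterating over the array.
How are you creating the array? Technically you could create an array list of objects that hold the index, the value, and whether it is positive. Then you could use a for loop to grab and return all the ones you want. This solution runs in O(N) time, assuming you aren't using a nested for loop to create the array in the first place.
The array creation happens completely separately, but the arrays actually represent a kind of mask for the real data, which is held in a separate array of the same shape.
Various processes are run on the data, and the results update this "mask" array. Once all of that is done, it is passed to the process that runs the operations above. Could you give an example of how your suggestion would work? I'm not sure I can actually create the original arrays the way you suggest because of their relationship to the data. The data structure, i.e. the 2D array, also represents information about the data itself. It is frequency x time, and the statements I write out will include the relevant frequencies and times.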

【Solution 4】:

The second way is to have the data create objects, so suppose you have a class:

public class Matrix {
   int indicex;
   int indicey;
   double val;
   boolean positiveInt;

   // constructor: stores the position, the value, and whether it is positive
   public Matrix(int indicex, int indicey, double val, boolean positiveInt) {
      this.indicex = indicex;
      this.indicey = indicey;
      this.val = val;
      this.positiveInt = positiveInt;
   }

   // getter
   public boolean isPositive() {
      return positiveInt;
   }
}

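Then in your driver class you would read in the data and create an object with new Matrix(indexx, indexy, val, true/false) for each entry, and put it into an array list that you can then search for positives.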

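// needs: import java.util.ArrayList; import java.util.List;
List<Matrix> storeObjects = new ArrayList<Matrix>();
List<Matrix> positiveObjects = new ArrayList<Matrix>();

void someMethod(int indexx, int indexy, double val, boolean trueOrFalse) {
   Matrix matrixObject = new Matrix(indexx, indexy, val, trueOrFalse);
   storeObjects.add(matrixObject);
}

void collectPositives() {
   for (Matrix object : storeObjects) {
      if (object.isPositive()) {
         positiveObjects.add(object);   // keep positives in a separate list
      }
   }
}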

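【Discussion】:

That makes sense, although it would still take quite a bit of manipulation of the final array to work out which columns on each row (i.e. the frequencies at each time) I need to write into the final statements. My main question about the suggestion, though, is this: since I cannot create this matrix alongside the initial data array, as far as I can see I would still end up doing an element-by-element loop over the whole array just to turn it into a list of Matrix objects in the first place?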