Group by max or min in a numpy array: answers

【Question title】: Group by max or min in a numpy array
【Posted】: 2011-12-24 06:09:57
【Question】:

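I have two equal-length 1D numpy arrays, id and data, where id is a sequence of repeating, ordered integers that defines sub-windows on data. For example, using the sample arrays that appear in Solution 5: id = [1, 1, 1, 2, 2, 2, 3, 3] and data = [2, 7, 3, 8, 9, 10, 1, -10].

I would like to aggregate data by grouping on id and taking either the max or the min.

In SQL this would be a typical aggregation query such as SELECT MAX(data) FROM tablename GROUP BY id ORDER BY id.

Is there a way to avoid Python loops and do this in a vectorized manner?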

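【Comments】:

Tags: python python-3.x numpy group-by

【Solution 1】:

I've seen some very similar questions on Stack Overflow over the last few days. The following code is very similar to the implementation of numpy.unique, and because it takes advantage of the underlying numpy machinery, it will most likely be faster than anything you can do in a Python loop.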

import numpy as np
def group_min(groups, data):
    # sort with major key groups, minor key data
    order = np.lexsort((data, groups))
    groups = groups[order] # this is only needed if groups is unsorted
    data = data[order]
    # construct an index which marks borders between groups
    index = np.empty(len(groups), 'bool')
    index[0] = True
    index[1:] = groups[1:] != groups[:-1]
    return data[index]

#max is very similar
def group_max(groups, data):
    order = np.lexsort((data, groups))
    groups = groups[order] #this is only needed if groups is unsorted
    data = data[order]
    index = np.empty(len(groups), 'bool')
    index[-1] = True
    index[:-1] = groups[1:] != groups[:-1]
    return data[index]
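
The two functions above return only the reduced values, so the caller has to know which group each value belongs to. A minimal sketch of a combined variant follows, assuming non-empty 1D inputs; the name `group_min_max` is hypothetical and not part of the original answer. It sorts once and reads the minimum at each group's first position and the maximum at its last.

```python
import numpy as np

def group_min_max(groups, data):
    """Return (unique groups, per-group min, per-group max) with one lexsort.

    Assumes non-empty 1D inputs of equal length.
    """
    groups = np.asarray(groups)
    data = np.asarray(data)
    # lexsort treats the LAST key as primary: sort by groups, ties by data.
    order = np.lexsort((data, groups))
    groups, data = groups[order], data[order]
    # change[i] is True when position i+1 belongs to a different group than i.
    change = groups[1:] != groups[:-1]
    first = np.concatenate(([True], change))  # start of each group -> min
    last = np.concatenate((change, [True]))   # end of each group -> max
    return groups[first], data[first], data[last]

ids = np.array([3, 1, 2, 1, 2, 3, 2, 1])  # deliberately unsorted
data = np.array([1, 2, 8, 7, 9, -10, 10, 3])

keys, mins, maxs = group_min_max(ids, data)
print(keys.tolist())  # [1, 2, 3]
print(mins.tolist())  # [2, 8, -10]
print(maxs.tolist())  # [7, 10, 1]
```

Returning the keys alongside the values mirrors the SELECT ... GROUP BY id ORDER BY id shape from the question.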

【Discussion】:

【Solution 2】:

In pure Python:

from itertools import groupby
from operator import itemgetter as ig

# Python 3: the built-in map and zip replace itertools.imap and izip
print([max(map(ig(1), g)) for k, g in groupby(zip(id, data), key=ig(0))])
# -> [7, 10, 1]

A variant:

print([data[id==i].max() for i, _ in groupby(id)])
# -> [7, 10, 1]

Based on @Bago's answer:

import numpy as np

# sort by `id` then by `data`
ndx = np.lexsort(keys=(data, id))
id, data = id[ndx], data[ndx]

# get max()
# the np.bool alias was removed in NumPy 1.24; the built-in bool works everywhere
print(data[np.r_[np.diff(id), True].astype(bool)])
# -> [ 7 10  1]

If pandas is installed:

from pandas import DataFrame

df = DataFrame(dict(id=id, data=data))
print(df.groupby('id')['data'].max())
# id
# 1    7
# 2    10
# 3    1

【Discussion】:

【Solution 3】:

I'm pretty new to Python and NumPy, but it seems you can use the .at method of ufuncs rather than reduceat:

import numpy as np
data_id = np.array([0,0,0,1,1,1,1,2,2,2,3,3,3,4,5,5,5])
data_val = np.random.rand(len(data_id))
# start from -inf rather than np.empty, whose uninitialized contents could win the maximum
ans = np.full(data_id.max() + 1, -np.inf)
np.maximum.at(ans, data_id, data_val)

For example:

data_val = array([ 0.65753453,  0.84279716,  0.88189818,  0.18987882,  0.49800668,
    0.29656994,  0.39542769,  0.43155428,  0.77982853,  0.44955868,
    0.22080219,  0.4807312 ,  0.9288989 ,  0.10956681,  0.73215416,
    0.33184318,  0.10936647])
ans = array([ 0.88189818,  0.49800668,  0.77982853,  0.9288989 ,  0.10956681,
    0.73215416])

Of course, this only makes sense if your data_id values are suitable for use as indices (i.e. non-negative integers that are not huge; presumably if they are large or sparse you could initialize ans using np.unique(data_id) or something similar).

I should point out that data_id doesn't actually need to be sorted.
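
The np.unique idea mentioned above for large or sparse ids could look roughly like the sketch below. The function name `group_max_at` and the sample values are made up for illustration; `return_inverse=True` is used to map each id to a dense index before calling `np.maximum.at`.

```python
import numpy as np

def group_max_at(ids, values):
    """Per-group maximum for arbitrary (sparse, unsorted) 1D integer ids."""
    ids = np.asarray(ids)
    values = np.asarray(values, dtype=float)
    # inverse[i] is the position of ids[i] inside unique_ids, so it is a
    # dense index in the range 0..n_groups-1 regardless of how large ids are.
    unique_ids, inverse = np.unique(ids, return_inverse=True)
    # Start from -inf so the first value seen in each group always wins.
    result = np.full(len(unique_ids), -np.inf)
    np.maximum.at(result, inverse, values)
    return unique_ids, result

sparse_ids = np.array([1000000, 5, 5, 1000000, 42])
vals = np.array([0.5, 2.0, 3.5, 1.5, -1.0])

uids, maxima = group_max_at(sparse_ids, vals)
print(uids.tolist())    # [5, 42, 1000000]
print(maxima.tolist())  # [3.5, -1.0, 1.5]
```

The result array has one slot per distinct id rather than one per possible id value, so memory no longer depends on the largest id.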

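【Discussion】:

【Solution 4】:

I've packaged a version of my previous answer in the numpy_indexed package; it's nice to have all of this wrapped up in a tidy interface and tested; it also has a lot more functionality: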

import numpy_indexed as npi
group_id, group_max_data = npi.group_by(id).max(data)

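And so on.

【Discussion】:

【Solution 5】:

Using only numpy, without loops:

import numpy as np
import pandas as pd  # only needed for the comparison at the end

id = np.asarray([1,1,1,2,2,2,3,3])
data = np.asarray([2,7,3,8,9,10,1,-10])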

# max
_ndx = np.argsort(id)
_id, _pos  = np.unique(id[_ndx], return_index=True)
g_max = np.maximum.reduceat(data[_ndx], _pos)

# min
_ndx = np.argsort(id)
_id, _pos  = np.unique(id[_ndx], return_index=True)
g_min = np.minimum.reduceat(data[_ndx], _pos)

# compare results with pandas groupby
np_group = pd.DataFrame({'min':g_min, 'max':g_max}, index=_id)
pd_group = pd.DataFrame({'id':id, 'data':data}).groupby('id').agg(['min','max'])

(pd_group.values == np_group.values).all()  # True

【Discussion】:

【Solution 6】:

I think this does what you're looking for:

[max([val for idx,val in enumerate(data) if id[idx] == k]) for k in sorted(set(id))]

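For the outer list comprehension, reading from right to left: set(id) groups the ids, sorted() sorts them, for k ... iterates over them, and max takes the maximum of, in this case, another list comprehension. Moving on to that inner list comprehension: enumerate(data) returns both the index and the value from data, and if id[idx] == k picks out the members of data that correspond to id k.

This iterates over the full data list for each id. With some preprocessing into sublists it might be possible to speed it up, but it would no longer be a one-liner.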
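
The sublist preprocessing mentioned above might look like this sketch, which walks the data once and buckets values by id before reducing. The name `group_max_single_pass` is hypothetical, and the sample lists are the ones from Solution 5.

```python
from collections import defaultdict

def group_max_single_pass(ids, data):
    """Bucket values by id in one pass, then take the max of each bucket."""
    buckets = defaultdict(list)
    for key, value in zip(ids, data):
        buckets[key].append(value)
    # Sort the keys so the output order matches sorted(set(ids)).
    return [max(buckets[key]) for key in sorted(buckets)]

ids = [1, 1, 1, 2, 2, 2, 3, 3]
data = [2, 7, 3, 8, 9, 10, 1, -10]
print(group_max_single_pass(ids, data))  # [7, 10, 1]
```

This visits each element once instead of once per distinct id, at the cost of holding the buckets in memory.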

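【Discussion】:

【Solution 7】:

The following solution only requires a sort on the data (not a lexsort) and does not need to find the boundaries between groups. It relies on the fact that if o is an array of indices into r, then r[o] = x fills r with the latest value in x for each value of o, so that r[[0, 0]] = [1, 2] leaves r[0] equal to 2. It requires that your groups are integers from 0 to the number of groups - 1, as for numpy.bincount, and that every group has at least one value:

import numpy as np

def group_min(groups, data):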
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)[::-1]
    result[groups.take(order)] = data.take(order)
    return result

def group_max(groups, data):
    n_groups = np.max(groups) + 1
    result = np.empty(n_groups)
    order = np.argsort(data)
    result[groups.take(order)] = data.take(order)
    return result
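
To see the assignment trick in isolation, here is a small sketch that checks it against np.maximum.at on the sample data (with ids shifted to start at 0). One caveat, stated as my understanding rather than something from the answer: NumPy's indexing documentation does not promise an iteration order when the same index is assigned more than once, so the last-write-wins behaviour is an observation about current implementations, and the cross-check is worth keeping.

```python
import numpy as np

# The core observation: index 0 is written twice, and the later value stays.
r = np.zeros(1, dtype=int)
r[[0, 0]] = [1, 2]
print(r[0])

groups = np.array([0, 0, 0, 1, 1, 1, 2, 2])
data = np.array([2, 7, 3, 8, 9, 10, 1, -10])

# Ascending sort: within each group the largest value is written last.
order = np.argsort(data)
result = np.empty(groups.max() + 1, dtype=data.dtype)
result[groups[order]] = data[order]

# Independent cross-check using an explicit unbuffered maximum.
check = np.full(groups.max() + 1, np.iinfo(data.dtype).min, dtype=data.dtype)
np.maximum.at(check, groups, data)

print(result.tolist())
print(np.array_equal(result, check))
```

If the cross-check ever disagrees on a given NumPy version, the np.maximum.at form is the safer one to keep.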

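【Discussion】:

【Solution 8】:

A slightly faster and more general answer than the already accepted one; like joeln's answer, it avoids the more expensive lexsort, and it works for arbitrary ufuncs. Moreover, it only requires that the keys be sortable, rather than being integers in a specific range. The accepted answer may still be faster, though, considering that the max/min is not explicitly computed there. The accepted solution's ability to ignore nan is neat, but you can also simply assign nan values a dummy key.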

import numpy as np

def group(key, value, operator=np.add):
    """
    group the values by key
    any ufunc operator can be supplied to perform the reduction (np.maximum, np.minimum, np.subtract, and so on)
    returns the unique keys, their corresponding per-key reduction over the operator, and the keycounts
    """
    #upcast to numpy arrays
    key = np.asarray(key)
    value = np.asarray(value)
    #first, sort by key
    I = np.argsort(key)
    key = key[I]
    value = value[I]
    #the slicing points of the bins to sum over
    slices = np.concatenate(([0], np.where(key[:-1]!=key[1:])[0]+1))
    #first entry of each bin is a unique key
    unique_keys = key[slices]
    #reduce over the slices specified by index
    per_key_sum = operator.reduceat(value, slices)
    #number of counts per key is the difference of our slice points. cap off with number of keys for last bin
    key_count = np.diff(np.append(slices, len(key)))
    return unique_keys, per_key_sum, key_count


names = ["a", "b", "b", "c", "d", "e", "e"]
values = [1.2, 4.5, 4.3, 2.0, 5.67, 8.08, 9.01]

unique_keys, reduced_values, key_count = group(names, values)
print('per group mean')
print(reduced_values / key_count)
unique_keys, reduced_values, key_count = group(names, values, np.minimum)
print('per group min')
print(reduced_values)
unique_keys, reduced_values, key_count = group(names, values, np.maximum)
print('per group max')
print(reduced_values)
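
The dummy-key idea for nan values mentioned at the top of this answer could be sketched as follows. The helper `group_reduce` is a compact re-statement of the sort-plus-reduceat approach so that the block runs on its own, and `DUMMY = -1` is a hypothetical sentinel that is assumed not to collide with any real key.

```python
import numpy as np

def group_reduce(key, value, operator=np.maximum):
    """Sort by key, then reduce each run of equal keys with operator.reduceat."""
    key = np.asarray(key)
    value = np.asarray(value)
    order = np.argsort(key, kind="stable")
    key, value = key[order], value[order]
    # Start index of every run of equal keys.
    starts = np.concatenate(([0], np.flatnonzero(key[:-1] != key[1:]) + 1))
    return key[starts], operator.reduceat(value, starts)

keys = np.array([0, 0, 1, 1, 2, 2])
values = np.array([1.0, np.nan, 4.0, 3.0, np.nan, 5.0])

DUMMY = -1  # hypothetical sentinel; must not be a real key
# Move every nan value into the dummy group so it cannot poison a real group.
masked_keys = np.where(np.isnan(values), DUMMY, keys)

unique_keys, group_max = group_reduce(masked_keys, values)
keep = unique_keys != DUMMY
print(unique_keys[keep].tolist())  # [0, 1, 2]
print(group_max[keep].tolist())    # [1.0, 4.0, 5.0]
```

Note that a group whose values are all nan disappears from the output with this approach, which may or may not be what you want.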

【Discussion】: