pandas concat arrays on groupby: answers

【Question title】: pandas concat arrays on groupby
【Posted】: 2015-12-03 09:38:50
【Question description】:

I have a DataFrame created by a group by:

agg_df = df.groupby(['X', 'Y', 'Z']).agg({
    'amount':np.sum,
    'ID': pd.Series.unique,
})

After applying some filtering to agg_df, I want to concatenate the IDs:

agg_df = agg_df.groupby(['X', 'Y']).agg({  # Z is no longer in the groupby
    'amount':np.sum,
    'ID': pd.Series.unique,
})

But I get an error at the second 'ID': pd.Series.unique:

ValueError: Function does not reduce

For example, the dataframe before the second groupby is:

               |amount|  ID   |
-----+----+----+------+-------+
  X  | Y  | Z  |      |       |
-----+----+----+------+-------+
  a1 | b1 | c1 |  10  | 2     |
     |    | c2 |  11  | 1     |
  a3 | b2 | c3 |   2  | [5,7] |
     |    | c4 |   7  | 3     |
  a5 | b3 | c3 |  12  | [6,3] |
     |    | c5 |  17  | [3,4] |
  a7 | b4 | c6 |  2   | [8,9] |

and the expected result should be:

          |amount|  ID       |
-----+----+------+-----------+
  X  | Y  |      |           |
-----+----+------+-----------+
  a1 | b1 |  21  | [2,1]     |
  a3 | b2 |   9  | [5,7,3]   |
  a5 | b3 |  29  | [6,3,4]   |
  a7 | b4 |  2   | [8,9]     |

The order of the final IDs does not matter.
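For anyone who wants to experiment, the intermediate frame shown above can be rebuilt roughly as follows. This is only a sketch: the values are copied from the table, and storing the multi-ID cells as plain Python lists is an assumption (the original agg with pd.Series.unique would have produced NumPy arrays).

```python
import pandas as pd

# Rows copied from the example table: (X, Y, Z, amount, ID).
# Multi-ID cells are stored as plain lists (an assumption for this sketch).
rows = [
    ("a1", "b1", "c1", 10, 2),
    ("a1", "b1", "c2", 11, 1),
    ("a3", "b2", "c3", 2, [5, 7]),
    ("a3", "b2", "c4", 7, 3),
    ("a5", "b3", "c3", 12, [6, 3]),
    ("a5", "b3", "c5", 17, [3, 4]),
    ("a7", "b4", "c6", 2, [8, 9]),
]
agg_df = pd.DataFrame(rows, columns=["X", "Y", "Z", "amount", "ID"])
agg_df = agg_df.set_index(["X", "Y", "Z"])

# The amount column of the expected result is a plain sum per (X, Y).
amount_by_xy = agg_df.groupby(level=["X", "Y"])["amount"].sum()
print(amount_by_xy)
```

The ID column is the hard part, because its cells mix scalars and list-like values.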

Edit: I came up with a solution, but it is not very elegant:

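import collections.abc
import numpy as np

def combine_ids(x):
    def asarray(elem):
        # Turn list-like cells into arrays; leave scalar IDs untouched.
        # collections.Iterable was removed in Python 3.10; use collections.abc.
        if isinstance(elem, collections.abc.Iterable):
            return np.asarray(list(elem))
        return elem

    # np.hstack accepts a mix of scalars and 1-D arrays and flattens them.
    res = np.hstack([asarray(elem) for elem in x.values])
    return set(np.unique(res))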

agg_df = agg_df.groupby(['X', 'Y']).agg({  # Z is no longer in the groupby
    'amount':np.sum,
    'ID': combine_ids,
})

Edit 2: Another solution that works for me is:

combine_ids = lambda x: set(np.hstack(x.values))
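The one-liner works because np.hstack promotes every element to at least one dimension before concatenating, so scalar cells and list-like cells end up in a single flat array. A small sketch, where the cells Series is a made-up stand-in for the ID values of one (X, Y) group:

```python
import numpy as np
import pandas as pd

combine_ids = lambda x: set(np.hstack(x.values))

# Hypothetical ID cells for one (X, Y) group: one list-like cell, one scalar.
cells = pd.Series([[5, 7], 3], dtype=object)

# hstack flattens the mixed cells into a single 1-D array.
flat = np.hstack(cells.values)
ids = combine_ids(cells)
print(flat, sorted(int(i) for i in ids))
```

Wrapping the result in set() drops duplicates, which matches the requirement that the order of the final IDs does not matter.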

Edit 3: Because of how pandas implements aggregation functions, it seems that set() cannot be avoided as the result type. Details at https://stackoverflow.com/a/16975602/3142459

【Question discussion】:

You can find some more recipes for flattening (arbitrarily deeply nested) sequences here.
As far as I know, you cannot return a list or an array from an aggregation method.

Tags: python pandas

【Solution 1】:

If you can use sets as your type (which I probably would), then I would go with:

agg_df = df.groupby(['x','y','z']).agg({
    'amount': np.sum, 'id': lambda s: set(s)})
agg_df.reset_index().groupby(['x','y']).agg({
    'amount': np.sum, 'id': lambda s: set.union(*s)})

...which works for me. For some reason, lambda s: set(s) works but set does not (I guess pandas is not duck typing correctly somewhere).

If your data is large, you may want the following instead of lambda s: set.union(*s):

from functools import reduce
# can't partial b/c args are positional-only
def cheaper_set_union(s):
    return reduce(set.union, s, set())
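To show how the reduce-based helper slots into the two-stage aggregation, here is a minimal end-to-end sketch. The sample frame is invented for illustration (it loosely follows the first two groups of the question's table), and the string "sum" is used in place of np.sum, which should behave the same here.

```python
from functools import reduce

import pandas as pd

def cheaper_set_union(s):
    # Fold the sets pairwise instead of unpacking them all as arguments.
    return reduce(set.union, s, set())

# Hypothetical raw data, one row per (x, y, z, id) observation.
df = pd.DataFrame({
    "x": ["a1", "a1", "a3", "a3", "a3"],
    "y": ["b1", "b1", "b2", "b2", "b2"],
    "z": ["c1", "c2", "c3", "c3", "c4"],
    "amount": [10, 11, 1, 1, 7],
    "id": [2, 1, 5, 7, 3],
})

# Stage 1: one set of ids per (x, y, z).
by_xyz = df.groupby(["x", "y", "z"]).agg(
    {"amount": "sum", "id": lambda s: set(s)})

# Stage 2: drop z and merge the per-z sets into one set per (x, y).
by_xy = by_xyz.reset_index().groupby(["x", "y"]).agg(
    {"amount": "sum", "id": cheaper_set_union})
print(by_xy)
```

Passing set() as the initial value means reduce builds a fresh result set rather than returning one of the stored cells.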

【Discussion】:

【Solution 2】:

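When your aggregation function returns a Series, pandas does not necessarily know that you want it packed into a single cell. As a more general solution, just explicitly cast the result to a list.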

agg_df = df.groupby(['X', 'Y', 'Z']).agg({
    'amount':np.sum,
    'ID': lambda x: list(x.unique()),
})
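The snippet above covers only the first groupby. A possible follow-up for the second stage, where each ID cell is already a list, is to chain the lists together and de-duplicate them. This is a sketch under assumptions: the sample frame is invented, dict.fromkeys is just one way to drop duplicates while keeping first-seen order, and returning a list from agg is assumed to be accepted by the installed pandas version (recent releases appear to allow it).

```python
from itertools import chain

import pandas as pd

# Hypothetical raw data, one row per (X, Y, Z, ID) observation.
df = pd.DataFrame({
    "X": ["a1", "a1", "a3", "a3", "a3"],
    "Y": ["b1", "b1", "b2", "b2", "b2"],
    "Z": ["c1", "c2", "c3", "c3", "c4"],
    "amount": [10, 11, 1, 1, 7],
    "ID": [2, 1, 5, 7, 3],
})

# Stage 1, as in the answer: pack the unique IDs of each group into a list.
first = df.groupby(["X", "Y", "Z"]).agg(
    {"amount": "sum", "ID": lambda x: list(x.unique())})

# Stage 2: flatten the per-Z lists and drop duplicates, keeping order.
second = first.groupby(level=["X", "Y"]).agg({
    "amount": "sum",
    "ID": lambda cells: list(dict.fromkeys(chain.from_iterable(cells))),
})
print(second)
```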

【Discussion】: