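Broadcast groupby result as new column in original DataFrame: answers

【Question title】: Broadcast groupby result as new column in original DataFrame
【Posted】: 2019-05-13 19:08:36
【Question description】:

I am trying to create a new column in a Pandas DataFrame based on two columns of a grouped DataFrame.

Specifically, I am trying to replicate the output of this R code: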

library(data.table)

df = data.table(a = 1:6, 
            b = 7:12,
            c = c('q', 'q', 'q', 'q', 'w', 'w')
            )


df[, ab_weighted := sum(a)/sum(b), by = "c"]
df[, c('c', 'a', 'b', 'ab_weighted')]

Output: (image not preserved)

So far, I have tried the following in Python:

import pandas as pd

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[7,8,9,10,11,12],
               'c':['q', 'q', 'q', 'q', 'w', 'w']
              })

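df.groupby(['c'])[['a', 'b']].apply(lambda x: sum(x['a'])/sum(x['b']))

Output: (image not preserved)

When I change apply to transform in the code above, I get an error: TypeError: an integer is required

If I use only one column, transform works fine: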

import pandas as pd

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[7,8,9,10,11,12],
               'c':['q', 'q', 'q', 'q', 'w', 'w']
              })

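df.groupby(['c'])[['a', 'b']].transform(lambda x: sum(x))

But obviously, this is not the same answer (output image not preserved).

Is there a way to get the result of my data.table code in Pandas without having to generate intermediate columns (since with those I could simply use transform on the final column)?

Any help is much appreciated :)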

【Question comments】:

Tags: python pandas dataframe group-by pandas-groupby

【Solution 1】:

This works well:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[7,8,9,10,11,12],
               'c':['q', 'q', 'q', 'q', 'w', 'w']
              })

def groupby_transform(df: pd.DataFrame, group_by_column: str, lambda_to_apply) -> np.ndarray:
    """
    Groupby and transform. Returns a column for the original dataframe.
    :param df: Dataframe.
    :param group_by_column: Column(s) to group by.
    :param lambda_to_apply: Lambda.
    :return: Column to append to original dataframe.
    """
    df = df.reset_index(drop=True)  # Dataframe index is now strictly in order of the rows in the original dataframe.
    values = df.groupby(group_by_column).apply(lambda_to_apply)
    values.sort_index(level=1, inplace=True)  # Sorts result into order of original rows in dataframe (as groupby will undo that order when it groups).
    result = np.array(values)  # Convert to a NumPy array; rows are already in the original order.
    if result.shape[0] == 1:  # e.g. if shape is (1,1003), make it (1003,).
        result = result[0]
    return result  # Return column.


df["result"] = groupby_transform(df, "c", lambda x: x["a"].shift(1) + x["b"].shift(1))

Output:

   a   b  c  result
0  1   7  q     NaN
1  2   8  q     8.0
2  3   9  q    10.0
3  4  10  q    12.0
4  5  11  w     NaN
5  6  12  w    16.0

The same as above, written as a Pandas extension:

@pd.api.extensions.register_dataframe_accessor("ex")
class GroupbyTransform:
    """
    Groupby and transform. Returns a column for the original dataframe.
    """
    def __init__(self, pandas_obj):
        self._validate(pandas_obj)
        self._obj = pandas_obj

    @staticmethod
    def _validate(obj):
        # TODO: Check that dataframe is sorted, throw if not.
        pass

    def groupby_transform(self, group_by_column: str, lambda_to_apply):
        """
        Groupby and transform. Returns a column for the original dataframe.
        :param group_by_column: Column(s) to group by.
        :param lambda_to_apply: Lambda.
        :return: Column to append to original dataframe.
        """
        df = self._obj.reset_index(drop=True)  # Dataframe index is now strictly in order of the rows in the original dataframe.
        values = df.groupby(group_by_column).apply(lambda_to_apply)
        values.sort_index(level=1, inplace=True)  # Sorts result into order of original rows in dataframe (as groupby will undo that order when it groups).
        result = np.array(values)
        if result.shape[0] == 1:  # e.g. if shape is (1,1003), make it (1003,).
            result = result[0]
        return result

This gives the same output as before:

df["result"] = df.ex.groupby_transform("c", lambda x: x["a"].shift(1) + x["b"].shift(1))
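For the specific shift example above, pandas also has a built-in grouped shift that keeps the original index, so no re-sorting is needed. The following is a minimal sketch of that alternative, not the helper's implementation; the column name result_builtin is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [7, 8, 9, 10, 11, 12],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w']})

# Shift 'a' and 'b' down by one row within each group of 'c'.
# The result keeps the index of df, so it lines up row by row.
shifted = df.groupby('c')[['a', 'b']].shift(1)

# 'result_builtin' is a hypothetical column name for this sketch.
df['result_builtin'] = shifted['a'] + shifted['b']
print(df)
```

The first row of each group is NaN because there is nothing to shift in from, which matches the output shown above.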

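【Comments】:

【Solution 2】:

Update 2021-03-28: I do not recommend this answer; I recommend my other one instead, as it is cleaner and more efficient.

Try @BENY's answer first. If it does not work, the cause may be mismatched indexes.

The solution below is ugly and more complex, but it should give enough clues to make it work with any DataFrame, not just toy ones. This is an area of pandas where the API is admittedly awkward and error-prone, and sometimes there is simply no clean way to get a valid result without jumping through a lot of hoops.

The trick is to make sure a common index is available and has the same name on both sides.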

import pandas as pd

df = pd.DataFrame({'a':[1,2,3,4,5,6],
               'b':[7,8,9,10,11,12],
               'c':['q', 'q', 'q', 'q', 'w', 'w']
              })

df.reset_index(drop=True, inplace=True)

values = df.groupby(['c']).apply(lambda x: sum(x['a'])/sum(x['b']))
# Convert result to dataframe.
df_to_join = values.to_frame()

# Ensure indexes have common names.
df_to_join.index.set_names(["index"], inplace=True)
df.set_index("c", inplace=True)
df.index.set_names(["index"], inplace=True)

# Set column name of result we want.
df_to_join.rename(columns={0: "ab_weighted"}, inplace=True, errors='raise')

# Join result of groupby to original dataframe.
df_result = df.merge(df_to_join, on=["index"])
print(df_result)

# output 
       a   b  ab_weighted
index                    
q      1   7     0.294118
q      2   8     0.294118
q      3   9     0.294118
q      4  10     0.294118
w      5  11     0.478261
w      6  12     0.478261

To convert the index back into column c:

df_result.reset_index(inplace=True)
df_result.rename(columns={"index": "c"}, inplace=True)
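The same join idea can be written more compactly by merging the per-group ratio, as a named Series, against column c. This is a sketch under the assumption that the grouping key stays a regular column; the names grouped and ratio are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [7, 8, 9, 10, 11, 12],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w']})

grouped = df.groupby('c')

# One value per group, indexed by the group key 'c'.
ratio = (grouped['a'].sum() / grouped['b'].sum()).rename('ab_weighted')

# Left-merge column 'c' of df against the index of ratio.
df_result = df.merge(ratio, left_on='c', right_index=True, how='left')
print(df_result)
```

Because c is never moved into the index, no renaming or reset_index step is needed afterwards.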

【Comments】:

【Solution 3】:

This also works. I don't know why, but I get an error if I let apply return a Series instead of a DataFrame.

df['ab_weighted'] = \
df.groupby('c', group_keys = False)[['a', 'b']].apply(
    lambda x: pd.Series(x.a.sum()/x.b.sum(), 
                        index = x.index).to_frame()
).iloc[:,0]
print(df)

# output 
#    a   b  c  ab_weighted
# 0  1   7  q     0.294118
# 1  2   8  q     0.294118
# 2  3   9  q     0.294118
# 3  4  10  q     0.294118
# 4  5  11  w     0.478261
# 5  6  12  w     0.478261

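【Comments】:

Be careful with this one. I prefer @BENY's answer or my own, because if the groupby has multiple categories, using iloc[:,0] scrambles the results, meaning the order of the results differs from the input. A join or a post-sort is needed to fix this.

【Solution 4】:

You are just one step away.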
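One way to check the ordering concern is to run both approaches on a frame whose groups are interleaved. The sketch below assumes a unique index, in which case assigning a Series to a column aligns on index labels rather than on position; the concern would apply if the result were first converted to a plain array. The column names via_apply and via_transform are made up for illustration.

```python
import pandas as pd

# Same kind of data as above, but with the groups interleaved.
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [7, 8, 9, 10, 11, 12],
                   'c': ['q', 'w', 'q', 'w', 'q', 'q']})

# Approach from this answer: one single-column frame per group.
via_apply = df.groupby('c', group_keys=False)[['a', 'b']].apply(
    lambda x: pd.Series(x.a.sum() / x.b.sum(), index=x.index).to_frame()
).iloc[:, 0]

# Reference: per-group sums broadcast back with transform.
sums = df.groupby('c')[['a', 'b']].transform('sum')
via_transform = sums['a'] / sums['b']

# Column assignment aligns on index labels.
df['via_apply'] = via_apply
df['via_transform'] = via_transform
print(df)
```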

v = df.groupby('c')[['a', 'b']].transform('sum')
df['ab_weighted'] = v.a / v.b

df
   a   b  c  ab_weighted
0  1   7  q     0.294118
1  2   8  q     0.294118
2  3   9  q     0.294118
3  4  10  q     0.294118
4  5  11  w     0.478261
5  6  12  w     0.478261
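The two steps above can be wrapped in a small function so that the pattern is reusable, including for a list of grouping columns. This is only a sketch; add_group_ratio and its parameter names are hypothetical and not part of pandas.

```python
import pandas as pd

def add_group_ratio(df, by, num, den, name):
    """Add a column holding sum(num) / sum(den) per group, repeated on every row.

    Hypothetical helper for illustration; `by` may be a column name or a list of names.
    """
    # transform('sum') returns a frame with the same index as df.
    sums = df.groupby(by)[[num, den]].transform('sum')
    df[name] = sums[num] / sums[den]
    return df

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [7, 8, 9, 10, 11, 12],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w']})

add_group_ratio(df, 'c', 'a', 'b', 'ab_weighted')
print(df)
```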

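【Comments】:

I like this approach, but you create a copy of the whole DataFrame, and for large datasets that could be expensive. I am looking for a way to create only the additional column without copying the DataFrame or saving intermediate results (data.table allocates memory only for the additional column and does not copy the DataFrame). Thanks, and happy holidays :)
@Christoph Oh, my bad. That was just a representative example. Just do this: df['new'] = v.a / v.b
Hey, I'm afraid you've lost me :) When you do df['new'] = v.a / v.b, you still need to create the intermediate DataFrame v, if I understand correctly?
@Christoph Err, no. This is in-place assignment syntax, so it is more efficient than DataFrame.assign (which copies the data).
Yes, but you still create the DataFrame v with v = df.groupby('c')[['a', 'b']].transform('sum'), right? I'm afraid I haven't finished my coffee yet :)

【Solution 5】:

Just fix your code using map. R and pandas still differ, which means that not every R function has a direct replacement in pandas.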
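On the question of the intermediate v: one variation transforms the two columns as separate Series, so no two-column DataFrame is bound to a name. Temporary Series are still allocated while the expression is evaluated, and nothing in this thread measures whether that saves memory, so treat this as a sketch of the syntax only.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [7, 8, 9, 10, 11, 12],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w']})

g = df.groupby('c')

# Each transform returns a Series aligned with df; the division
# produces the new column directly.
df['ab_weighted'] = g['a'].transform('sum') / g['b'].transform('sum')
print(df)
```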

df.c.map(df.groupby(['c'])[['a', 'b']].apply(lambda x: sum(x['a'])/sum(x['b'])))
Out[67]: 
0    0.294118
1    0.294118
2    0.294118
3    0.294118
4    0.478261
5    0.478261
Name: c, dtype: float64

【Comments】:

Thanks, this answers my specific example, but how would you handle multiple grouping columns, e.g. df.groupby(['c', 'd'])?
Then look at join and merge, @Christoph
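A minimal sketch of what the join suggestion might look like with two grouping columns. The column d and its values are invented here purely for illustration; the original data has no such column.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
                   'b': [7, 8, 9, 10, 11, 12],
                   'c': ['q', 'q', 'q', 'q', 'w', 'w'],
                   'd': ['x', 'x', 'y', 'y', 'x', 'y']})  # 'd' is hypothetical

keys = ['c', 'd']
grouped = df.groupby(keys)

# One ratio per (c, d) pair, indexed by a two-level MultiIndex.
ratio = (grouped['a'].sum() / grouped['b'].sum()).rename('ab_weighted')

# Join the columns listed in `keys` against the index levels of ratio.
df_result = df.join(ratio, on=keys)
print(df_result)
```

Unlike map, which looks up a single key per row, join matches each row's (c, d) pair against the MultiIndex of the grouped result.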