Python Pandas - returning results of groupby function back to parent table

【Question title】: Python Pandas - returning results of groupby function back to parent table
【Posted】: 2013-07-07 03:22:27
【Question】:

[Using Python 3] I am using pandas to read a csv file, group the DataFrame, apply a function to the grouped data, and add those results back to the original DataFrame.

My input looks like this:

email                   cc  timebucket  total_value
john@john.com           us  1           110.50
example@example.com     uk  3           208.84
...                     ... ...         ...

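Basically, I am trying to group by cc and calculate the percentile rank of each value in total_value within that group. Secondly, I want to apply a conditional statement to those results to assign each row a rank bucket. I need these results added back to the original/parent DataFrame, so that it looks something like this: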

email                   cc  timebucket  total_value     percentrank rankbucket
john@john.com           us  1           110.50          48.59       mid50
example@example.com     uk  3           208.84          99.24       top25
...                     ... ...         ...             ...         ...
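
A minimal sketch of one way to produce both columns, assuming a pandas version that supports `GroupBy.rank(pct=True)` and `pd.cut`. The helper name `add_rank_columns` and the sample rows are hypothetical. The bucket edges follow the 25/75 thresholds from the docstring in the question's code, and a value of exactly 25 or 75 lands in the lower bucket here, which is an assumption since the question does not say how ties at the edges should be handled. Note that `rank(pct=True)` computes rank/n, which differs slightly from the (rank-1)/(n-1) formula used in the question.

```python
import pandas as pd

def add_rank_columns(frame, groupkey="cc", rankkey="total_value"):
    """Return a copy of frame with percentrank and rankbucket columns added."""
    out = frame.copy()
    # Rank each value within its group; pct=True scales the rank to (0, 1].
    out["percentrank"] = out.groupby(groupkey)[rankkey].rank(pct=True) * 100
    # Map the percentile rank onto three labelled buckets: (0, 25], (25, 75], (75, 100].
    out["rankbucket"] = pd.cut(
        out["percentrank"],
        bins=[0, 25, 75, 100],
        labels=["bottom25", "mid50", "top25"],
    )
    return out

# Made-up sample data, for illustration only.
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "c@x.com", "d@x.com", "e@x.com"],
    "cc": ["us", "us", "us", "us", "uk"],
    "timebucket": [1, 2, 3, 1, 3],
    "total_value": [110.50, 50.00, 75.00, 20.00, 208.84],
})
result = add_rank_columns(df)
print(result)
```

Because both new columns share the index of the input frame, no merge step is needed to get the group results back onto the parent table.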

The code below gives me an AssertionError and I cannot figure out why. I am very new to Python and pandas, so that may well explain it.

Code:

import pandas as pd
import numpy as np
from scipy.stats import rankdata

def percentilerank(frame, groupkey='cc', rankkey='total_value'):
    from pandas.compat.scipy import percentileofscore

    # Technically the below percentileofscore function should do the trick but I cannot
    # get that to work, hence the alternative below. It would be great if the answer would
    # include both so that I can understand why one works and the other doesnt.
    # func = lambda x, score: percentileofscore(x[rankkey], score, kind='mean')

    func = lambda x: (rankdata(x.total_value)-1)/(len(x.total_value)-1)*100
    frame['percentrank'] = frame.groupby(groupkey).transform(func)


def calc_and_write(filename):
    """
    Function reads the file (must be tab-separated) and stores in a pandas DataFrame.
    Next, the percentile rank score is calculated from total_value, within each country.
    Secondly, based on the percentile rank score (prs) a row is assigned to one of three buckets:
        rankbucket = 'top25' if prs > 75
        rankbucket = 'mid50' if 25 < prs < 75
        rankbucket = 'bottom25' if prs < 25
    """

    # Define headers for pandas to read in DataFrame, stored in a list
    headers = [
        'email',            # 0
        'cc',               # 1
        'last_trans_date',  # 3
        'timebucket',       # 4
        'total_value',      # 5
    ]

    # Reading csv file in chunks and creating an iterator (is supposed to be much faster than reading at once)
    tp = pd.read_csv(filename, delimiter='\t', names=headers, iterator=True, chunksize=50000)
    # Concatenating the chunks and sorting the full DataFrame by cc and total_value
    df = pd.concat(tp, ignore_index=True).sort(['cc', 'total_value'], ascending=False)

    percentilerank(df)

Edit: as requested, here is the traceback log:

Traceback (most recent call last):
  File "C:\Users\m\Documents\Python\filter_n_split_3.py", line 85, in <module>
    print(calc_and_write('tsv/test.tsv'))
  File "C:\Users\m\Documents\Python\filter_n_split_3.py", line 74, in calc_and_write
    percentilerank(df)
  File "C:\Users\m\Documents\Python\filter_n_split_3.py", line 33, in percentilerank
    frame['percentrank'] = frame.groupby(groupkey).transform(func)
  File "C:\Python33\lib\site-packages\pandas\core\groupby.py", line 1844, in transform
    axis=self.axis, verify_integrity=False)
  File "C:\Python33\lib\site-packages\pandas\tools\merge.py", line 894, in concat
    verify_integrity=verify_integrity)
  File "C:\Python33\lib\site-packages\pandas\tools\merge.py", line 964, in __init__
    self.new_axes = self._get_new_axes()
  File "C:\Python33\lib\site-packages\pandas\tools\merge.py", line 1124, in _get_new_axes
    assert(len(self.join_axes) == ndim - 1)
AssertionError

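【Comments】:

What AssertionError? Could you include the whole stack trace (including line numbers, and which line of your code it corresponds to)?
Hi Andy, I have added the traceback log, I hope this makes more sense.
So it stems from frame.groupby(groupkey).transform(func)...

Tags: python csv python-3.x pandas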

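【Solution 1】:

Try this. Your example returns a Series from the transform function, but it should return a single value. (This uses the pandas rank function, FYI.)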

In [33]: df
Out[33]: 
                 email  cc  timebucket  total_value
0        john@john.com  us           1       110.50
1  example@example.com  uk           3       208.84
2          foo@foo.com  us           2        50.00

In [34]: df.groupby('cc')['total_value'].apply(lambda x: 100*x.rank()/len(x))
Out[34]: 
0    100
1    100
2     50
dtype: float64

In [35]: df['prank'] = df.groupby('cc')['total_value'].apply(lambda x: 100*x.rank()/len(x))

In [36]: df
Out[36]: 
                 email  cc  timebucket  total_value  prank
0        john@john.com  us           1       110.50    100
1  example@example.com  uk           3       208.84    100
2          foo@foo.com  us           2        50.00     50
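
The same assignment can also be sketched with `transform` on the selected column, which returns a Series carrying the same index as the parent frame. This reuses the three-row frame from the session above. It is a sketch rather than part of the original answer, offered on the assumption that in newer pandas versions `apply` may prepend the group key to the result's index, which would make `transform` the safer choice for assigning back.

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["john@john.com", "example@example.com", "foo@foo.com"],
    "cc": ["us", "uk", "us"],
    "timebucket": [1, 3, 2],
    "total_value": [110.50, 208.84, 50.00],
})

# Selecting the column first means the function receives one Series per group.
grouped = df.groupby("cc")["total_value"]

# transform returns a Series indexed like df, so it can be assigned directly.
df["prank"] = grouped.transform(lambda x: 100 * x.rank() / len(x))
print(df)
```

Selecting `['total_value']` before calling `transform` matters: without it, the function is applied to the grouped frame rather than to the single column being ranked.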

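【Comments】:

Hi Jeff, thanks for your answer. I will have to look into it further, but it looks like it does the trick! By the way, do you know why percentileofscore doesn't work this way?
Your function percentileofscore takes 2 arguments, so it is not compatible with transform (which expects 1). But I'm not sure what score would be...
percentileofscore takes two arguments: a sorted list of values and the actual row value. I thought something like percentileofscore([v for v in df['total_value']], df['total_value']) would do the trick, but it doesn't.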
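
On the `percentileofscore` point, a possible sketch: `scipy.stats.percentileofscore(a, score, kind=...)` compares one score against an array of reference values, so inside a group it can be called once per row with the group's own values as the reference. The helper name `group_percentile` and the sample rows are hypothetical, and the import comes from `scipy.stats` rather than the `pandas.compat.scipy` path used in the question.

```python
import pandas as pd
from scipy.stats import percentileofscore

def group_percentile(values, kind="mean"):
    """Percentile of each value relative to all values in the same group."""
    # One call per element: the group's values are the reference array,
    # and the element itself is the score.
    return values.apply(lambda score: percentileofscore(values, score, kind=kind))

# Made-up sample data, for illustration only.
df = pd.DataFrame({
    "cc": ["us", "us", "us", "us", "uk"],
    "total_value": [110.50, 50.00, 75.00, 20.00, 208.84],
})

# The wrapper takes a single argument, which is what transform passes per group.
df["percentrank"] = df.groupby("cc")["total_value"].transform(group_percentile)
print(df)
```

This makes one `percentileofscore` call per row, so on large groups the rank-based approach is likely to be considerably faster.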