How to group by date, find the max, and append it to the dataframe? Answers

【Question title】: How to groupby date, find the max, and append it to the dataframe?
【Posted】: 2020-12-27 05:37:58
【Question description】:

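I have a dataframe of daily COVID data for each country.
I want to add rows to that dataframe holding, for each day in the dataframe, the maximum value for that day.
- This is to help me build a dcc slider with the correct max value
Here is an example of what I am trying to do (but it isn't working):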

import pandas as pd
from datetime import date, timedelta

today = pd.Timestamp.today()

df = pd.DataFrame([['china',today,1,4,7],
                     ['america',today,2,5,8], 
                     ['china',date.today() - timedelta(days=1),3,6,9], 
                     ['india',date.today() - timedelta(days=2),4,7,10]], 

                     columns=['country','date', 'a','b','c'])

print('-----dataframe BEFORE appending rows with daily max values-----')
print(df)

for i in df['date'].unique():
    print('-----new iteration-----')
    print(i) # correct
    temp = df.loc[df['date'] == i]
    max_a = temp['a'].max()
    max_b = temp['b'].max()
    max_c = temp['c'].max()
    new_df = pd.DataFrame([['daily max',i,max_a,max_b,max_c]], columns=['country','date', 'a','b','c'])
                                 
    print('-----new line to be added to the dataframe-----')
    print(new_df) # correct

    df.append(new_df) # isn't working

    print('-----end of iteration-----')

print(df) # printing the same as the original dataframe :(

This is what it gives me:

I have tried adding , ignore_index=True) and , ignore_index=False), but neither works.

【Question comments】:

Tags: python pandas dataframe for-loop append

【Solution 1】:

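See the example with real COVID data at the bottom.
Companion Jupyter Notebook
Using a for-loop this way with pandas is an anti-pattern, and it is much slower than the built-in vectorized methods.
This is easier to do with pandas.DataFrame.groupby.
- Use groupby to find the daily max of every column, then concat the result with df
groupby is the faster, vectorized way to solve this problem.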

import pandas as pd
from datetime import date, timedelta

today = pd.Timestamp.today()

# note that the 8 and 7 for china and america are swapped for testing
df = pd.DataFrame([['china',today,1,4,8],  
                     ['america',today,2,5,7], 
                     ['china',date.today() - timedelta(days=1),3,6,9], 
                     ['india',date.today() - timedelta(days=2),4,7,10]], 

                     columns=['country','date', 'a','b','c'])

# find the daily max: 1 line of fast code compared to 7 lines of a for-loop
daily_max = df.groupby('date', as_index=False)[['a', 'b', 'c']].max()

# add column with daily_max
daily_max['country'] = 'daily max'

# combine with df
df_updated = pd.concat([df, daily_max]).sort_values(['date', 'country']).reset_index(drop=True)

# display(df_updated)

     country                       date  a  b   c
0  daily max 2020-09-06 00:00:00.000000  4  7  10
1      india 2020-09-06 00:00:00.000000  4  7  10
2      china 2020-09-07 00:00:00.000000  3  6   9
3  daily max 2020-09-07 00:00:00.000000  3  6   9
4    america 2020-09-08 14:38:20.382794  2  5   7
5      china 2020-09-08 14:38:20.382794  1  4   8
6  daily max 2020-09-08 14:38:20.382794  2  5   8
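The grouping above works because both rows for "today" carry the identical timestamp. If the real data had readings taken at different times on the same day, grouping on the raw timestamp would treat them as separate days. A minimal sketch of one way to handle that, assuming the date column is a datetime64 column; the sample values and the helper column name day are made up for illustration:

```python
import pandas as pd

# hypothetical sample data: two readings on 2020-09-08 at different times of day
df = pd.DataFrame({
    'country': ['china', 'america', 'india'],
    'date': pd.to_datetime(['2020-09-08 09:15:00',
                            '2020-09-08 17:40:00',
                            '2020-09-07 12:00:00']),
    'a': [1, 2, 4],
    'b': [4, 5, 7],
    'c': [8, 7, 10],
})

# truncate every timestamp to midnight so rows from the same calendar day share one key
# ('day' is an illustrative column name, not part of the original data)
df['day'] = df['date'].dt.normalize()

# one row per calendar day with the column-wise maximum
daily_max = df.groupby('day', as_index=False)[['a', 'b', 'c']].max()
```

Grouping on the normalized column leaves the original timestamps in date untouched, so they are still available for display.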

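Another approach is to add boolean columns that flag the rows holding the daily max.
This would allow a single slider to be used for the max of a, b, or c.
This also uses groupby, together with .transform so the result keeps the same axes as the dataframe.
If a metric is 0 for a whole day, so there are no counts at all, the entire column will be True for that day, because 0 is the max.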

import pandas as pd
from datetime import date, timedelta

today = pd.Timestamp.today()

# note that the 8 and 7 for china and america are swapped for testing
df = pd.DataFrame([['china',today,1,4,8],  
                     ['america',today,2,5,7], 
                     ['china',date.today() - timedelta(days=1),3,6,9], 
                     ['india',date.today() - timedelta(days=2),4,7,10]], 

                     columns=['country','date', 'a','b','c'])

# add columns using groupby and transform
df[['max_a', 'max_b', 'max_c']] = df.groupby('date')[['a', 'b', 'c']].transform('max') == df[['a', 'b', 'c']]

# display(df)
   country                       date  a  b   c  max_a  max_b  max_c
0    china 2020-09-08 13:14:25.713340  1  4   8  False  False   True
1  america 2020-09-08 13:14:25.713340  2  5   7   True   True  False
2    china 2020-09-07 00:00:00.000000  3  6   9   True   True   True
3    india 2020-09-06 00:00:00.000000  4  7  10   True   True   True
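The all-zero caveat mentioned above can be seen on a small made-up frame. The second flag column is only one possible workaround, and it assumes that a day with no counts should not be flagged at all; the data and the column name max_a_nonzero are hypothetical:

```python
import pandas as pd

# hypothetical data: metric 'a' is 0 for every country on 2020-09-07
df = pd.DataFrame({
    'country': ['china', 'america', 'china', 'india'],
    'date': pd.to_datetime(['2020-09-08', '2020-09-08', '2020-09-07', '2020-09-07']),
    'a': [1, 2, 0, 0],
})

# per-row copy of that row's daily maximum
group_max = df.groupby('date')['a'].transform('max')

# plain comparison: every row of the all-zero day is flagged True
df['max_a'] = df['a'] == group_max

# stricter variant (an assumption about the desired behavior): require a positive maximum
df['max_a_nonzero'] = (df['a'] == group_max) & (group_max > 0)
```

With the plain comparison both rows for 2020-09-07 come out True; the stricter variant leaves them False.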

Example with real COVID data

All four metrics are from Our World in Data

import pandas as pd

# load first 6 columns of data and parse dates
df = pd.read_csv('https://raw.githubusercontent.com/trenton3983/stack_overflow/master/data/so_data/2020-09-08%2063800602/covid_data.csv', parse_dates=['date'], usecols=range(6))

# remove World from location, because this is the sum for each day and will always be the max
df = df[df.location != 'World']

# get last four columns, because I'm too lazy to type them
cols = df.columns[-4:]

# find the daily max: 1 line of fast code compared to 7 lines of a for-loop
daily_max = df.groupby('date', as_index=False)[cols].max()

# add column with daily_max
daily_max['location'] = 'daily max'

# combine with df
df_updated = pd.concat([df, daily_max]).sort_values(['date', 'location']).reset_index(drop=True)

Show the tail of 2020-07-04

df_updated[df_updated.date == '2020-07-04'].tail(15)

            date                      location  new_cases  new_deaths  total_cases  total_deaths
28124 2020-07-04                       Ukraine      876.0        27.0      46763.0        1212.0
28125 2020-07-04          United Arab Emirates      672.0         1.0      50141.0         318.0
28126 2020-07-04                United Kingdom      602.0        49.0     286141.0       40581.0
28127 2020-07-04                 United States    54442.0       694.0    2794321.0      129434.0
28128 2020-07-04  United States Virgin Islands       13.0         0.0        111.0           6.0
28129 2020-07-04                       Uruguay        5.0         0.0        952.0          28.0
28130 2020-07-04                    Uzbekistan      301.0         2.0       9500.0          29.0
28131 2020-07-04                       Vatican        0.0         0.0         12.0           0.0
28132 2020-07-04                     Venezuela      264.0         2.0       6537.0          59.0
28133 2020-07-04                       Vietnam        0.0         0.0        355.0           0.0
28134 2020-07-04                Western Sahara       58.0         0.0        519.0           1.0
28135 2020-07-04                         Yemen       19.0        10.0       1240.0         335.0
28136 2020-07-04                        Zambia        0.0         0.0       1632.0          30.0
28137 2020-07-04                      Zimbabwe        8.0         0.0        625.0           7.0
28138 2020-07-04                     daily max    54442.0      1290.0    2794321.0      129434.0

Example output with both approaches combined

Note that the daily max for 3 of the metrics occurs in the United States

            date                      location  new_cases  new_deaths  total_cases  total_deaths max new_cases max new_deaths max total_cases max total_deaths
28124 2020-07-04                       Ukraine      876.0        27.0      46763.0        1212.0         False          False           False            False
28125 2020-07-04          United Arab Emirates      672.0         1.0      50141.0         318.0         False          False           False            False
28126 2020-07-04                United Kingdom      602.0        49.0     286141.0       40581.0         False          False           False            False
28127 2020-07-04                 United States    54442.0       694.0    2794321.0      129434.0          True          False            True             True
28128 2020-07-04  United States Virgin Islands       13.0         0.0        111.0           6.0         False          False           False            False
28129 2020-07-04                       Uruguay        5.0         0.0        952.0          28.0         False          False           False            False
28130 2020-07-04                    Uzbekistan      301.0         2.0       9500.0          29.0         False          False           False            False
28131 2020-07-04                       Vatican        0.0         0.0         12.0           0.0         False          False           False            False
28132 2020-07-04                     Venezuela      264.0         2.0       6537.0          59.0         False          False           False            False
28133 2020-07-04                       Vietnam        0.0         0.0        355.0           0.0         False          False           False            False
28134 2020-07-04                Western Sahara       58.0         0.0        519.0           1.0         False          False           False            False
28135 2020-07-04                         Yemen       19.0        10.0       1240.0         335.0         False          False           False            False
28136 2020-07-04                        Zambia        0.0         0.0       1632.0          30.0         False          False           False            False
28137 2020-07-04                      Zimbabwe        8.0         0.0        625.0           7.0         False          False           False            False
28138 2020-07-04                     daily max    54442.0      1290.0    2794321.0      129434.0           NaN            NaN             NaN              NaN
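The code that produced this combined table is not shown. A minimal sketch of how it could be built, assuming the boolean flags are added to df before concatenating with daily_max (which would explain the NaN flags on the daily max row). The three-row frame and its values are a made-up stand-in for the real data:

```python
import pandas as pd

# hypothetical stand-in for the real data set (values are illustrative only)
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-07-04'] * 3),
    'location': ['United Kingdom', 'United States', 'Yemen'],
    'new_cases': [600.0, 54000.0, 20.0],
    'new_deaths': [50.0, 40.0, 10.0],
})
cols = ['new_cases', 'new_deaths']
flag_cols = [f'max {c}' for c in cols]

# approach 1: one 'daily max' row per date
daily_max = df.groupby('date', as_index=False)[cols].max()
daily_max['location'] = 'daily max'

# approach 2: boolean flags marking which rows hold the daily max
df[flag_cols] = df.groupby('date')[cols].transform('max') == df[cols]

# daily_max has no flag columns, so concat fills them with NaN on those rows
df_updated = (pd.concat([df, daily_max])
                .sort_values(['date', 'location'])
                .reset_index(drop=True))
```

Because the flags are computed before the concat, the daily max row never compares against itself and ends up with missing values in the flag columns.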

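【Comments】:

【Solution 2】:

When using df.append, you have to assign the result back to df itself. In your example you call df.append(new_df) without assigning it to the original df, so the appended frame is created and then discarded; when you print df it shows no change, because the original df object was never modified. .append() is not an in-place method. Try: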

import pandas as pd
from datetime import date, timedelta

today = pd.Timestamp.today()

df = pd.DataFrame([['china',today,1,4,7],
                     ['america',today,2,5,8], 
                     ['china',date.today() - timedelta(days=1),3,6,9], 
                     ['india',date.today() - timedelta(days=2),4,7,10]], 

                     columns=['country','date', 'a','b','c'])

print('-----dataframe BEFORE appending rows with daily max values-----')
print(df)

for i in df['date'].unique():
    print('-----new iteration-----')
    print(i) # correct
    temp = df.loc[df['date'] == i]
    max_a = temp['a'].max()
    max_b = temp['b'].max()
    max_c = temp['c'].max()
    new_df = pd.DataFrame([['daily max',i,max_a,max_b,max_c]], columns=['country','date', 'a','b','c'])
                                 
    print('-----new line to be added to the dataframe-----')
    print(new_df) # correct

    df = df.append(new_df) # THIS IS THE CHANGED LINE

    print('-----end of iteration-----')

print(df) # now includes the appended daily max rows

This outputs:

...
-----end of iteration-----
     country                       date  a  b   c
0      china 2020-09-08 15:00:37.074594  1  4   7
1    america 2020-09-08 15:00:37.074594  2  5   8
2      china 2020-09-07 00:00:00.000000  3  6   9
3      india 2020-09-06 00:00:00.000000  4  7  10
0  daily max 2020-09-08 15:00:37.074594  2  5   8
0  daily max 2020-09-07 00:00:00.000000  3  6   9
0  daily max 2020-09-06 00:00:00.000000  4  7  10

Before printing, you may want to do:

df.reset_index(inplace=True, drop=True)

because, as you can see, every newly appended row has index 0.
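DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the loop above only runs on older versions. A sketch of the same idea with pd.concat, assuming it is acceptable to collect the new rows in a list and concatenate once at the end; the fixed dates and the name new_rows are illustrative:

```python
import pandas as pd

# same shape as the example data, with fixed dates so the result is reproducible
df = pd.DataFrame({
    'country': ['china', 'america', 'china', 'india'],
    'date': pd.to_datetime(['2020-09-08', '2020-09-08', '2020-09-07', '2020-09-06']),
    'a': [1, 2, 3, 4],
    'b': [4, 5, 6, 7],
    'c': [7, 8, 9, 10],
})

# collect one 'daily max' record per date instead of appending inside the loop
new_rows = []
for day in df['date'].unique():
    temp = df.loc[df['date'] == day]
    new_rows.append({'country': 'daily max', 'date': day,
                     'a': temp['a'].max(),
                     'b': temp['b'].max(),
                     'c': temp['c'].max()})

# pd.concat returns a new frame, so the result still has to be assigned back;
# ignore_index=True renumbers the rows, making a separate reset_index unnecessary
df = pd.concat([df, pd.DataFrame(new_rows)], ignore_index=True)
```

Concatenating once also avoids copying the whole frame on every iteration.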

【Comments】: