【Question title】: auto increment inside group
【Posted】: 2020-03-17 17:59:43
【Question】:

I have a DataFrame:

import pandas as pd

df = pd.DataFrame.from_dict({
    'product': ('a', 'a', 'a', 'a', 'c', 'b', 'b', 'b'),
    'sales': ('-', '-', 'hot_price', 'hot_price', '-', 'min_price', 'min_price', 'min_price'),
    'price': (100, 100, 50, 50, 90, 70, 70, 70),
    'dt': ('2020-01-01 00:00:00', '2020-01-01 00:05:00', '2020-01-01 00:07:00', '2020-01-01 00:10:00', '2020-01-01 00:13:00', '2020-01-01 00:15:00', '2020-01-01 00:19:00', '2020-01-01 00:21:00')
})

  product      sales  price                   dt
0       a          -    100  2020-01-01 00:00:00
1       a          -    100  2020-01-01 00:05:00
2       a  hot_price     50  2020-01-01 00:07:00
3       a  hot_price     50  2020-01-01 00:10:00
4       c          -     90  2020-01-01 00:13:00
5       b  min_price     70  2020-01-01 00:15:00
6       b  min_price     70  2020-01-01 00:19:00
7       b  min_price     70  2020-01-01 00:21:00

I need the following output:

  product      sales  price                   dt  unique_group
0       a          -    100  2020-01-01 00:00:00             0
1       a          -    100  2020-01-01 00:05:00             0
2       a  hot_price     50  2020-01-01 00:07:00             1
3       a  hot_price     50  2020-01-01 00:10:00             1
4       c          -     90  2020-01-01 00:13:00             2
5       b  min_price     70  2020-01-01 00:15:00             3
6       b  min_price     70  2020-01-01 00:19:00             3
7       b  min_price     70  2020-01-01 00:21:00             3

Here is how I do it now:

unique_group = 0
df['unique_group'] = unique_group
for i in range(1, len(df)):
    current, prev = df.loc[i], df.loc[i - 1]
    if not all([
        current['product'] == prev['product'],
        current['sales'] == prev['sales'],
        current['price'] == prev['price'],
    ]):
        unique_group += 1
    df.loc[i, 'unique_group'] = unique_group

Can this be done without iterating? I tried cumsum(), shift(), ngroup(), and drop_duplicates(), but without success.

【Comments】:

    Tags: pandas grouping


    【Solution 1】:

    IIUC, use GroupBy.ngroup:

    df['unique_group'] = df.groupby(['product', 'sales', 'price'],sort=False).ngroup()
    print(df)
    
      product      sales  price                   dt  unique_group
    0       a          -    100  2020-01-01 00:00:00             0
    1       a          -    100  2020-01-01 00:05:00             0
    2       a  hot_price     50  2020-01-01 00:07:00             1
    3       a  hot_price     50  2020-01-01 00:10:00             1
    4       c          -     90  2020-01-01 00:13:00             2
    5       b  min_price     70  2020-01-01 00:15:00             3
    6       b  min_price     70  2020-01-01 00:19:00             3
    7       b  min_price     70  2020-01-01 00:21:00             3
    

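    This works even if the DataFrame is not sorted: rows with the same (product, sales, price) values get the same number wherever they appear.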
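A minimal sketch of how `ngroup()` numbers groups when equal rows are not adjacent. The four-row frame below is made up for illustration and is not the asker's data; it also shows why `sort=False` matters for getting numbers in order of first appearance.

```python
import pandas as pd

# Hypothetical data: the ('b', 'min_price', 70) combination appears first
# and then again at the end, separated by two 'a' rows.
df = pd.DataFrame({
    'product': ['b', 'a', 'a', 'b'],
    'sales': ['min_price', '-', '-', 'min_price'],
    'price': [70, 100, 100, 70],
})
keys = ['product', 'sales', 'price']

# sort=False numbers the groups in order of first appearance.
first_seen = df.groupby(keys, sort=False).ngroup()

# The default sort=True numbers the groups by sorted key instead.
sorted_keys = df.groupby(keys).ngroup()

print(first_seen.tolist())   # [0, 1, 1, 0]
print(sorted_keys.tolist())  # [1, 0, 0, 1]
```

In both cases the two `b` rows share one number, so `ngroup()` labels distinct key combinations rather than consecutive runs.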

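    Another approach

    This works when the DataFrame is ordered so that identical rows are adjacent; it starts a new group wherever a row differs from the previous one: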

    cols = ['product','sales','price']
    df['unique_group'] = df[cols].ne(df[cols].shift()).any(axis=1).cumsum().sub(1)
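To make the chained expression above easier to follow, here is the same logic split into named steps. The small frame and the names `changed` and `run_id` are only for illustration.

```python
import pandas as pd

# Hypothetical shortened version of the question's data.
df = pd.DataFrame({
    'product': ['a', 'a', 'a', 'c', 'b'],
    'sales': ['-', '-', 'hot_price', '-', 'min_price'],
    'price': [100, 100, 50, 90, 70],
})
cols = ['product', 'sales', 'price']

# True where a row differs from the previous row in at least one key column.
# The first row is compared against NaN, so it always counts as a change.
changed = df[cols].ne(df[cols].shift()).any(axis=1)

# Every True starts a new run; the running total numbers runs from 1,
# so subtract 1 to start at 0.
run_id = changed.cumsum().sub(1)

print(changed.tolist())  # [True, False, True, True, True]
print(run_id.tolist())   # [0, 0, 1, 2, 3]
```

Because NaN never compares equal to NaN, a missing value in any key column would also start a new run with this approach.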
    

    【Comments】:

      【Solution 2】:

      Another option that may be a bit faster than groupby:

      df['unique_group'] = (~df.duplicated(['product','sales','price'])).cumsum() - 1
      

      Output:

        product      sales  price                   dt  unique_group
      0       a          -    100  2020-01-01 00:00:00             0
      1       a          -    100  2020-01-01 00:05:00             0
      2       a  hot_price     50  2020-01-01 00:07:00             1
      3       a  hot_price     50  2020-01-01 00:10:00             1
      4       c          -     90  2020-01-01 00:13:00             2
      5       b  min_price     70  2020-01-01 00:15:00             3
      6       b  min_price     70  2020-01-01 00:19:00             3
      7       b  min_price     70  2020-01-01 00:21:00             3
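The three one-liners in this thread agree on the question's data, where each key combination forms a single contiguous block. A hedged comparison on a made-up frame in which one combination recurs later shows where they diverge, which may help in choosing between them:

```python
import pandas as pd

# Hypothetical data: ('a', '-', 100) appears, is interrupted by a 'b' row,
# and then appears again.
df = pd.DataFrame({
    'product': ['a', 'a', 'b', 'a'],
    'sales': ['-', '-', 'min_price', '-'],
    'price': [100, 100, 70, 100],
})
cols = ['product', 'sales', 'price']

# One number per distinct key combination, in order of first appearance.
by_ngroup = df.groupby(cols, sort=False).ngroup()

# One number per consecutive run (same result as the loop in the question).
by_shift = df[cols].ne(df[cols].shift()).any(axis=1).cumsum().sub(1)

# Counts first occurrences seen so far; a repeated row inherits the number
# of the most recently started group.
by_duplicated = (~df.duplicated(cols)).cumsum() - 1

print(by_ngroup.tolist())      # [0, 0, 1, 0]
print(by_shift.tolist())       # [0, 0, 1, 2]
print(by_duplicated.tolist())  # [0, 0, 1, 1]
```

On such data the last row is labelled 0, 2, and 1 respectively, so the `duplicated` variant is only a safe substitute when identical rows are always adjacent and never recur.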
      

      【Comments】:

      • Nice one :) @Quang Hoang