【Question title】: Allocate values between two pandas dataframes
【Posted】: 2023-01-27 22:34:05
【Question】:

Consider two DataFrames:

>> import pandas as pd
>> df1 = pd.DataFrame({"category": ["foo", "foo", "bar", "bar", "bar"], "quantity": [1,2,1,2,3]})
>> print(df1)

    category    quantity
0   foo         1
1   foo         2
2   bar         1
3   bar         2
4   bar         3

>> df2 = pd.DataFrame({
            "category": ["foo", "foo", "foo", "foo", "bar", "bar", "bar", "bar", "bar", "bar"], 
            "item": ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J"]
        })
>> print(df2)
      category item
0      foo      A
1      foo      B
2      foo      C
3      foo      D
4      bar      E
5      bar      F
6      bar      G
7      bar      H
8      bar      I
9      bar      J

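How can I create a new column in df1 (in a new DataFrame called df3) that joins on the category column of df1 and allocates the values of the item column in df2, so that each row receives as many items as its quantity? That is, create something like this: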

>> df3 = pd.DataFrame({
           "category": ["foo", "foo", "bar", "bar", "bar"], 
           "quantity": [1,2,1,2,3],
           "item": ["A", "B,C", "E", "F,G", "H,I,J"] 
})

     category  quantity   item
0      foo         1      A
1      foo         2      B,C
2      bar         1      E
3      bar         2      F,G
4      bar         3      H,I,J

【Question comments】:

    Tags: pandas dataframe group-by


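    【Solution 1】:

    You can create a helper DataFrame by repeating the rows according to the quantity column with Index.repeat and DataFrame.loc, converting the index to a column so the original indices are not lost. Then create a helper column g in both DataFrames with GroupBy.cumcount, so that the merge can match rows within the duplicated categories. Finally, use DataFrame.merge and aggregate the items with join: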

    df11 = (df1.loc[df1.index.repeat(df1['quantity'])].reset_index()
               .assign(g = lambda x: x.groupby('category').cumcount()))
    
    df22 = df2.assign(g = df2.groupby('category').cumcount())
    
    df = (df11.merge(df22, on=['g','category'], how='left')
              .groupby(['index','category','quantity'])['item']
              .agg(lambda x: ','.join(x.dropna()))
              .droplevel(0)
              .reset_index())
    print (df)
      category  quantity   item
    0      foo         1      A
    1      foo         2    B,C
    2      bar         1      E
    3      bar         2    F,G
    4      bar         3  H,I,J
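
To see why the helper column g makes the merge line up, it can help to look at the intermediate frame. The following is a minimal sketch that reuses df1 and df2 from the question; the name `merged` is introduced here for illustration only and does not appear in the answer.

```python
import pandas as pd

df1 = pd.DataFrame({"category": ["foo", "foo", "bar", "bar", "bar"],
                    "quantity": [1, 2, 1, 2, 3]})
df2 = pd.DataFrame({"category": ["foo"] * 4 + ["bar"] * 6,
                    "item": list("ABCDEFGHIJ")})

# Repeat each df1 row `quantity` times; reset_index keeps the original
# row label in a column named "index".
df11 = df1.loc[df1.index.repeat(df1["quantity"])].reset_index()
# Number the repeated rows within each category: 0, 1, 2, ...
df11["g"] = df11.groupby("category").cumcount()

# Number the items within each category the same way.
df22 = df2.assign(g=df2.groupby("category").cumcount())

# Each repeated row now picks up exactly one item with the same (category, g).
# `merged` is a hypothetical name used only in this sketch.
merged = df11.merge(df22, on=["g", "category"], how="left")
print(merged[["index", "category", "quantity", "g", "item"]])
```

Each original row of df1 appears `quantity` times in `merged`, one item per copy, which is why grouping by the "index" column and joining the items gives the desired result. Items left over in df2 (here D) are simply not matched.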
    

    【Comments】:

      【Solution 2】:

      You can use iterators together with itertools.islice:

      from itertools import islice
      
      # aggregate the items as iterator
      s = df2.groupby('category')['item'].agg(iter)
      
      # for each category, allocate as many items as needed and join
      df1['item'] = (df1.groupby('category', group_keys=False)['quantity']
                        .apply(lambda g:
                               g.map(lambda x: ','.join(list(islice(s[g.name], x)))))
                     )
      

      Output:

        category  quantity   item
      0      foo         1      A
      1      foo         2    B,C
      2      bar         1      E
      3      bar         2    F,G
      4      bar         3  H,I,J
      

      Note that if there are not enough items, this only uses the ones that are available.

      Example using a df2 truncated after F as input:

        category  quantity item
      0      foo         1    A
      1      foo         2  B,C
      2      bar         1    E
      3      bar         2    F
      4      bar         3     
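
The truncated case can be reproduced with a small variant that keeps one iterator per category in a plain dict rather than in a Series. This is a sketch of the same idea, not the answer's exact code; `df2_short`, `iters`, and `out` are hypothetical names, and it assumes every category in df1 also occurs in df2 (otherwise the dict lookup raises KeyError).

```python
from itertools import islice

import pandas as pd

df1 = pd.DataFrame({"category": ["foo", "foo", "bar", "bar", "bar"],
                    "quantity": [1, 2, 1, 2, 3]})
# Assumption: df2 from the question, cut off after item F.
df2_short = pd.DataFrame({"category": ["foo"] * 4 + ["bar"] * 2,
                          "item": list("ABCDEF")})

# One iterator per category; each islice call below consumes items from it,
# so later rows continue where earlier rows stopped.
iters = {cat: iter(items)
         for cat, items in df2_short.groupby("category")["item"]}

# Take up to `quantity` items for each row, in row order.
out = df1.assign(item=[",".join(islice(iters[cat], qty))
                       for cat, qty in zip(df1["category"], df1["quantity"])])
print(out)
```

Because islice stops quietly when the iterator is exhausted, the last bar row ends up with an empty string instead of raising an error.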
      

      【Comments】:

      • If efficiency matters, note that this solution is more than 5 times faster ;)