【Question title】:Create Multiple Dataframes using Loop & function
【Posted】:2020-05-29 11:57:28
【Question】:

I have a df with more than 1M rows, similar to this:

ID  Date    Amount
x   May 1   10
y   May 2   20
z   May 4   30
x   May 1   40
y   May 1   50
z   May 2   60
x   May 1   70
y   May 5   80
a   May 6   90
b   May 8   100
x   May 10  110

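I have to sort the data by date and then create new dataframes depending on how many times an ID occurs, that is, on which purchase each row in the Amount column represents. So if x has purchased 3 times, I need it in 3 different dataframes. The first_purchase dataframe would contain every ID that has purchased at least once, irrespective of date or amount. If an ID has purchased 3 times, I need that ID to appear in the first-purchase dataframe, then in the second, then in the third, each with its date and amount.

Doing it manually is easy:-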

df = df.sort_values('Date')
first_purchase = df.drop_duplicates('ID')
after_1stpurchase = df[~df.index.isin(first_purchase.index)]

The second dataframe would be created with:-

after_1stpurchase = after_1stpurchase.sort_values('Date')
second_purchase = after_1stpurchase.drop_duplicates('ID')
after_2ndpurchase = after_1stpurchase[~after_1stpurchase.index.isin(second_purchase.index)]

How do I create a loop that gives me each of these dataframes?

【Comments】:

    Tags: python-3.x pandas loops


    【Solution 1】:

    IIUC, I was able to achieve what you want.

    import pandas as pd
    import numpy as np
    
    # source data for the dataframe
    data = {
    "ID":["x","y","z","x","y","z","x","y","a","b","x"],
    "Date":["May 01","May 02","May 04","May 01","May 01","May 02","May 01","May 05","May 06","May 08","May 10"],
    "Amount":[10,20,30,40,50,60,70,80,90,100,110]
    }
    
    df = pd.DataFrame(data)
    
    # parse the Date column, then format it back to zero-padded strings like "May 01"
    # (the column stays a string, so sorting it is only chronological within one month)
    df['Date'] = pd.to_datetime(df['Date'], format='%b %d').dt.strftime('%b %d')
    
    # sort the values on ID and Date
    df.sort_values(by=['ID', 'Date'], inplace=True)
    df.reset_index(inplace=True, drop=True)
    
    print(df)
    

    Original dataframe:

        Amount    Date ID
    0       90  May 06  a
    1      100  May 08  b
    2       10  May 01  x
    3       40  May 01  x
    4       70  May 01  x
    5      110  May 10  x
    6       50  May 01  y
    7       20  May 02  y
    8       80  May 05  y
    9       60  May 02  z
    10      30  May 04  z
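The `Date` values above are strings, so the sort is alphabetical and happens to be chronological only because every row falls in May. If the real data spans several months, one option is to keep a parsed datetime column for sorting. The sketch below is a minimal illustration under stated assumptions: the sample rows, the `Date_parsed` column name, and the year 2020 are all hypothetical, since the source data carries no year.

```python
import pandas as pd

# Hypothetical sample spanning two months to show the difference.
df = pd.DataFrame({
    "ID": ["x", "x", "x"],
    "Date": ["May 10", "Jun 02", "May 06"],
    "Amount": [110, 120, 70],
})

# ASSUMPTION: every row belongs to 2020; the original data has no year.
ASSUMED_YEAR = "2020"

# Parse into real datetimes so that sorting is chronological across months.
df["Date_parsed"] = pd.to_datetime(ASSUMED_YEAR + " " + df["Date"], format="%Y %b %d")

# Sorting on the string column is alphabetical ("Jun" sorts before "May").
by_string = df.sort_values(["ID", "Date"])["Date"].tolist()

# Sorting on the parsed column follows the calendar.
by_datetime = df.sort_values(["ID", "Date_parsed"])["Date"].tolist()

print(by_string)
print(by_datetime)
```

Keeping the original `Date` text next to the parsed column preserves the display format while the parsed column drives the ordering.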
    


    # create a list of unique ids
    list_id = sorted(list(set(df['ID'])))
    
    # one list of single-row slices per transaction number
    parts = []
    
    # number of transactions to separate out per ID.
    # For example, to record the first 3 entries for
    # each ID, iter is 3. This creates three new
    # dataframes that hold the 1st, 2nd and 3rd
    # transactions respectively.
    # (the name shadows the built-in iter(), which is harmless here)
    iter = 3
    for i in range(iter):
        parts.append([])
    
    
    for val in list_id:
        tmp_df = df.loc[df['ID'] == val].reset_index(drop=True)
    
        # consider only the top iter(=3) values to be distributed
        counter = np.minimum(tmp_df.shape[0], iter)
        for idx in range(counter):
            # DataFrame.append was removed in pandas 2.0, so collect
            # the rows here and combine them with pd.concat afterwards
            parts[idx].append(tmp_df.iloc[[idx]])
    
    # build one dataframe per transaction number (empty if no ID got that far)
    df_list = [pd.concat(p, ignore_index=True) if p else df.iloc[0:0] for p in parts]
    
    # use a separate loop variable so the source df is not overwritten
    for out_df in df_list:
        print(out_df)
    

    Transaction #1:

       Amount    Date ID
    0      90  May 06  a
    1     100  May 08  b
    2      10  May 01  x
    3      50  May 01  y
    4      60  May 02  z
    

    Transaction #2:

       Amount    Date ID
    0      40  May 01  x
    1      20  May 02  y
    2      30  May 04  z
    

    Transaction #3:

       Amount    Date ID
    0      70  May 01  x
    1      80  May 05  y
    

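    Note that in your data, "x" has four transactions. If you also want to track the 4th transaction, all you need to do is change the value of "iter" to 4, and you will get a fourth dataframe with the following values: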

       Amount    Date ID
    0     110  May 10  x
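For a dataframe with over 1M rows, filtering once per ID may be slow, and the number of transactions has to be chosen up front. A possible alternative, sketched here and not part of the original answer, numbers each ID's rows with `groupby(...).cumcount()` and then splits on that number. The function name `split_by_purchase_number` and its parameters are hypothetical.

```python
import pandas as pd


def split_by_purchase_number(df, id_col="ID", date_col="Date"):
    """Return a list whose k-th element holds every ID's (k+1)-th purchase."""
    # A stable sort keeps the original row order for purchases on the same date.
    ordered = df.sort_values(date_col, kind="stable")
    # cumcount() numbers each ID's rows 0, 1, 2, ... in the sorted order.
    purchase_no = ordered.groupby(id_col).cumcount()
    # Group on that number: group 0 = first purchases, group 1 = second, ...
    return [part.reset_index(drop=True) for _, part in ordered.groupby(purchase_no)]


# Same sample data as in the answer above.
df = pd.DataFrame({
    "ID": ["x", "y", "z", "x", "y", "z", "x", "y", "a", "b", "x"],
    "Date": ["May 01", "May 02", "May 04", "May 01", "May 01", "May 02",
             "May 01", "May 05", "May 06", "May 08", "May 10"],
    "Amount": [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110],
})

purchases = split_by_purchase_number(df)
for n, part in enumerate(purchases, start=1):
    print(f"Transaction #{n}:")
    print(part)
```

With this approach the list grows to as many dataframes as the most frequent ID has purchases, so the sample data yields four dataframes without setting a count in advance. Rows inside each dataframe follow date order instead of ID order.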
    

    【Comments】:
