【Question title】: Python Pandas - Dynamic matching of different date indices
【Posted】: 2019-10-21 16:01:52
【Question】:

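I have two dataframes with different time series data (see the example below). Dataframe1 contains multiple daily observations per month, while Dataframe2 contains only one observation per month.

What I want to do now is align the data in Dataframe2 with the last day of each month in Dataframe1. The last day of a month in Dataframe1 is not necessarily the last day of the corresponding calendar month.

I appreciate any hints on how to solve this efficiently (the dataframes can be very large).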

Dataframe1
----------------------------------
date            A          B        
1980-12-31      152.799    209.132
1981-01-01      152.799    209.132
1981-01-02      152.234    209.517
1981-01-05      152.895    211.790
1981-01-06      155.131    214.023
1981-01-07      152.596    213.044
1981-01-08      151.232    211.810
1981-01-09      150.518    210.887
1981-01-12      149.899    210.340
1981-01-13      147.588    207.621
1981-01-14      148.231    208.076
1981-01-15      148.521    208.676
1981-01-16      148.931    209.278
1981-01-19      149.824    210.372
1981-01-20      149.849    210.454
1981-01-21      150.353    211.644
1981-01-22      149.398    210.042
1981-01-23      148.748    208.654
1981-01-26      148.879    208.355
1981-01-27      148.671    208.431
1981-01-28      147.612    207.525
1981-01-29      147.153    206.595
1981-01-30      146.330    205.558
1981-02-02      145.779    206.635
Dataframe2
---------------------------------          
date                C        D     
1981-01-13          53.4     56.5
1981-02-15          52.2     60.0
1981-03-15          51.8     58.0
1981-04-14          51.8     59.5
1981-05-16          50.7     58.0
1981-06-15          50.3     59.5
1981-07-15          50.6     53.5
1981-08-17          50.1     44.5
1981-09-12          50.6     38.5

【Question comments】:

    Tags: python pandas dataframe time-series


    【Solution 1】:

    To give a readable example, I prepared the following test data:

    df1 - some observations from January and February:

            date        A        B
    0 1981-01-02  152.234  209.517
    1 1981-01-07  152.596  213.044
    2 1981-01-13  147.588  207.621
    3 1981-01-20  151.232  211.810
    4 1981-01-27  150.518  210.887
    5 1981-02-05  149.899  210.340
    6 1981-02-14  152.895  211.790
    7 1981-02-16  155.131  214.023
    8 1981-02-21  180.000  200.239
    

    df2 - your data, also from January and February:

            date     C     D
    0 1981-01-13  53.4  56.5
    1 1981-02-15  52.2  60.0
    

    Both dataframes have a date column of datetime type.
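    For anyone who wants to reproduce this, here is a minimal sketch of how the test data above could be built. The values are copied from the two tables; starting from string dates and converting them with pd.to_datetime is my assumption about the setup, not something shown in the answer.

```python
import pandas as pd

# Test data copied from the tables above; dates start out as strings.
df1 = pd.DataFrame({
    'date': ['1981-01-02', '1981-01-07', '1981-01-13', '1981-01-20', '1981-01-27',
             '1981-02-05', '1981-02-14', '1981-02-16', '1981-02-21'],
    'A': [152.234, 152.596, 147.588, 151.232, 150.518,
          149.899, 152.895, 155.131, 180.000],
    'B': [209.517, 213.044, 207.621, 211.810, 210.887,
          210.340, 211.790, 214.023, 200.239],
})
df2 = pd.DataFrame({
    'date': ['1981-01-13', '1981-02-15'],
    'C': [53.4, 52.2],
    'D': [56.5, 60.0],
})

# Convert the string column to datetime64 so that the .dt accessor works.
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])

print(df1.dtypes)
```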

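    Start by getting the last observation of each month from df1 (this takes the last row of each group, so df1 must already be sorted by date):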

    res1 = df1.groupby(df1.date.dt.to_period('M')).tail(1)
    

    With my data, the result is:

            date        A        B
    4 1981-01-27  150.518  210.887
    8 1981-02-21  180.000  200.239
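
    Note that tail(1) returns the last row of each group in row order, not the row with the latest date. If df1 might not be sorted, the following sketch shows two ways to get the same rows regardless of order. The sample frame is hypothetical and deliberately unsorted; the names last_sorted and last_by_idxmax are mine.

```python
import pandas as pd

# Hypothetical, deliberately unsorted sample: the last January row comes first.
df1 = pd.DataFrame({
    'date': pd.to_datetime(['1981-01-27', '1981-01-02', '1981-02-21', '1981-02-05']),
    'A': [150.518, 152.234, 180.000, 149.899],
})

# Option 1: sort by date first, then take the last row of each month.
df_sorted = df1.sort_values('date')
last_sorted = df_sorted.groupby(df_sorted['date'].dt.to_period('M')).tail(1)

# Option 2: look up the index label of the latest date in each month.
last_idx = df1.groupby(df1['date'].dt.to_period('M'))['date'].idxmax()
last_by_idxmax = df1.loc[last_idx]

print(last_sorted)
print(last_by_idxmax)
```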
    

    Then, to join the observations, the join has to be on the whole month period, not on the exact date. To do that, run:

    res = pd.merge(res1.assign(month=res1['date'].dt.to_period('M')),
        df2.assign(month=df2['date'].dt.to_period('M')),
        how='left', on='month', suffixes=('_1', '_2'), )
    

    The result is:

          date_1        A        B   month     date_2     C     D
    0 1981-01-27  150.518  210.887 1981-01 1981-01-13  53.4  56.5
    1 1981-02-21  180.000  200.239 1981-02 1981-02-15  52.2  60.0
    

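    If you want the merge to contain only the months that have at least one observation in both df1 and df2, remove the how parameter. Its default value is inner, which is the right mode in that case.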
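
    To make the difference concrete, here is a small sketch using the month-end rows from the question's Dataframe1 (1980-12-31, 1981-01-30, 1981-02-02) and the first three rows of Dataframe2. December 1980 has no partner in df2, and March 1981 has no partner in df1. The names left, right, merged_left and merged_inner are only for illustration.

```python
import pandas as pd

# Last observation per month from Dataframe1; December 1980 has no match in df2.
res1 = pd.DataFrame({
    'date': pd.to_datetime(['1980-12-31', '1981-01-30', '1981-02-02']),
    'A': [152.799, 146.330, 145.779],
    'B': [209.132, 205.558, 206.635],
})
df2 = pd.DataFrame({
    'date': pd.to_datetime(['1981-01-13', '1981-02-15', '1981-03-15']),
    'C': [53.4, 52.2, 51.8],
    'D': [56.5, 60.0, 58.0],
})

# Add the month key to both sides.
left = res1.assign(month=res1['date'].dt.to_period('M'))
right = df2.assign(month=df2['date'].dt.to_period('M'))

# how='left' keeps every month present in res1; C and D are NaN for 1980-12.
merged_left = pd.merge(left, right, how='left', on='month', suffixes=('_1', '_2'))

# The default how='inner' keeps only months found in both frames.
merged_inner = pd.merge(left, right, on='month', suffixes=('_1', '_2'))

print(merged_left)
print(merged_inner)
```

    In both cases the March row of df2 is dropped, because no row of res1 falls in that month.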

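    【Comments】:

    • Just to clarify: I want to align the observations in dataframe2 with the last daily observation of the corresponding month in dataframe1.
    • I corrected my answer. Now the data from df2 is aligned with the last observation of each month in df1.
    【Solution 2】:

    When you have a sample dataframe, you can provide code for it. Just print each column as a list (Steps 1 and 2), then use those lists to build the dataframe in code (Steps 3 and 4).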

    import pandas as pd
    
    # Step 1: create your dataframe, and print each column as a list, copy-paste into code example below.
    df_1 = pd.read_csv('dataset1.csv')
    print(list(df_1['date']))
    print(list(df_1['A']))
    print(list(df_1['B']))
    
    # Step 2: create your dataframe, and print each column as a list, copy-paste into code example below.
    df_2 = pd.read_csv('dataset2.csv')
    print(list(df_2['date']))
    print(list(df_2['C']))
    print(list(df_2['D']))
    
    # Step 3: create sample dataframe ... good if you can provide this in your future questions
    df_1 = pd.DataFrame({
        'date': ['12/31/1980', '1/1/1981', '1/2/1981', '1/5/1981', '1/6/1981', 
                 '1/7/1981', '1/8/1981', '1/9/1981', '1/12/1981', '1/13/1981',
                 '1/14/1981', '1/15/1981', '1/16/1981', '1/19/1981', '1/20/1981',
                 '1/21/1981', '1/22/1981', '1/23/1981', '1/26/1981', '1/27/1981',
                 '1/28/1981', '1/29/1981', '1/30/1981', '2/2/1981'],
        'A': [152.799, 152.799, 152.234, 152.895, 155.131,
              152.596, 151.232, 150.518, 149.899, 147.588,
              148.231, 148.521, 148.931, 149.824, 149.849,
              150.353, 149.398, 148.748, 148.879, 148.671,
              147.612, 147.153, 146.33, 145.779],
        'B': [209.132, 209.132, 209.517, 211.79, 214.023,
              213.044, 211.81, 210.887, 210.34, 207.621,
              208.076, 208.676, 209.278, 210.372, 210.454,
              211.644, 210.042, 208.654, 208.355, 208.431,
              207.525, 206.595, 205.558, 206.635]
    })
    
    # Step 4: create sample dataframe ... good if you can provide this in your future questions
    df_2 = pd.DataFrame({
        'date': ['1/13/1981', '2/15/1981', '3/15/1981', '4/14/1981', '5/16/1981',
                 '6/15/1981', '7/15/1981', '8/17/1981', '9/12/1981'],
        'C': [53.4, 52.2, 51.8, 51.8, 50.7, 50.3, 50.6, 50.1, 50.6],
        'D': [56.5, 60.0, 58.0, 59.5, 58.0, 59.5, 53.5, 44.5, 38.5]
    })
    
    # Step 5: make sure the date field is actually a date, not a string
    df_1['date'] = pd.to_datetime(df_1['date']).dt.date
    
    # Step 6: create new column with year and month
    df_1['date_year_month'] = pd.to_datetime(df_1['date']).dt.to_period('M')
    
    # Step 7: create boolean mask that grabs the max date for each year-month
    # (the aggregation is passed by name; newer pandas versions warn about the builtin max)
    mask_last_day_month = df_1.groupby('date_year_month')['date'].transform('max') == df_1['date']
    
    # Step 8: create new dataframe with only last day of month
    df_1_max = df_1.loc[mask_last_day_month]
    print('here is dataframe 1 with only last day in the month')
    print(df_1_max)
    print()
    
    # Step 9: make sure the date field is actually a date, not a string
    df_2['date'] = pd.to_datetime(df_2['date']).dt.date
    
    # Step 10: create new column with year and month
    df_2['date_year_month'] = pd.to_datetime(df_2['date']).dt.to_period('M')
    print('here is the original dataframe 2')
    print(df_2)
    print()
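
    The code above stops after the month key has been added to both dataframes. One possible way to finish is to join on that key, sketched below. This step is my assumption about where the answer was heading, not part of the original code. The frames are shortened stand-ins with the same columns, dates are kept as datetime64 rather than .dt.date objects, and the names date_df_2 and df_merged are hypothetical.

```python
import pandas as pd

# Shortened stand-ins for df_1 and df_2 (same columns, fewer rows).
df_1 = pd.DataFrame({
    'date': pd.to_datetime(['1980-12-31', '1981-01-29', '1981-01-30', '1981-02-02']),
    'A': [152.799, 147.153, 146.330, 145.779],
    'B': [209.132, 206.595, 205.558, 206.635],
})
df_2 = pd.DataFrame({
    'date': pd.to_datetime(['1981-01-13', '1981-02-15']),
    'C': [53.4, 52.2],
    'D': [56.5, 60.0],
})
df_1['date_year_month'] = df_1['date'].dt.to_period('M')
df_2['date_year_month'] = df_2['date'].dt.to_period('M')

# Same mask as Step 7: keep only the latest date within each year-month.
mask_last_day_month = df_1.groupby('date_year_month')['date'].transform('max') == df_1['date']
df_1_max = df_1.loc[mask_last_day_month]

# Hypothetical Step 11: left-join df_2 onto the last-day rows via the month key.
# df_2's own date is renamed so that both dates stay visible in the result.
df_merged = df_1_max.merge(
    df_2.rename(columns={'date': 'date_df_2'}),
    on='date_year_month',
    how='left',
)
print(df_merged)
```

    With how='left', December 1980 stays in the result with NaN in C and D; use how='inner' to drop months that have no row in df_2.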
    

    【Comments】:
