【Question title】: Find earliest date within daterange
【Posted】: 2021-01-24 04:10:58
【Question】:

I have the following market data:

data = pd.DataFrame({'year': [2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020],
                   'month': [10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11],
                   'day': [1,2,5,6,7,8,9,12,13,14,15,16,19,20,21,22,23,26,27,28,29,30,2,3,5,6,9,10,11,12,13,16,17,18,19,20,23,24,25,26,27,30]})

data['date'] = pd.to_datetime(data)

data['spot'] = [77.3438,78.192,78.1044,78.4357,78.0285,77.3507,76.78,77.13,77.0417,77.6525,78.0906,77.91,77.6602,77.3568,76.7243,76.5872,76.1374,76.4435,77.2906,79.2239,78.8993,79.5305,80.5313,79.3615,77.0156,77.4226,76.288,76.5648,77.1171,77.3568,77.374,76.1758,76.2325,76.0401,76.0529,76.1992,76.1648,75.474,75.551,75.7018,75.8639,76.3944]

data = data.set_index('date')

I am trying to find the spot value on the first day of each month in the date column. I can find the first business day like this:

def get_month_beg(d):
    month_beg = (d.index + pd.offsets.BMonthEnd(0) - pd.offsets.MonthBegin(normalize=True)) 
    return month_beg

data['month_beg'] =  get_month_beg(data)

However, because of data issues, the earliest date in my data sometimes does not match the first business day of the month.

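We call the earliest spot value of each month the "strike", and that is what I want to find. So for October the spot is 77.3438 (1 October 2020), and for November it is 80.5313 (2 November 2020, not 1 November 2020).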

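I tried the following, which only works when the earliest date in my data matches the first business date of the month (for example, it works for October but not for November):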

 data['strike'] = data.month_beg.map(data.spot)

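As you can see, I get NaN for November, because the first business day in my data is 11/2 (spot 80.5313) rather than 11/1. Does anyone know how to find the earliest date within a date range (in this case, the earliest date in each month)?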

I would like the final df to look like this:

data = pd.DataFrame({'year': [2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020],
                   'month': [10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11],
                   'day': [1,2,5,6,7,8,9,12,13,14,15,16,19,20,21,22,23,26,27,28,29,30,2,3,5,6,9,10,11,12,13,16,17,18,19,20,23,24,25,26,27,30]})

data['date'] = pd.to_datetime(data)

data['spot'] = [77.3438,78.192,78.1044,78.4357,78.0285,77.3507,76.78,77.13,77.0417,77.6525,78.0906,77.91,77.6602,77.3568,76.7243,76.5872,76.1374,76.4435,77.2906,79.2239,78.8993,79.5305,80.5313,79.3615,77.0156,77.4226,76.288,76.5648,77.1171,77.3568,77.374,76.1758,76.2325,76.0401,76.0529,76.1992,76.1648,75.474,75.551,75.7018,75.8639,76.3944]

data['strike'] = [77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313]

data = data.set_index('date')

【Comments】:

    Tags: python-3.x pandas dataframe datetime


    【Solution 1】:

    I believe we can take first() for each year and month combination and then join the result back onto the main data.

    data2=data.groupby(['year','month']).first().reset_index()
    #join data 2 with data based on month and year later on
    
       year  month  day     spot
    0  2020     10    1  77.3438
    1  2020     11    2  80.5313
    

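    Based on the question, my understanding is that we need the first day of each month and its corresponding spot value.

    Please correct me if I have misunderstood.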
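
    The join mentioned in the code comment could look like the sketch below. It uses a four-row stand-in for the question's data, and the names first_rows and result are illustrative only.

```python
import pandas as pd

# Small stand-in for the question's data (values taken from the question)
data = pd.DataFrame({
    "year": [2020, 2020, 2020, 2020],
    "month": [10, 10, 11, 11],
    "day": [1, 2, 2, 3],
    "spot": [77.3438, 78.192, 80.5313, 79.3615],
})

# One row per (year, month): the first spot value in each group
first_rows = (
    data.groupby(["year", "month"], as_index=False)["spot"]
        .first()
        .rename(columns={"spot": "strike"})
)

# Left-join so every original row picks up its month's strike
result = data.merge(first_rows, on=["year", "month"], how="left")
print(result)
```

    If date is the index, as in the question, call reset_index() before merging, because merging on columns does not preserve the left frame's index.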

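    【Discussion】:

    • Is it possible to add a "strike" column so that the DataFrame stays the same (but with the added column)? The final DataFrame would then contain all dates, the spot values, and the strike column (77.3438 for every date in October, 80.5313 for every date in November). I have added how I would like the final df to look at the end of the question.
    • @HDBrew data['strike']=data.groupby(['year','month'])['spot'].transform('first') We can try this; it should work in this case.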
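    【Solution 2】:

    Strike = spot value on the first day of each month

    To get this, we need to do the following:

    • Step 1: Get the year/month value from the date column. Alternatively, we can use the year and month columns already present in the DataFrame.
    • Step 2: Group by year and month. This gives all the records for each year + month. From each group we take the first record, which is the earliest date of the month. Which date that is depends on the data in the column.
    • Step 3: With transform on the groupby, pandas returns a result whose length matches the DataFrame, so every record in a group receives the same value. In this example there are only 2 months (Oct and Nov) but 42 rows, and transform returns 42 rows. Code: groupby(['year','month'])['date'].transform('first') gives the first day of each month.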

    Use this:

    data['dy'] = data.groupby(['year','month'])['date'].transform('first')
    

    Or:

    data['dx'] = data.date.dt.to_period('M') #to get yyyy-mm value
    
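    • Step 4: Using transform, we can also get the spot value and assign it to strike, which gives the desired result. Instead of returning the first day of the month, we return the first spot value of the month. The code is: groupby(['year','month'])['spot'].transform('first')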

    Use this:

    data['strike'] = data.groupby(['year','month'])['spot'].transform('first')
    

    Or, with the dx column created above:

    data['strike'] = data.groupby('dx')['spot'].transform('first')
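
    To make the difference between aggregating and transforming concrete, here is a minimal sketch on a four-row stand-in for the data. The names grouped, per_group and per_row are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({
    "year": [2020] * 4,
    "month": [10, 10, 11, 11],
    "spot": [77.3438, 78.192, 80.5313, 79.3615],
})

grouped = data.groupby(["year", "month"])["spot"]

per_group = grouped.first()           # aggregated: one value per (year, month)
per_row = grouped.transform("first")  # broadcast: one value per original row

print(len(per_group))  # 2
print(len(per_row))    # 4
```

    Because per_row has the same index as data, it can be assigned straight to a new column.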
    

    Putting it all together

    Full code to get the strike price from the spot price on the first day of the month:

    import pandas as pd
    import numpy as np
    
    data = pd.DataFrame({'year': [2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020],
                       'month': [10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11],
                       'day': [1,2,5,6,7,8,9,12,13,14,15,16,19,20,21,22,23,26,27,28,29,30,2,3,5,6,9,10,11,12,13,16,17,18,19,20,23,24,25,26,27,30]})
    
    data['date'] = pd.to_datetime(data)
    
    data['spot'] = [77.3438,78.192,78.1044,78.4357,78.0285,77.3507,76.78,77.13,77.0417,77.6525,78.0906,77.91,77.6602,77.3568,76.7243,76.5872,76.1374,76.4435,77.2906,79.2239,78.8993,79.5305,80.5313,79.3615,77.0156,77.4226,76.288,76.5648,77.1171,77.3568,77.374,76.1758,76.2325,76.0401,76.0529,76.1992,76.1648,75.474,75.551,75.7018,75.8639,76.3944]
    
    #Pick the first day of month Spot price as the Strike price
    data['strike'] = data.groupby(['year','month'])['spot'].transform('first')
    
    #Every row now carries the strike of its month
    print (data)
    

    The output will be:

        year  month  day       date     spot   strike
    0   2020     10    1 2020-10-01  77.3438  77.3438
    1   2020     10    2 2020-10-02  78.1920  77.3438
    2   2020     10    5 2020-10-05  78.1044  77.3438
    3   2020     10    6 2020-10-06  78.4357  77.3438
    4   2020     10    7 2020-10-07  78.0285  77.3438
    5   2020     10    8 2020-10-08  77.3507  77.3438
    6   2020     10    9 2020-10-09  76.7800  77.3438
    7   2020     10   12 2020-10-12  77.1300  77.3438
    8   2020     10   13 2020-10-13  77.0417  77.3438
    9   2020     10   14 2020-10-14  77.6525  77.3438
    10  2020     10   15 2020-10-15  78.0906  77.3438
    11  2020     10   16 2020-10-16  77.9100  77.3438
    12  2020     10   19 2020-10-19  77.6602  77.3438
    13  2020     10   20 2020-10-20  77.3568  77.3438
    14  2020     10   21 2020-10-21  76.7243  77.3438
    15  2020     10   22 2020-10-22  76.5872  77.3438
    16  2020     10   23 2020-10-23  76.1374  77.3438
    17  2020     10   26 2020-10-26  76.4435  77.3438
    18  2020     10   27 2020-10-27  77.2906  77.3438
    19  2020     10   28 2020-10-28  79.2239  77.3438
    20  2020     10   29 2020-10-29  78.8993  77.3438
    21  2020     10   30 2020-10-30  79.5305  77.3438
    22  2020     11    2 2020-11-02  80.5313  80.5313
    23  2020     11    3 2020-11-03  79.3615  80.5313
    24  2020     11    5 2020-11-05  77.0156  80.5313
    25  2020     11    6 2020-11-06  77.4226  80.5313
    26  2020     11    9 2020-11-09  76.2880  80.5313
    27  2020     11   10 2020-11-10  76.5648  80.5313
    28  2020     11   11 2020-11-11  77.1171  80.5313
    29  2020     11   12 2020-11-12  77.3568  80.5313
    30  2020     11   13 2020-11-13  77.3740  80.5313
    31  2020     11   16 2020-11-16  76.1758  80.5313
    32  2020     11   17 2020-11-17  76.2325  80.5313
    33  2020     11   18 2020-11-18  76.0401  80.5313
    34  2020     11   19 2020-11-19  76.0529  80.5313
    35  2020     11   20 2020-11-20  76.1992  80.5313
    36  2020     11   23 2020-11-23  76.1648  80.5313
    37  2020     11   24 2020-11-24  75.4740  80.5313
    38  2020     11   25 2020-11-25  75.5510  80.5313
    39  2020     11   26 2020-11-26  75.7018  80.5313
    40  2020     11   27 2020-11-27  75.8639  80.5313
    41  2020     11   30 2020-11-30  76.3944  80.5313
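
    Note that transform('first') takes the first row of each group in the frame's current order, which is the earliest date only when the rows are sorted by date. The sketch below assumes input that is deliberately out of order, and the column name month and the variable naive are illustrative only.

```python
import pandas as pd

# Same kind of data, but deliberately out of date order
data = pd.DataFrame({
    "date": pd.to_datetime(["2020-11-03", "2020-10-02", "2020-11-02", "2020-10-01"]),
    "spot": [79.3615, 78.192, 80.5313, 77.3438],
})
data["month"] = data["date"].dt.to_period("M")

# Without sorting, "first" is simply the first row seen in each month
naive = data.groupby("month")["spot"].transform("first")

# Sort by date first so that "first" is the earliest date in each month;
# the result is assigned back by index label, so the row order of data is kept
data["strike"] = (
    data.sort_values("date")
        .groupby("month")["spot"]
        .transform("first")
)
print(data)
```

    In the question's data the rows are already in date order, so the sort is only a safeguard.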
    

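    Previous answer: getting the first day of each month (within the data column)

    One approach is to create a dummy column holding each row's year-month period, then use drop_duplicates() and keep only the first row for each month.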

    Note on single-row months: drop_duplicates() with keep='first' keeps the first occurrence of every dx value, so a month that has only one row is still returned.

    This gives you the first day of each month.

    import pandas as pd
    import numpy as np
    
    data = pd.DataFrame({'year': [2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020],
                       'month': [10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11],
                       'day': [1,2,5,6,7,8,9,12,13,14,15,16,19,20,21,22,23,26,27,28,29,30,2,3,5,6,9,10,11,12,13,16,17,18,19,20,23,24,25,26,27,30]})
    
    data['date'] = pd.to_datetime(data)
    
    data['spot'] = [77.3438,78.192,78.1044,78.4357,78.0285,77.3507,76.78,77.13,77.0417,77.6525,78.0906,77.91,77.6602,77.3568,76.7243,76.5872,76.1374,76.4435,77.2906,79.2239,78.8993,79.5305,80.5313,79.3615,77.0156,77.4226,76.288,76.5648,77.1171,77.3568,77.374,76.1758,76.2325,76.0401,76.0529,76.1992,76.1648,75.474,75.551,75.7018,75.8639,76.3944]
    
    #create a dummy column holding each row's year-month period
    data['dx'] = data.date.dt.to_period('M')
    
    #drop duplicates while retaining only the first row of each month
    dx = data.drop_duplicates('dx',keep='first')
    
    #This will give you the first row of each month
    print (dx)
    

    The output will be:

        year  month  day       date     spot       dx
    0   2020     10    1 2020-10-01  77.3438  2020-10
    22  2020     11    2 2020-11-02  80.5313  2020-11
    

    Alternatively (this also covers a month that has only one row), you can group by the month and take the first record.

    data.groupby(['dx']).first()
    

    This gives you:

             year  month  day       date     spot
    dx                                           
    2020-10  2020     10    1 2020-10-01  77.3438
    2020-11  2020     11    2 2020-11-02  80.5313
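
    As a quick check of how the two approaches treat a month with a single row, here is a small sketch. The December row and its spot value are hypothetical, added only for illustration, and via_dedup and via_group are illustrative names.

```python
import pandas as pd

data = pd.DataFrame({
    "date": pd.to_datetime(["2020-10-01", "2020-10-02", "2020-11-02", "2020-12-01"]),
    "spot": [77.3438, 78.192, 80.5313, 76.0],  # last value is hypothetical
})
data["dx"] = data["date"].dt.to_period("M")

# keep='first' retains the first occurrence of every dx value, unique ones included
via_dedup = data.drop_duplicates("dx", keep="first").set_index("dx")

# first() returns the first non-null value of each column per group
via_group = data.groupby("dx").first()

print(via_dedup)
print(via_group)
```

    Both frames contain one row per month here. They can differ when a column contains NaN, because first() skips nulls column by column while drop_duplicates() keeps whole rows.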
    

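    【Discussion】:

    • Is it possible to add a "strike" column so that the DataFrame stays the same (but with the added column)? The final DataFrame would then contain all dates, the spot values, and the strike column (77.3438 for every date in October, 80.5313 for every date in November). I have added how I would like the final df to look at the end of the question.
    • Yes, it can. Let me add that and post it.
    • @HDBrew, please see my updated answer. Let me know if this is what you wanted.
    【Solution 3】: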
    data['strike']=data.groupby(['year','month'])['spot'].transform('first')
    

    I think this can be achieved without creating any other DataFrame.
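
    If the frame is indexed by date, as at the end of the question's setup, the same idea may also work without the year and month columns. This is a minimal sketch on a four-row stand-in, assuming the index is a DatetimeIndex sorted by date.

```python
import pandas as pd

# Date-indexed frame, as at the end of the question's setup
data = pd.DataFrame(
    {"spot": [77.3438, 78.192, 80.5313, 79.3615]},
    index=pd.to_datetime(["2020-10-01", "2020-10-02", "2020-11-02", "2020-11-03"]),
)
data.index.name = "date"

# Group by the calendar month of the index; no year/month columns required
data["strike"] = data.groupby(data.index.to_period("M"))["spot"].transform("first")
print(data)
```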

    【Discussion】:
