【Question title】: Find earliest date within daterange
【Posted】: 2021-01-24 04:10:58
【Question description】:

I have the following market data:

data = pd.DataFrame({'year': [2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020],
                   'month': [10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11],
                   'day': [1,2,5,6,7,8,9,12,13,14,15,16,19,20,21,22,23,26,27,28,29,30,2,3,5,6,9,10,11,12,13,16,17,18,19,20,23,24,25,26,27,30]})

data['date'] = pd.to_datetime(data)

data['spot'] = [77.3438,78.192,78.1044,78.4357,78.0285,77.3507,76.78,77.13,77.0417,77.6525,78.0906,77.91,77.6602,77.3568,76.7243,76.5872,76.1374,76.4435,77.2906,79.2239,78.8993,79.5305,80.5313,79.3615,77.0156,77.4226,76.288,76.5648,77.1171,77.3568,77.374,76.1758,76.2325,76.0401,76.0529,76.1992,76.1648,75.474,75.551,75.7018,75.8639,76.3944]

data = data.set_index('date')

I am trying to find the spot value on the first day of each month in the date column. I can find the first business day with the following:

def get_month_beg(d):
    month_beg = (d.index + pd.offsets.BMonthEnd(0) - pd.offsets.MonthBegin(normalize=True)) 
    return month_beg

data['month_beg'] =  get_month_beg(data)

However, because of data issues, the earliest date in my data sometimes does not match the first business day of the month.

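We call the earliest spot value of each month the "strike", and that is what I am trying to find. So for October the strike would be 77.3438 (the spot on 10/1/20), and for November it would be 80.5313 (the spot on 11/2/20, not 11/1/20).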

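I tried the following, which only works when the earliest date in my data matches the first business date of the month (for example, it works for October but not for November):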

data['strike'] = data.month_beg.map(data.spot)

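As you can see, I get NaN for November because the first business day in my data is 11/2 (spot 80.5313) rather than 11/1. Does anyone know how to find the earliest date within a date range (in this case, the earliest date in each month)?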

I would like the final df to look like this:

data = pd.DataFrame({'year': [2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020],
                   'month': [10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11],
                   'day': [1,2,5,6,7,8,9,12,13,14,15,16,19,20,21,22,23,26,27,28,29,30,2,3,5,6,9,10,11,12,13,16,17,18,19,20,23,24,25,26,27,30]})

data['date'] = pd.to_datetime(data)

data['spot'] = [77.3438,78.192,78.1044,78.4357,78.0285,77.3507,76.78,77.13,77.0417,77.6525,78.0906,77.91,77.6602,77.3568,76.7243,76.5872,76.1374,76.4435,77.2906,79.2239,78.8993,79.5305,80.5313,79.3615,77.0156,77.4226,76.288,76.5648,77.1171,77.3568,77.374,76.1758,76.2325,76.0401,76.0529,76.1992,76.1648,75.474,75.551,75.7018,75.8639,76.3944]

data['strike'] = [77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,77.3438,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313,80.5313]

data = data.set_index('date')

【Question discussion】:

Tags: python-3.x pandas dataframe datetime

【Solution 1】:

I believe we can take the first() of each year and month combination and then join it back to the main data.

data2=data.groupby(['year','month']).first().reset_index()
#join data 2 with data based on month and year later on

   year  month  day     spot
0  2020     10    1  77.3438
1  2020     11    2  80.5313

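Based on the question, my understanding is that we need the first day of each month and its corresponding spot value.

Please correct me if I have misunderstood.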
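
The join mentioned in the code comment above could look roughly like the sketch below. It uses a small made-up sample in the same shape as the question's data, and the names first_rows and merged are only illustrative. It assumes the rows are already in date order, since first() takes the first row of each group as it appears.

```python
import pandas as pd

# Small made-up sample in the same shape as the question's data
data = pd.DataFrame({
    'year':  [2020, 2020, 2020, 2020],
    'month': [10, 10, 11, 11],
    'day':   [1, 2, 2, 3],
    'spot':  [77.3438, 78.1920, 80.5313, 79.3615],
})

# First spot value of each (year, month) group, renamed to 'strike'
first_rows = (data.groupby(['year', 'month'], as_index=False)['spot']
                  .first()
                  .rename(columns={'spot': 'strike'}))

# Left-join the per-month strike back onto every original row
merged = data.merge(first_rows, on=['year', 'month'], how='left')
print(merged)
```

The merge keeps all original rows and repeats the monthly strike on each of them, which is the shape asked for in the question.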

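【Discussion】:

Is it possible to add a "strike" column so that the DataFrame stays as it is, just with the extra column? The final DataFrame would then contain every date, the spot value, and a strike column (77.3438 for every date in October and 80.5313 for every date in November). I have added the desired final df at the end of the question.
@HDBrew data['strike']=data.groupby(['year','month'])['spot'].transform('first') We can try this; it should work in this case.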

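【Solution 2】:

Strike = spot value on the first day of each month

To get this, we need to do the following:

Step 1: Get the year/month value from the date column. Alternatively, we can use the year and month columns that already exist in the DataFrame.
Step 2: Group by year and month. This collects all records that share the same year and month. From each group we take the first record, which is the earliest date of the month. Note that first() returns the first row in the existing order, so this relies on the rows being sorted by date, as they are here.
Step 3: By using transform with groupby, pandas returns a result whose length matches the DataFrame, repeating the group's value on every row of that group. In this example we have only 2 months (Oct & Nov) but 42 rows, so transform returns 42 rows. The code data.groupby(['year','month'])['date'].transform('first') gives the first day of each month.

Use this: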

data['dy'] = data.groupby(['year','month'])['date'].transform('first')

Or:

data['dx'] = data.date.dt.to_period('M') #to get yyyy-mm value

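Step 4: Using transform, we can also get the spot value and assign it to strike, which gives the desired result. Instead of returning the first day of the month, we change the call to return the spot value. The code will be: data.groupby(['year','month'])['spot'].transform('first')

Use this: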

data['strike'] = data.groupby(['year','month'])['spot'].transform('first')

Or:

data['strike'] = data.groupby('dx')['spot'].transform('first')
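
To make the difference between first() and transform('first') in the steps above concrete, here is a minimal sketch on a made-up five-row sample. The variable names per_group and per_row are only for illustration.

```python
import pandas as pd

# Made-up sample: three rows in October, two in November
data = pd.DataFrame({
    'year':  [2020, 2020, 2020, 2020, 2020],
    'month': [10, 10, 10, 11, 11],
    'spot':  [77.3438, 78.1920, 78.1044, 80.5313, 79.3615],
})

grouped = data.groupby(['year', 'month'])['spot']

per_group = grouped.first()            # one value per (year, month) group
per_row = grouped.transform('first')   # one value per original row

print(len(per_group), len(per_row))
print(per_row)
```

Because per_row has the same index as data, it can be assigned directly as a new column, which is why transform is used for the strike column rather than first().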

Putting it all together

Full code to get the strike price from the spot price on the first day of each month:

import pandas as pd
import numpy as np

data = pd.DataFrame({'year': [2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020],
                   'month': [10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11],
                   'day': [1,2,5,6,7,8,9,12,13,14,15,16,19,20,21,22,23,26,27,28,29,30,2,3,5,6,9,10,11,12,13,16,17,18,19,20,23,24,25,26,27,30]})

data['date'] = pd.to_datetime(data)

data['spot'] = [77.3438,78.192,78.1044,78.4357,78.0285,77.3507,76.78,77.13,77.0417,77.6525,78.0906,77.91,77.6602,77.3568,76.7243,76.5872,76.1374,76.4435,77.2906,79.2239,78.8993,79.5305,80.5313,79.3615,77.0156,77.4226,76.288,76.5648,77.1171,77.3568,77.374,76.1758,76.2325,76.0401,76.0529,76.1992,76.1648,75.474,75.551,75.7018,75.8639,76.3944]

#Pick the first day of month Spot price as the Strike price
data['strike'] = data.groupby(['year','month'])['spot'].transform('first')

#Print the full DataFrame, now including the strike column
print (data)

The output will be:

    year  month  day       date     spot   strike
0   2020     10    1 2020-10-01  77.3438  77.3438
1   2020     10    2 2020-10-02  78.1920  77.3438
2   2020     10    5 2020-10-05  78.1044  77.3438
3   2020     10    6 2020-10-06  78.4357  77.3438
4   2020     10    7 2020-10-07  78.0285  77.3438
5   2020     10    8 2020-10-08  77.3507  77.3438
6   2020     10    9 2020-10-09  76.7800  77.3438
7   2020     10   12 2020-10-12  77.1300  77.3438
8   2020     10   13 2020-10-13  77.0417  77.3438
9   2020     10   14 2020-10-14  77.6525  77.3438
10  2020     10   15 2020-10-15  78.0906  77.3438
11  2020     10   16 2020-10-16  77.9100  77.3438
12  2020     10   19 2020-10-19  77.6602  77.3438
13  2020     10   20 2020-10-20  77.3568  77.3438
14  2020     10   21 2020-10-21  76.7243  77.3438
15  2020     10   22 2020-10-22  76.5872  77.3438
16  2020     10   23 2020-10-23  76.1374  77.3438
17  2020     10   26 2020-10-26  76.4435  77.3438
18  2020     10   27 2020-10-27  77.2906  77.3438
19  2020     10   28 2020-10-28  79.2239  77.3438
20  2020     10   29 2020-10-29  78.8993  77.3438
21  2020     10   30 2020-10-30  79.5305  77.3438
22  2020     11    2 2020-11-02  80.5313  80.5313
23  2020     11    3 2020-11-03  79.3615  80.5313
24  2020     11    5 2020-11-05  77.0156  80.5313
25  2020     11    6 2020-11-06  77.4226  80.5313
26  2020     11    9 2020-11-09  76.2880  80.5313
27  2020     11   10 2020-11-10  76.5648  80.5313
28  2020     11   11 2020-11-11  77.1171  80.5313
29  2020     11   12 2020-11-12  77.3568  80.5313
30  2020     11   13 2020-11-13  77.3740  80.5313
31  2020     11   16 2020-11-16  76.1758  80.5313
32  2020     11   17 2020-11-17  76.2325  80.5313
33  2020     11   18 2020-11-18  76.0401  80.5313
34  2020     11   19 2020-11-19  76.0529  80.5313
35  2020     11   20 2020-11-20  76.1992  80.5313
36  2020     11   23 2020-11-23  76.1648  80.5313
37  2020     11   24 2020-11-24  75.4740  80.5313
38  2020     11   25 2020-11-25  75.5510  80.5313
39  2020     11   26 2020-11-26  75.7018  80.5313
40  2020     11   27 2020-11-27  75.8639  80.5313
41  2020     11   30 2020-11-30  76.3944  80.5313
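
The result above depends on the rows already being in date order. If that cannot be guaranteed, one possible variation (a sketch on made-up, deliberately shuffled rows, not part of the original answer) is to sort by date before grouping and let pandas align the result back on the index:

```python
import pandas as pd

# Made-up rows deliberately out of date order
data = pd.DataFrame({
    'date': pd.to_datetime(['2020-11-03', '2020-10-02', '2020-11-02', '2020-10-01']),
    'spot': [79.3615, 78.1920, 80.5313, 77.3438],
})

# Sort a copy by date so that 'first' really means 'earliest'
ordered = data.sort_values('date')

# Group by calendar month and broadcast the earliest spot to every row
strike = ordered.groupby(ordered['date'].dt.to_period('M'))['spot'].transform('first')

# Assignment aligns on the index labels, so the original row order is kept
data['strike'] = strike
print(data)
```

This assumes the index labels are unique, so that the assignment can match each strike value back to its original row.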

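Previous answer: getting the first day of each month (within the data)

One way is to create a helper column that stores the year-month of each date, then use drop_duplicates() and keep only the first row for each month.

Note on single-row months: drop_duplicates(keep='first') keeps the first occurrence of each value, including values that occur only once, so a month with a single row is still returned. The logic does assume that the rows are sorted by date, so that the first row of each month is the earliest one.

This gives you the first day of each month.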

import pandas as pd
import numpy as np

data = pd.DataFrame({'year': [2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020],
                   'month': [10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,10,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11,11],
                   'day': [1,2,5,6,7,8,9,12,13,14,15,16,19,20,21,22,23,26,27,28,29,30,2,3,5,6,9,10,11,12,13,16,17,18,19,20,23,24,25,26,27,30]})

data['date'] = pd.to_datetime(data)

data['spot'] = [77.3438,78.192,78.1044,78.4357,78.0285,77.3507,76.78,77.13,77.0417,77.6525,78.0906,77.91,77.6602,77.3568,76.7243,76.5872,76.1374,76.4435,77.2906,79.2239,78.8993,79.5305,80.5313,79.3615,77.0156,77.4226,76.288,76.5648,77.1171,77.3568,77.374,76.1758,76.2325,76.0401,76.0529,76.1992,76.1648,75.474,75.551,75.7018,75.8639,76.3944]

#create a helper column holding the year-month period of each date
data['dx'] = data.date.dt.to_period('M')

#drop duplicates while retaining only the first row of each month
dx = data.drop_duplicates('dx',keep='first')

#This will give you the first row of each month
print (dx)

The output will be:

    year  month  day       date     spot       dx
0   2020     10    1 2020-10-01  77.3438  2020-10
22  2020     11    2 2020-11-02  80.5313  2020-11

Alternatively, you can group by the month and take the first record of each group:

data.groupby(['dx']).first()

This gives you:

         year  month  day       date     spot
dx                                           
2020-10  2020     10    1 2020-10-01  77.3438
2020-11  2020     11    2 2020-11-02  80.5313

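【Discussion】:

Is it possible to add a "strike" column so that the DataFrame stays as it is, just with the extra column? The final DataFrame would then contain every date, the spot value, and a strike column (77.3438 for every date in October and 80.5313 for every date in November). I have added the desired final df at the end of the question.
Yes, it is possible. Let me add it and post.
@HDBrew, please see my newly updated answer. Let me know if this is what you wanted.

【Solution 3】: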

data['strike']=data.groupby(['year','month'])['spot'].transform('first')

I think this can be achieved without creating any additional DataFrame.
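
In the question the dates end up in the index (after set_index('date')) rather than in a column. A possible variant for that layout, sketched here on a made-up four-row sample, groups directly on the month of the DatetimeIndex, so the year and month columns are not needed:

```python
import pandas as pd

# Made-up sample with the dates in the index, as in the question
data = pd.DataFrame(
    {'spot': [77.3438, 78.1920, 80.5313, 79.3615]},
    index=pd.to_datetime(['2020-10-01', '2020-10-02', '2020-11-02', '2020-11-03']),
)
data.index.name = 'date'

# Make sure the rows are in date order so 'first' is the earliest date
data = data.sort_index()

# Group by the calendar month of the index and broadcast the first spot
data['strike'] = data.groupby(data.index.to_period('M'))['spot'].transform('first')
print(data)
```

Grouping on data.index.to_period('M') also keeps different years apart, since each period carries both the year and the month.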

【Discussion】: