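Python pandas: select rows based on a datetime condition - Answers

【Question title】: Python pandas select rows based on datetime condition
【Posted】: 2022-02-22 03:20:48
【Question description】:

Here is code that generates sample mock data. The real data may have different start and end dates.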

import pandas as pd
import numpy as np  

dates = pd.date_range("20100121", periods=3653)   
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))    
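# Take the last value in each business-day bin; .iloc makes the lookup explicitly positional
dfb = df.resample('B').apply(lambda x: x.iloc[-1])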

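From dfb, I want to select only the rows belonging to months that contain values for every date of the month. In dfb, the data for January 2010 and January 2020 is incomplete, so I want the data from February 2010 through December 2019.

For this particular dataset, I can do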

df_out=dfb['2010-02':'2019-12']

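but please help me find a better solution.

Edit: there seems to be a lot of confusion about this question. I want to omit the leading rows when the data does not start on the first day of a month, and the trailing rows when it does not end on the last day of a month. I hope that is clear.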

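【Question comments】:

Could you elaborate on "contain values for every date of the month"? Do you mean there is data for every day of the month?
Yes, data for every day of the month. So if the data starts on 2013-3-13, the subset should start from the following month. Assume the data is continuous after the start date.
If by "incomplete data" you mean NaN, you can drop the rows that have NaN values. Doesn't that solve your problem?
There are no NaNs. Someone handed me this data. It starts in the middle of one month and ends in the middle of another. I want to subset the data from the start of the 2nd month to the end of the 11th month.

Tags: python pandas dataframe datetime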

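【Solution 1】:

When you say a "better" solution, I assume you mean making the range dynamic, based on the input data.

Since you mention that your data is continuous after the start date, it is safe to assume the dates are sorted in ascending order. With that in mind, consider the following code: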

import pandas as pd
import numpy as np  
from datetime import timedelta

dates = pd.date_range("20100121", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=list("A"))
print(df)
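# Take the last value in each business-day bin; .iloc makes the lookup explicitly positional
dfb = df.resample('B').apply(lambda x: x.iloc[-1])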

# fd is the first index in your dataframe
fd = df.index[0]
# If the first month is incomplete (the data does not start on day 1),
# move to the first day of the following month.
if fd.day != 1:
    if fd.month == 12:
        # December rolls over into January of the next year
        first_day_of_next_month = fd.replace(year=fd.year + 1, month=1, day=1)
    else:
        first_day_of_next_month = fd.replace(month=fd.month + 1, day=1)
else:
    first_day_of_next_month = fd

# ld is the last index in your dataframe
ld = df.index[-1]
# computes the next day
next_day = ld + timedelta(days=1)
# Compare with != rather than >, so that Dec 31 -> Jan 1 also counts as a month change
if next_day.month != ld.month:
    last_day_of_prev_month = ld  # ld is already the last day of its month
else:
    # step back to the last day of the previous month
    last_day_of_prev_month = ld.replace(day=1) - timedelta(days=1)


df_out = dfb.loc[first_day_of_next_month:last_day_of_prev_month]

There is another way to do this using dateutil.relativedelta, but it requires the python-dateutil module. The solution above tries to do it without any extra modules.
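
The dateutil.relativedelta variant mentioned above is not shown in the answer; the following is only a rough sketch of how it might look. The helper name `full_month_bounds` is made up for illustration, and the sketch assumes a sorted daily index with midnight timestamps, as in the question. It relies on `relativedelta` applying absolute fields (`day=1`) before relative ones (`months=1`, `days=-1`).

```python
import pandas as pd
from dateutil.relativedelta import relativedelta


def full_month_bounds(index):
    """Return (start, end) spanning only the complete months of a sorted daily index.

    Hypothetical helper, not part of pandas or dateutil.
    """
    first = index[0].to_pydatetime()
    last = index[-1].to_pydatetime()
    # Keep the first date if it is already day 1; otherwise jump to day 1 of the next month.
    start = first if first.day == 1 else first + relativedelta(months=1, day=1)
    # `last` is a month end exactly when the following day is day 1.
    if (last + relativedelta(days=1)).day == 1:
        end = last
    else:
        # Set the day to 1, then step back one day: the last day of the previous month.
        end = last + relativedelta(day=1, days=-1)
    return start, end


dates = pd.date_range("2010-01-21", periods=3653)
start, end = full_month_bounds(dates)
print(start.date(), end.date())
```

With bounds computed this way, the subset would be taken as `dfb.loc[start:end]`. Year rollover (a start date in December) needs no special case here, because `relativedelta` handles the month arithmetic.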

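【Comments】:

【Solution 2】:

I assume that, in the general case, the table is in chronological order (if it is not, use .sort_index). The idea is to extract the year and month from the dates and keep only the rows whose (year, month) differs from that of both the first row and the last row.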

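# Add helper columns holding the year and month of each row's date
dfb['year'] = dfb.index.year
dfb['month'] = dfb.index.month

# Rows in the same (year, month) as the first row / the last row.
# Looking the values up by column name avoids depending on column positions.
first_month = (dfb['year'] == dfb['year'].iloc[0]) & (dfb['month'] == dfb['month'].iloc[0])
last_month = (dfb['year'] == dfb['year'].iloc[-1]) & (dfb['month'] == dfb['month'].iloc[-1])

# Keep everything outside those two months, then drop the helper columns
dfb = dfb.loc[(~first_month) & (~last_month)]
dfb = dfb.drop(['year', 'month'], axis=1)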
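
One thing to note about the snippet above: it always drops the first and last month, even when the data happens to start on day 1 or end on a month end. A possible variation, sketched here under the assumption that the original daily index (`df.index` in the question) is still available, drops an edge month only when it is actually partial. The function name `drop_partial_edge_months` is invented for this example, and it uses monthly periods instead of helper columns, so `dfb` itself is not modified.

```python
import numpy as np
import pandas as pd


def drop_partial_edge_months(frame, daily_index):
    """Drop rows in the first/last month of `frame`, but only if that month
    is incomplete in `daily_index`. Hypothetical helper for illustration."""
    months = frame.index.to_period("M")  # monthly period of every row
    first, last = daily_index[0], daily_index[-1]
    mask = np.ones(len(frame), dtype=bool)
    if not first.is_month_start:
        # The data starts mid-month: exclude that whole month
        mask &= np.asarray(months != first.to_period("M"))
    if not last.is_month_end:
        # The data ends mid-month: exclude that whole month
        mask &= np.asarray(months != last.to_period("M"))
    return frame[mask]


dates = pd.date_range("2010-01-21", periods=3653)
df = pd.DataFrame(np.random.randn(3653, 1), index=dates, columns=["A"])
dfb = df.resample("B").last()
df_out = drop_partial_edge_months(dfb, df.index)
print(df_out.index[0].date(), df_out.index[-1].date())
```

The completeness check is made on the daily index rather than on `dfb.index`, because a business-day index never contains weekend dates and so can miss a month that starts or ends on a weekend.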

【Comments】: