Setting up the data
First, create the dataframe in a reproducible way:
import datetime as dt
import pandas as pd
# provided data
data = [('2019-08-23', '10'), ('2019-06-23', '18'),('2019-07-21', '05'),
('2019-09-09', '09'), ('2019-09-19', '04'), ('2019-08-27', '22'),
('2019-05-03', '02'), ('2019-06-27', '07'), ('2019-05-25', '19'),
('2019-04-27', '02'), ('2019-01-19', '02'), ('2019-05-28', '10'),
('2019-02-22', '09'), ('2019-01-25', '06'), ('2019-10-22', '17'),
('2019-11-02', '13'), ('2019-10-29', '17'), ('2019-03-11', '18'),
('2019-03-11', '19'), ('2019-10-19', '19'), ('2019-02-17', '12'),
('2019-10-21', '01'), ('2019-09-01', '08'), ('2019-01-15', '09'),
('2019-11-15', '08'), ('2019-10-10', '18'), ('2019-03-31', '01'),
('2019-08-17', '01'), ('2019-05-27', '07'), ('2019-02-24', '20'),
('2019-11-03', '21'), ('2019-06-28', '21'), ('2019-01-06', '00'),
('2019-03-30', '23'), ('2019-06-27', '04'), ('2019-03-08', '19'),
('2019-01-30', '09'), ('2019-11-15', '02'), ('2019-06-04', '09'),
('2019-05-03', '14'), ('2019-07-01', '08'), ('2019-09-20', '19'),
('2019-05-15', '12'), ('2019-05-17', '02'), ('2019-09-21', '20'),
('2019-02-14', '14')]
# create df
df = pd.DataFrame.from_records(data, columns=('date', 'amount'))
It looks like you are working with object dtypes. This operation is much easier with the correct dtypes:
# convert dtypes
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['amount'] = df['amount'].astype('int')
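Because errors='coerce' silently turns unparseable strings into NaT, it can be worth checking the result right after converting. A minimal sketch on a small made-up frame (the `sample` frame and its values are illustrative assumptions, not part of the data above):

```python
import pandas as pd

# small illustrative frame; the second date is deliberately invalid
sample = pd.DataFrame({'date': ['2019-08-23', 'not-a-date'],
                       'amount': ['10', '05']})

# same conversions as above
sample['date'] = pd.to_datetime(sample['date'], errors='coerce')
sample['amount'] = sample['amount'].astype('int')

# count the rows whose date could not be parsed and became NaT
n_bad = int(sample['date'].isna().sum())
print(sample.dtypes)
print(n_bad)
```

If the count is not zero, you probably want to inspect those rows before filtering by date, since NaT never matches a date comparison.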
To make it easier to see what we are looking at, I sorted the data so the results are simpler to evaluate:
df = df.sort_values(['date', 'amount']).reset_index(drop=True)
df.head()
date amount
0 2019-01-06 0
1 2019-01-15 9
2 2019-01-19 2
3 2019-01-25 6
4 2019-01-30 9
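Retrieving the data
Recommendation
Getting a collection/list/dict of dataframes can get a bit messy, so you may want to consider whether that is a real requirement. If it is not, you can filter ad hoc from a single dataframe in several ways by accessing df['date'].dt: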
# getting things in a certain month
mar_df = df[df['date'].dt.month == 3] # only filtered on month
mar_df = df[(df['date'].dt.month == 3) & (df['date'].dt.year == 2019)] # month & year
# getting values in a range of months
mar_jul_df = df[df['date'].dt.month.between(3, 7)]
mar_jul_df = df[(df['date'].dt.year == 2019) & (df['date'].dt.month.between(3, 7))]
# getting values between two dates
mar_jul_df = df[(df['date'] >= dt.datetime(2019, 3, 1)) & (df['date'] <= dt.datetime(2019, 7, 31))]
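Doing it this way, you can collect filtered dataframes as needed, with more control and arguably better readability. Note that the month-number filters do not cover the case where the data you need starts in, say, December 2018 and ends in April 2019.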
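To illustrate that caveat, here is a rough sketch on made-up data that crosses a year boundary (the `demo` frame is an assumption for illustration only). Filtering on month numbers finds nothing, while comparing against explicit timestamps works:

```python
import pandas as pd

# illustrative data spanning a year boundary (not the data above)
demo = pd.DataFrame({
    'date': pd.to_datetime(['2018-11-30', '2018-12-15', '2019-02-01',
                            '2019-04-30', '2019-05-01']),
    'amount': [1, 2, 3, 4, 5],
})

# month numbers cannot express "December through April": no month is >= 12 and <= 4
wrong = demo[demo['date'].dt.month.between(12, 4)]

# half-open interval: on or after 2018-12-01, strictly before 2019-05-01
start = pd.Timestamp(2018, 12, 1)
stop = pd.Timestamp(2019, 5, 1)
dec_apr_df = demo[(demo['date'] >= start) & (demo['date'] < stop)]
print(dec_apr_df)
```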
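Getting a date range lets us obtain the upper and lower bounds we are looking for, or a range of dates at a specified frequency, which makes this more flexible.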
# getting upper and lower bounds
>>> start_stop_date = pd.date_range(end=dt.datetime(2019, 8, 1), freq='5MS', periods=2)
>>> start_stop_date
DatetimeIndex(['2019-03-01', '2019-08-01'], dtype='datetime64[ns]', freq='5MS')
With this, we can filter values using these bounds:
# setting two conditions -- on or after start & before end
mar_jul_df = df[(df['date'] >= start_stop_date[0]) & (df['date'] < start_stop_date[1])]
# modifying boundaries to exclude 2019-08-01
# (a DatetimeIndex is immutable, so compute the new upper bound as a separate value)
start, stop = start_stop_date[0], start_stop_date[1] - dt.timedelta(days=1)
mar_jul_df = df[df['date'].between(start, stop)]
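Subtracting one day works here because the dates carry no time component. If they did, a row at midday on 2019-07-31 would be dropped. One possible alternative, assuming pandas 1.3 or later where Series.between accepts a string for `inclusive`, is to keep the original upper bound and close the interval on the left only. The `demo` frame below is made up for illustration:

```python
import pandas as pd

# illustrative data with a time component on the last day of July
demo = pd.DataFrame({
    'date': [pd.Timestamp(2019, 3, 1),
             pd.Timestamp(2019, 7, 31, 12),
             pd.Timestamp(2019, 8, 1)],
    'amount': [1, 2, 3],
})

start, stop = pd.date_range(end=pd.Timestamp(2019, 8, 1), freq='5MS', periods=2)

# inclusive='left' keeps start and excludes stop (assumes pandas >= 1.3)
left_closed = demo[demo['date'].between(start, stop, inclusive='left')]

# subtracting one day instead misses the 12:00 row on 2019-07-31
minus_one_day = demo[demo['date'].between(start, stop - pd.Timedelta(days=1))]
```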
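Sets of dataframes
Simplest case
If your solution needs to return five separate dataframes and your data always falls within a single year, the simplest solution is probably a list comprehension over the months of interest: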
# list comprehension
df_list = [df[df['date'].dt.month == mo] for mo in range(3, 8)]
# returning individual dfs
mar_df, apr_df, may_df, jun_df, jul_df = df_list
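As a side note, a different technique from the list comprehension above is to let groupby do the splitting on a monthly period. This is only a sketch on made-up data (the `demo` frame and the `by_month` name are assumptions). It also works across years, but it only returns months that actually contain rows:

```python
import pandas as pd

# illustrative data (not the data above)
demo = pd.DataFrame({
    'date': pd.to_datetime(['2019-03-08', '2019-03-31', '2019-04-27', '2019-07-01']),
    'amount': [19, 1, 2, 8],
})

# group rows by calendar month; each group key is a pandas Period such as 2019-03
by_month = {str(period): group
            for period, group in demo.groupby(demo['date'].dt.to_period('M'))}
print(list(by_month))
```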
Realistic case
Beyond this simple case, you will need to use pd.date_range.
# getting range of dates
>>> boundary_dates = pd.date_range(end=dt.datetime(2019, 8, 1), freq='MS', periods=6)
>>> boundary_dates
DatetimeIndex(['2019-03-01', '2019-04-01', '2019-05-01', '2019-06-01', '2019-07-01', '2019-08-01'],
dtype='datetime64[ns]', freq='MS')
This gives you six dates, which yield five pairs of boundaries. You can use zip to create a list of boundary pairs:
>>> [[l_bound, u_bound] for l_bound, u_bound in zip(boundary_dates, boundary_dates[1:])]
[[Timestamp('2019-03-01 00:00:00', freq='MS'), Timestamp('2019-04-01 00:00:00', freq='MS')],
[Timestamp('2019-04-01 00:00:00', freq='MS'), Timestamp('2019-05-01 00:00:00', freq='MS')],
[Timestamp('2019-05-01 00:00:00', freq='MS'), Timestamp('2019-06-01 00:00:00', freq='MS')],
[Timestamp('2019-06-01 00:00:00', freq='MS'), Timestamp('2019-07-01 00:00:00', freq='MS')],
[Timestamp('2019-07-01 00:00:00', freq='MS'), Timestamp('2019-08-01 00:00:00', freq='MS')]]
To make use of pd.Series.between, subtract dt.timedelta(days=1) from each upper bound again.
boundaries = [[l_bound, u_bound - dt.timedelta(days=1)] for
l_bound, u_bound in zip(boundary_dates, boundary_dates[1:])]
# between() needs both bounds, so unpack each [lower, upper] pair
df_list = [df[df['date'].between(*b)] for b in boundaries]
mar_df, apr_df, may_df, jun_df, jul_df = df_list
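Since you need something dynamic, you will not want to name each dataframe by hand every time. Returning them as a dictionary lets you assign each dataframe to a key (built with dt.datetime.strftime), which makes it easier to pull out: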
df_dict = {b[0].strftime('%b_%y_df'):
           df[df['date'].between(b[0], b[1])] for b in boundaries}
You can still easily access the individual dataframes with df_dict.values(), since each value holds one dataframe.
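Having the frames in a dictionary also makes it easy to loop over them. As a rough sketch, with a small stand-in for df_dict built from made-up frames (the values and the `totals` name are assumptions), a per-month total could be computed like this:

```python
import pandas as pd

# stand-in for df_dict with made-up frames
df_dict = {
    'Mar_19_df': pd.DataFrame({'amount': [19, 18]}),
    'Apr_19_df': pd.DataFrame({'amount': [2]}),
}

# one total per key, iterating over key/dataframe pairs
totals = {key: int(frame['amount'].sum()) for key, frame in df_dict.items()}
print(totals)
```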
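Creating a function
To wrap these steps in a function that gives you flexibility over the year and month you are looking at, as well as the number of months to return: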
def monthly_dfs(df, year, month, n=5):
    """return a number of dataframes for the n months preceding a given month"""
    # generate list of boundaries for months of interest
    before_dt = dt.datetime(year, month, 1)
    boundary_dates = pd.date_range(end=before_dt, freq='MS', periods=n+1)
    # get boundary pairs
    boundaries = [[l_bound, u_bound - dt.timedelta(days=1)] for
                  l_bound, u_bound in zip(boundary_dates, boundary_dates[1:])]
    # return df within each boundary pair with key according to month start
    return {b[0].strftime('%b_%y_df'):
            df[df['date'].between(b[0], b[1])] for b in boundaries}
df_dict = monthly_dfs(df, 2019, 8)
mar_df, apr_df, may_df, jun_df, jul_df = df_dict.values()
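Because the boundaries come from pd.date_range rather than month numbers, the same function should also handle a window that crosses a year boundary. A small check on made-up data (the `demo` frame and the `winter` name are assumptions; the function is repeated so the snippet runs on its own, and the key names assume an English locale for %b):

```python
import datetime as dt
import pandas as pd

def monthly_dfs(df, year, month, n=5):
    """return a number of dataframes for the n months preceding a given month"""
    before_dt = dt.datetime(year, month, 1)
    boundary_dates = pd.date_range(end=before_dt, freq='MS', periods=n+1)
    boundaries = [[l_bound, u_bound - dt.timedelta(days=1)] for
                  l_bound, u_bound in zip(boundary_dates, boundary_dates[1:])]
    return {b[0].strftime('%b_%y_df'):
            df[df['date'].between(b[0], b[1])] for b in boundaries}

# illustrative data around the 2018/2019 boundary (not the data above)
demo = pd.DataFrame({
    'date': pd.to_datetime(['2018-11-20', '2018-12-31', '2019-01-06', '2019-02-14']),
    'amount': [3, 7, 0, 14],
})

# the three months before February 2019
winter = monthly_dfs(demo, 2019, 2, n=3)
print(list(winter))
```

The February row is left out because the given month itself is the exclusive upper limit.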
Output
Reformatted slightly, here is df_dict:
{
'Mar_19_df':
date amount
9 2019-03-08 19
10 2019-03-11 18
11 2019-03-11 19
12 2019-03-30 23
13 2019-03-31 1,
'Apr_19_df':
date amount
14 2019-04-27 2,
'May_19_df':
date amount
15 2019-05-03 2
16 2019-05-03 14
17 2019-05-15 12
18 2019-05-17 2
19 2019-05-25 19
20 2019-05-27 7
21 2019-05-28 10,
'Jun_19_df':
date amount
22 2019-06-04 9
23 2019-06-23 18
24 2019-06-27 4
25 2019-06-27 7
26 2019-06-28 21,
'Jul_19_df':
date amount
27 2019-07-01 8
28 2019-07-21 5
}
These can be accessed using the keys that were created, for example:
>>> df_dict['Mar_19_df']
date amount
9 2019-03-08 19
10 2019-03-11 18
11 2019-03-11 19
12 2019-03-30 23
13 2019-03-31 1