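[Posted]: 2014-03-07 04:04:58
[Question]:
Python 3.3.3, pandas 0.12.0
I'm fairly new to pandas and to Python in general. I have a DataFrame with a datetime index (hours worked over time). I also have a list of the dates I took off as vacation (also as datetimes).
I have resampled the DataFrame into 28-day blocks, and I'd like to add a column holding the number of vacation days that fall within each 28-day block.
I'm sure there is some great way to do this with a list comprehension, but the best I could come up with is a couple of for loops that build a list, which I then assign to the DataFrame.
I'm hoping someone can point me toward a less clunky way of doing this. Thanks in advance for any suggestions.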
num_vaca = []
for idx, month_start in enumerate(df.index):
    counter = 0
    for day in vaca_days:
        if day >= month_start and day <= df.index[idx + 1]:
            counter += 1
    num_vaca.append(counter)
df['Vacation Days'] = num_vaca
This seems to work fine; I'm just sure there is a better way.
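One possible way to drop the nested loops is a binary search over the sorted vacation days. The sketch below uses made-up sample data and assumes the period edges are known, including one closing edge after the last period start. It also assumes half-open periods, so a vacation day that falls exactly on an edge counts toward the period that starts there; the loop above counts such a day in both neighbouring periods.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: three period starts plus one closing edge.
edges = pd.to_datetime(['2013-01-01', '2013-01-29', '2013-02-26', '2013-03-26'])
vaca_days = pd.to_datetime(['2013-01-02', '2013-01-03', '2013-01-29',
                            '2013-02-05', '2013-03-01'])

# Sort once so that binary search can be used.
vaca_sorted = np.sort(vaca_days.values)

# Position of every edge within the sorted vacation days. side='left' puts a
# vacation day that equals an edge into the period starting at that edge.
pos = np.searchsorted(vaca_sorted, edges.values, side='left')

# Differences between consecutive positions are the counts per period,
# i.e. the number of vacation days in [edges[i], edges[i + 1]).
num_vaca = np.diff(pos)

df = pd.DataFrame({'Vacation Days': num_vaca}, index=edges[:-1])
print(df)
```

The work is one sort plus one binary search per edge, instead of comparing every vacation day against every period.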
Edit: I didn't explain the problem very well, so I think some examples may help.
import datetime
import numpy as np
import pandas as pd

vaca_days = [datetime.datetime(2013, 1, 1, 0, 0),
             datetime.datetime(2013, 1, 2, 0, 0),
             datetime.datetime(2013, 1, 3, 0, 0),
             datetime.datetime(2013, 2, 4, 0, 0),
             datetime.datetime(2013, 2, 5, 0, 0),
             datetime.datetime(2013, 2, 6, 0, 0),
             datetime.datetime(2013, 1, 7, 0, 0),
             datetime.datetime(2013, 1, 8, 0, 0),
             datetime.datetime(2013, 1, 9, 0, 0)]
dr = pd.date_range('11/1/2012', periods = 6, freq = 'M')
df = pd.DataFrame(np.random.randn(len(dr)), index = dr, columns = ['Random'])
df
# Output:
              Random
2012-11-30 -0.134878
2012-12-31  0.066865
2013-01-31  0.630906
2013-02-28 -0.512881
2013-03-31  0.361890
2013-04-30 -1.019923
# Code from above
num_vaca = []
for idx, month_start in enumerate(df.index):
    counter = 0
    for day in vaca_days:
        if day >= month_start and day <= df.index[idx + 1]:
            counter += 1
    num_vaca.append(counter)
df['Vacation Days'] = num_vaca
df
# Output
              Random  Vacation Days
2012-11-30 -0.134878              0
2012-12-31  0.066865              6
2013-01-31  0.630906              3
2013-02-28 -0.512881              0
2013-03-31  0.361890              0
2013-04-30 -1.019923              0
There is still a problem with the last index value and [idx + 1], because I haven't defined where that final period should stop.
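One way to sidestep the [idx + 1] problem is to give the last period an explicit end before counting. The sketch below rebuilds the month-end example and assumes that the final period runs until the next month end; that choice, along with the names `closing_edge` and `edges`, is an assumption for illustration and not something the data dictates. It also treats each period as half-open (start included, end excluded), so no day is counted twice.

```python
import datetime
import pandas as pd

vaca_days = [datetime.datetime(2013, 1, 1), datetime.datetime(2013, 1, 2),
             datetime.datetime(2013, 1, 3), datetime.datetime(2013, 2, 4),
             datetime.datetime(2013, 2, 5), datetime.datetime(2013, 2, 6),
             datetime.datetime(2013, 1, 7), datetime.datetime(2013, 1, 8),
             datetime.datetime(2013, 1, 9)]

# Same month-end labels as in the example, written out explicitly.
index = pd.to_datetime(['2012-11-30', '2012-12-31', '2013-01-31',
                        '2013-02-28', '2013-03-31', '2013-04-30'])
df = pd.DataFrame({'Random': [0.0] * len(index)}, index=index)

# Assumption: the last period ends at the month end after the last label.
closing_edge = df.index[-1] + pd.offsets.MonthEnd(1)
edges = list(df.index) + [closing_edge]

# Pair every label with the following edge and count the days in between.
df['Vacation Days'] = [
    sum(start <= day < end for day in vaca_days)
    for start, end in zip(edges[:-1], edges[1:])
]
print(df)
```

Because every label now has a partner in `edges[1:]`, the loop never indexes past the end of the index.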
Update, February 10, 2014
I was able to get @Andy-Hayden's answer to work with a bit of elbow grease. Here are the problems I ran into and the best workaround I could find.
#! /usr/bin/env python3
import pandas as pd
import numpy as np
import datetime
# Generate a DataFrame with some random data and 28d date periods
dr = pd.date_range('11/6/2012', periods = 12, freq = 'W-Tue')
df = pd.DataFrame(np.random.randn(len(dr)), index = dr, columns = ['column1'])
df = df.resample(rule = '28D', how = 'mean')
# The start dates for each period
df
#              column1
# 2012-11-06 -0.134757
# 2012-12-04 -0.382133
# 2013-01-01 -0.468343
# 2013-01-29  0.000000
# [4 rows x 1 columns]
# Vacation dates
vaca_days = list(pd.date_range(start = '2012-12-1', periods = 12, freq = 'D'))
vaca_days
# [Timestamp('2012-12-01 00:00:00', tz=None),
# Timestamp('2012-12-02 00:00:00', tz=None),
# Timestamp('2012-12-03 00:00:00', tz=None),
# Timestamp('2012-12-04 00:00:00', tz=None),
# Timestamp('2012-12-05 00:00:00', tz=None),
# Timestamp('2012-12-06 00:00:00', tz=None),
# Timestamp('2012-12-07 00:00:00', tz=None),
# Timestamp('2012-12-08 00:00:00', tz=None),
# Timestamp('2012-12-09 00:00:00', tz=None),
# Timestamp('2012-12-10 00:00:00', tz=None),
# Timestamp('2012-12-11 00:00:00', tz=None),
# Timestamp('2012-12-12 00:00:00', tz=None)]
# Clearly there should be 3 vacation dates in the 2012-11-06 period
# and 9 vacation dates in the 2012-12-04 period
# ---------------------------
# This gives me wrong numbers, because it tries to start the period on vaca_days[0]
pd.Series(1, vaca_days).resample('28D', how='sum')
# 2012-12-01 12
# 2012-12-29 0
# Freq: 28D, dtype: int64
# No help here
pd.Series(1, vaca_days).resample('28D', how='sum').shift(-1)
# 2012-12-01 0
# 2012-12-29 NaN
# Freq: 28D, dtype: float64
# The only workaround I could figure out: Append on the first start day
vaca_days.append(pd.Timestamp('2012-11-06'))
vaca_series = pd.Series([0] + [1 for each in vaca_days[:-1]], sorted(vaca_days))
vaca_series = vaca_series.resample(rule = '28D', how = sum)
vaca_series
# 2012-11-06 3
# 2012-12-04 9
# 2013-01-01 0
# Freq: 28D, dtype: int64
df['Vacation Days'] = vaca_series
df['Vacation Days'] = df['Vacation Days'].fillna(0)
df
#                   column1  Vacation Days
# 2012-11-06  -4.761727e-01              3
# 2012-12-04   5.715697e-01              9
# 2013-01-01   4.065451e-02              0
# 2013-01-29  2.781364e-309              0
# [4 rows x 2 columns]
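The sentinel row appended above is only needed because the resample starts its bins at the first vacation day. A possible alternative, sketched here under the assumption that all blocks are exactly 28 days long and start on 2012-11-06, is to compute each vacation day's block number with integer division and count the occurrences. The names `starts`, `block` and `counts` are illustrative.

```python
import pandas as pd

# Block starts matching the resampled DataFrame, and the vacation days.
starts = pd.date_range('2012-11-06', periods=4, freq='28D')
vaca_days = pd.date_range('2012-12-01', periods=12, freq='D')

# Days elapsed since the first block start, divided into 28-day blocks:
# 0 is the first block, 1 the second, and so on.
block = (vaca_days - starts[0]).days // 28

# Count the vacation days per block number. Blocks without any vacation day
# are filled with 0; block numbers outside the known blocks are dropped.
counts = (pd.Series(block)
          .value_counts()
          .reindex(list(range(len(starts))), fill_value=0))
counts.index = starts

df = pd.DataFrame(index=starts)
df['Vacation Days'] = counts
print(df)
```

Since the block number depends only on the fixed start date, the result does not change with where the vacation list happens to begin.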
[Comments]:
Tags: python python-3.x pandas