【Posted】: 2014-11-26 23:14:53
【Question】:
I have two datasets that look like this:
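What I want to do is filter the non-trading days out of the "data" DataFrame. I assume this means comparing data.index.date for each row against the index dates of trading_days and returning the row when there is a match. If there is no match, it is not a trading day and the row is not returned. This effectively filters the non-trading days out of the dataset.
However, checking row by row whether the two data.index.date values are equal, and using apply() to return the rows, seems inefficient. I suspect there is a more efficient way to do this, which matters because I will be running it on a 180M-row DataFrame.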
Is there some kind of "merge" or "join", such as:
data.join(trading_days)
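that keeps only the rows whose data.index.date matches? I need all of the minute-level information (as shown in the "data" DataFrame), with only the non-trading dates filtered out. Thanks for your help!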
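To make the goal concrete, here is a minimal sketch of the kind of vectorized filter being asked about. It is not a confirmed solution for the real data: the small `data` and `trading_days` frames below are hypothetical stand-ins, and it assumes trading_days carries its dates in a timezone-naive DatetimeIndex. The idea is to truncate each timestamp to midnight with `normalize()` and test membership with `isin()`, rather than comparing Python date objects row by row.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the minute-level "data" frame (tz-aware index).
idx = pd.date_range("2005-01-02 13:59", periods=3 * 24 * 60,
                    freq="min", tz="US/Eastern")
data = pd.DataFrame({"close": np.arange(len(idx), dtype=float)}, index=idx)

# Hypothetical stand-in for trading_days: one index entry per trading date.
# Assumption: the real frame has (or can be given) a naive DatetimeIndex.
trading_days = pd.DataFrame(index=pd.to_datetime(["2005-01-03", "2005-01-04"]))

# Truncate every timestamp to local midnight, then drop the timezone so the
# values are comparable with the naive trading dates.
row_dates = data.index.normalize().tz_localize(None)

# Vectorized membership test: True for rows that fall on a trading day.
mask = row_dates.isin(trading_days.index)

# Boolean indexing keeps all minute-level rows for the matching days.
filtered = data[mask]
print(filtered.index.min(), filtered.index.max(), len(filtered))
```

The boolean mask is built in one pass over the index, so no Python-level apply() over the rows is involved. Whether this is fast enough at 180M rows would need to be measured on the real data.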
Update to include values (let me know if there is a better way to paste this):
In[5]: data.head(30).values
Out[6]:
array([[ 438.9, 438.9, 438.9, 438.9, 0. ],
[ 438.9, 438.9, 438.7, 438.7, 31. ],
[ 438.6, 438.6, 438.6, 438.6, 7. ],
[ 438.4, 438.7, 438.4, 438.4, 4. ],
[ 438.4, 438.4, 438.3, 438.3, 4. ],
[ 438.2, 438.2, 438.2, 438.2, 1. ],
[ 438.2, 438.2, 438.2, 438.2, 0. ],
[ 438.2, 438.2, 438.2, 438.2, 1. ],
[ 438.2, 438.2, 438.2, 438.2, 0. ],
[ 438.1, 438.1, 438.1, 438.1, 3. ],
[ 438. , 438. , 437.9, 438. , 6. ],
[ 438. , 438.2, 438. , 438. , 8. ],
[ 438.2, 438.2, 438.1, 438.1, 6. ],
[ 438.1, 438.1, 438.1, 438.1, 4. ],
[ 438.1, 438.1, 438.1, 438.1, 0. ],
[ 438.3, 438.3, 438.3, 438.3, 1. ],
[ 438.3, 438.3, 438.3, 438.3, 0. ],
[ 438.3, 438.3, 438.3, 438.3, 0. ],
[ 438.1, 438.1, 438.1, 438.1, 1. ],
[ 438. , 438. , 437.9, 437.9, 54. ],
[ 437.8, 437.8, 437.8, 437.8, 10. ],
[ 437.8, 437.8, 437.8, 437.8, 1. ],
[ 437.8, 437.8, 437.8, 437.8, 6. ],
[ 437.8, 437.8, 437.8, 437.8, 0. ],
[ 437.9, 438. , 437.9, 438. , 12. ],
[ 437.9, 438. , 437.9, 438. , 0. ],
[ 437.9, 438. , 437.9, 438. , 0. ],
[ 437.9, 438. , 437.9, 438. , 0. ],
[ 437.9, 437.9, 437.9, 437.9, 1. ],
[ 437.9, 437.9, 437.8, 437.8, 4. ]])
Here are the timestamps:
In[10]: data.head(30).index.values
Out[11]:
array(['2005-01-02T13:59:00.000000000-0500',
'2005-01-02T14:00:00.000000000-0500',
'2005-01-02T14:01:00.000000000-0500',
'2005-01-02T14:02:00.000000000-0500',
'2005-01-02T14:03:00.000000000-0500',
'2005-01-02T14:04:00.000000000-0500',
'2005-01-02T14:05:00.000000000-0500',
'2005-01-02T14:06:00.000000000-0500',
'2005-01-02T14:07:00.000000000-0500',
'2005-01-02T14:08:00.000000000-0500',
'2005-01-02T14:09:00.000000000-0500',
'2005-01-02T14:10:00.000000000-0500',
'2005-01-02T14:11:00.000000000-0500',
'2005-01-02T14:12:00.000000000-0500',
'2005-01-02T14:13:00.000000000-0500',
'2005-01-02T14:14:00.000000000-0500',
'2005-01-02T14:15:00.000000000-0500',
'2005-01-02T14:16:00.000000000-0500',
'2005-01-02T14:17:00.000000000-0500',
'2005-01-02T14:18:00.000000000-0500',
'2005-01-02T14:19:00.000000000-0500',
'2005-01-02T14:20:00.000000000-0500',
'2005-01-02T14:21:00.000000000-0500',
'2005-01-02T14:22:00.000000000-0500',
'2005-01-02T14:23:00.000000000-0500',
'2005-01-02T14:24:00.000000000-0500',
'2005-01-02T14:25:00.000000000-0500',
'2005-01-02T14:26:00.000000000-0500',
'2005-01-02T14:27:00.000000000-0500',
'2005-01-02T14:28:00.000000000-0500'], dtype='datetime64[ns]')
trading_days is read from a CSV available here: http://pastebin.com/5N01Gi5V
Second update:
【Comments】:
- Could you post some sample data as plain text?