【Posted】: 2017-03-14 20:07:18
【Question】:
I have the following code to clean a dataset.
library(dplyr)
library(lubridate)

data1 <- data1 %>%
  mutate(YEAR = year(DATE),
         MONTH = month(DATE),
         DAY = day(DATE),
         HOUR = hour(TIME),
         MINUTE = minute(TIME),
         # named RET so that it matches the RET^2 used in summarize() below
         RET = ((PRICE - lag(PRICE)) / lag(PRICE))
  ) %>%
  filter(HOUR >= 9, (HOUR <= 16 & MINUTE <= 61)) %>%
  group_by(MINUTE, HOUR, DAY, MONTH, YEAR) %>%
  summarize(AV.PRICE = mean(PRICE, na.rm = TRUE),
            SUM.SIZE = sum(SIZE, na.rm = TRUE),
            RV = sum(RET^2)) %>%
  arrange(YEAR, MONTH, DAY, HOUR, MINUTE) %>%
  mutate(DATETIME = as.POSIXct(
    paste(YEAR, "/", MONTH, "/", DAY, " ", HOUR, ":", MINUTE, ":00", sep = ""),
    format = "%Y/%m/%d %H:%M:%S", origin = "1970-01-01")
  )
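However, it sometimes gives me this error message: Error: 'origin' must be supplied
Strangely, the error does not appear the first time I run this code in a session, only on subsequent runs. If I restart the session, the problem disappears for one run and then returns on later runs. So I always have to restart to get it to work.
I checked this question: How to solve: "Error in as.POSIXct.numeric(X[[2L]], ...) : 'origin' must be supplied", which suggests the cause could be a conversion from integer to time. However, glimpse of the data shows that DATE is of class <date>, not integer.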
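For context on the linked question, here is a minimal base R sketch of the two kinds of input as.POSIXct can receive. The values and tz = "UTC" are assumptions chosen for illustration and are not taken from the dataset. As far as I know, origin is only consulted when the input is numeric, and in R versions before 4.3.0 a numeric input without origin raises the error quoted above. A character input is parsed through format instead.

```r
# Numeric input: seconds since the epoch, so `origin` is meaningful here.
secs <- 1489522038
t_num <- as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

# Character input, pasted together the same way as in the pipeline:
# it is parsed via `format`, and no origin is needed.
txt <- paste(2008, "/", 1, "/", 2, " ", 9, ":", 0, ":00", sep = "")
t_chr <- as.POSIXct(txt, format = "%Y/%m/%d %H:%M:%S", tz = "UTC")

class(secs)  # "numeric"
class(txt)   # "character"
print(t_num)
print(t_chr)
```

If this reading is right, the error would point to something numeric reaching a date-time conversion somewhere in the pipeline, so checking class() of each input right before the failing call may help narrow it down.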
To be safe, I followed the advice in the error message and added an origin = "1970-01-01" argument to every function that handles dates:
data1 <- data1 %>%
  mutate(YEAR = year(DATE, origin = "1970-01-01"),
         MONTH = month(DATE, origin = "1970-01-01"),
         DAY = day(DATE, origin = "1970-01-01"),
         HOUR = hour(TIME, origin = "1970-01-01"),
         MINUTE = minute(TIME, origin = "1970-01-01"),
         RET = ((PRICE - lag(PRICE)) / lag(PRICE))
  ) %>%
  filter(HOUR >= 9, (HOUR <= 16 & MINUTE <= 61)) %>%
  group_by(MINUTE, HOUR, DAY, MONTH, YEAR) %>%
  summarize(AV.PRICE = mean(PRICE, na.rm = TRUE),
            SUM.SIZE = sum(SIZE, na.rm = TRUE),
            RV = sum(RET^2)
  ) %>%
  arrange(YEAR, MONTH, DAY, HOUR, MINUTE) %>%
  mutate(DATETIME = as.POSIXct(
    paste(YEAR, "/", MONTH, "/", DAY, " ", HOUR, ":", MINUTE, ":00", sep = ""),
    format = "%Y/%m/%d %H:%M:%S", origin = "1970-01-01")
  )
It returns Error: unused argument (origin = "1970-01-01")
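The "unused argument" message is what R reports when a function is called with a named argument it does not declare and has no ... to absorb it. The sketch below reproduces that with get_year, a hypothetical stand-in written for this illustration. It is not lubridate's implementation, only a function that, like the accessors used above, takes the date as its single argument.

```r
# Hypothetical accessor that accepts only `x` (illustration only).
get_year <- function(x) as.POSIXlt(x)$year + 1900

d <- as.Date("2008-01-02")
get_year(d)

# Passing an argument the function does not declare fails before any
# date handling happens; capture the message instead of stopping.
msg <- tryCatch(
  get_year(d, origin = "1970-01-01"),
  error = function(e) conditionMessage(e)
)
print(msg)
```

So this second error says only that the accessor functions do not accept origin. On its own it does not show where the original "'origin' must be supplied" error comes from.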
In case it helps, here is a glimpse of my dataset:
Observations: 146,016,609
Variables: 4
$ DATE <date> 2008-01-02, 2008-01-02, 2008-01-02, 2008-01-02, 2008-01-02, 2008-01-02, 2008-01-02, ...
$ TIME <S4: Period> 9H 0M 4S, 9H 0M 4S, 9H 0M 4S, 9H 0M 4S, 9H 0M 4S, 9H 0M 4S, 9H 0M 4S, 9H 0M 4S...
$ PRICE <dbl> 146.86, 146.86, 146.86, 146.86, 146.86, 146.86, 146.86, 146.86, 146.86, 146.86, 146.8...
$ SIZE <int> 1000, 1000, 1000, 500, 2400, 1000, 1000, 1000, 2500, 1000, 1000, 400, 1000, 1000, 100...
I am looking for an answer that uses base package functions or, at most, lubridate/dplyr. Thanks!
【Comments】: