【Question title】: dplyr and previous observations
【Posted】: 2015-04-14 14:22:44
【Question description】:

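I need to run a bunch of linear models for each unique identifier, but first I need to do a check. For each unique id and year, I need to check that there are at least 24 months of prior monthly data, but no more than 60. So when I run the regression, it should include 24 to 60 observations of prior monthly data for each individual in each year. If there are fewer than 24 months of data for that year, drop that year for that individual; if there are more than 60 months, use only 60.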
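The 24-to-60-month rule can be sketched as a small helper. This is only an illustration under assumptions: the name `pick_window`, its arguments, and the toy data frame are hypothetical, and the rule is read as "skip when fewer than 24 rows exist, otherwise keep the most recent rows, at most 60". It uses base R so it runs without extra packages.

```r
# Hypothetical helper: given the monthly rows available up to a fiscal year,
# return the rows to regress on, or NULL if there is too little history.
pick_window <- function(d, min_obs = 24, max_obs = 60) {
  if (nrow(d) < min_obs) return(NULL)          # fewer than 24 months: skip
  d <- d[order(d$date, decreasing = TRUE), ]   # most recent month first
  head(d, max_obs)                             # keep at most 60 months
}

# Toy data (made up): 70 monthly rows with an increasing date index
toy <- data.frame(date = seq_len(70), ret = rnorm(70))

nrow(pick_window(toy))             # 60 rows kept
is.null(pick_window(toy[1:10, ]))  # TRUE: too little history
```

Returning `NULL` for short histories keeps the "drop this year" decision separate from the model fit, so the caller can decide whether to record `NA` or skip the group.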

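Thanks to this post (thanks @akrun), I was able to set up the linear models for each individual, run them, and output beta as the sum of the two betas. The problem is that this runs the regression only on the current year (12 obs) rather than on the previous 24-60.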

Edit: I realized the input was wrong... sorry

dput output:

    tdata <- structure(list(cusip = c(101L, 101L, 101L, 101L, 101L, 101L, 
101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 
101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 
101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 
101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 101L, 
101L, 101L, 101L), date = c(19901130L, 19901031L, 19900928L, 
19900831L, 19900731L, 19900629L, 19900531L, 19900430L, 19900330L, 
19900228L, 19900131L, 19891229L, 19891130L, 19891031L, 19890929L, 
19890831L, 19890731L, 19890630L, 19890531L, 19890428L, 19890331L, 
19890228L, 19890131L, 19881230L, 19881130L, 19881031L, 19880930L, 
19880831L, 19880729L, 19880630L, 19880531L, 19880429L, 19880331L, 
19880229L, 19880129L, 19871231L, 19871130L, 19871030L, 19870930L, 
19870831L, 19870731L, 19870630L, 19870529L, 19870430L, 19870331L, 
19870227L, 19870130L, 19861231L, 19861128L, 19861031L, 19860930L, 
19860829L, 19860731L), fyear = c("1990", "1990", "1990", "1990", 
"1990", "1990", "1990", "1990", "1990", "1990", "1990", "1989", 
"1989", "1989", "1989", "1989", "1989", "1989", "1989", "1989", 
"1989", "1989", "1989", "1988", "1988", "1988", "1988", "1988", 
"1988", "1988", "1988", "1988", "1988", "1988", "1988", "1987", 
"1987", "1987", "1987", "1987", "1987", "1987", "1987", "1987", 
"1987", "1987", "1987", "1986", "1986", "1986", "1986", "1986", 
"1986"), month = c("11", "10", "09", "08", "07", "06", "05", 
"04", "03", "02", "01", "12", "11", "10", "09", "08", "07", "06", 
"05", "04", "03", "02", "01", "12", "11", "10", "09", "08", "07", 
"06", "05", "04", "03", "02", "01", "12", "11", "10", "09", "08", 
"07", "06", "05", "04", "03", "02", "01", "12", "11", "10", "09", 
"08", "07"), ret = c("0.117647", "0.030303", "-0.161017", "-0.186207", 
"-0.131737", "0.128378", "0.027778", "-0.162791", "0.131579", 
"0.178295", "-0.091549", "0.163934", "-0.089552", "0.007519", 
"0.117647", "0.155340", "0.211765", "0.024096", "0.338710", "0.377778", 
"0.071429", "-0.176471", "0.378378", "-0.026316", "-0.050000", 
"-0.047619", "-0.086957", "-0.061224", "0.088889", "-0.062500", 
"-0.040000", "-0.056604", "0.081633", "0.042553", "-0.096154", 
"0.238095", "-0.263158", "-0.393617", "-0.160714", "0.400000", 
"-0.090909", "-0.200000", "-0.098361", "-0.152778", "0.000000", 
"0.107692", "0.460674", "-0.101010", "-0.019802", "0.246914", 
"-0.052632", "0.179310", "-0.064516"), ewretd = c(0.035468, -0.057155, 
-0.080468, -0.108911, -0.025732, 0.005359, 0.045675, -0.028117, 
0.021315, 0.015434, -0.046408, -0.012375, -0.0058, -0.049934, 
0.005532, 0.018626, 0.031017, -0.007744, 0.025054, 0.029089, 
0.01806, 0.002988, 0.062124, 0.018872, -0.036484, -0.011485, 
0.016951, -0.025001, 0.000289, 0.047677, -0.017671, 0.014016, 
0.03569, 0.060265, 0.077392, 0.026065, -0.05085, -0.272248, -0.015876, 
0.014544, 0.035123, 0.021487, 0.000573, -0.017709, 0.036283, 
0.074612, 0.117565, -0.034609, -0.006263, 0.023777, -0.059071, 
0.023269, -0.073128), lagewretd = c(-0.004526, 0.035468, -0.057155, 
-0.080468, -0.108911, -0.025732, 0.005359, 0.045675, -0.028117, 
0.021315, 0.015434, -0.046408, -0.012375, -0.0058, -0.049934, 
0.005532, 0.018626, 0.031017, -0.007744, 0.025054, 0.029089, 
0.01806, 0.002988, 0.062124, 0.018872, -0.036484, -0.011485, 
0.016951, -0.025001, 0.000289, 0.047677, -0.017671, 0.014016, 
0.03569, 0.060265, 0.077392, 0.026065, -0.05085, -0.272248, -0.015876, 
0.014544, 0.035123, 0.021487, 0.000573, -0.017709, 0.036283, 
0.074612, 0.117565, -0.034609, -0.006263, 0.023777, -0.059071, 
0.023269)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-53L), .Names = c("cusip", "date", "fyear", "month", "ret", "ewretd", 
"lagewretd"))

dplyr code:

library(dplyr)

# Per cusip and fiscal year: fit the model and store the sum of the two slopes
res1 <- tdata %>%  
  group_by(cusip, fyear) %>% 
  arrange(desc(date)) %>% 
  mutate(n=n()) %>%
  do(data.frame(., beta=ifelse(.$n > 2,
   sum(coef(lm(ret~ewretd+lagewretd, data=.))[-1]), NA)))

Update 2: April 13, 2015

Here is a for loop that I think would do the trick, but again, for loops in R are not the most efficient solution.

tdata$beta <- NA_real_
for (i in unique(tdata$cusip)) {
  for (j in unique(as.integer(tdata$fyear))) {
    # rows for this cusip in fiscal year j and the four years before it
    check <- tdata %>%
      filter(cusip == i, as.integer(fyear) %in% (j - 4):j) %>%
      arrange(desc(date))
    if (nrow(check) < 24) next   # fewer than 24 months: beta stays NA
    check <- head(check, 60)     # keep at most the 60 most recent months
    # ret is stored as text in the dput above, hence as.numeric()
    fit <- lm(as.numeric(ret) ~ ewretd + lagewretd, data = check)
    tdata$beta[tdata$cusip == i & tdata$fyear == j] <- sum(coef(fit)[-1])
  }
}

Update 3: full sample set

This includes all obs and is fairly large (323 MB)

Full Sample

【Question discussion】:

    Tags: r dplyr


    【Solution 1】:

    In the long run, you will probably want to use proper dates. I took a small step in that direction by converting fyear from character to integer.

    library(dplyr)
    
    ## convert fyear to a proper number and then exploit for sorting
    tdata <- tdata %>%
      mutate(fyear = fyear %>% as.integer) %>%
      arrange(fyear, month)
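
Going one step further toward proper dates could look like the sketch below. It is an illustration only: the three sample values are taken from the `date` column of the posted data, while `months_between` is a hypothetical helper name, not something from the answer. It uses base R only.

```r
# A few integer dates in the same yyyymmdd form as tdata$date
date_int <- c(19901130L, 19900228L, 19880229L)

# as.Date() needs a character string plus a format for this layout
date_real <- as.Date(as.character(date_int), format = "%Y%m%d")

# Year and month as numbers, derived from the Date rather than stored as text
yr <- as.integer(format(date_real, "%Y"))
mo <- as.integer(format(date_real, "%m"))

# Hypothetical helper: whole calendar months elapsed between two dates,
# which is the quantity a 24-60 month window check needs
months_between <- function(from, to) {
  12 * (as.integer(format(to, "%Y")) - as.integer(format(from, "%Y"))) +
    (as.integer(format(to, "%m")) - as.integer(format(from, "%m")))
}

months_between(date_real[3], date_real[1])  # Feb 1988 to Nov 1990: 33
```

With real dates, the window can be defined by elapsed months instead of row counts, which matters if some months are missing for a cusip.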
    

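    Then I summarized the tbl at the level of fyear, counting how many cumulative months of data you have available for fitting a model. (I'm carrying cusip along, but since your data contains only one cusip, I can't be sure that all works properly.)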

    ## figure out cumulative months available for each year (for each cusip)
    yearstuff <- tdata %>%  
      group_by(cusip, fyear) %>% 
      summarize(n = n()) %>% 
      mutate(n_cum = cumsum(n))
    yearstuff
    # Source: local data frame [5 x 4]
    # Groups: cusip
    # 
    #   cusip fyear  n n_cum
    # 1   101  1986  6     6
    # 2   101  1987 12    18
    # 3   101  1988 12    30
    # 4   101  1989 12    42
    # 5   101  1990 11    53
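
The cumulative counts can be turned into an explicit eligibility flag before any model is fitted. The sketch below is an assumption-laden illustration: it rebuilds the counts from the output above by hand, and the column names `fit` and `n_used` are hypothetical. It uses base R (`ave()`) rather than dplyr so it runs without extra packages.

```r
# Per-year row counts as shown in the yearstuff output above
yearstuff <- data.frame(cusip = 101L,
                        fyear = 1986:1990,
                        n     = c(6L, 12L, 12L, 12L, 11L))

# Cumulative months per cusip (ave() applies cumsum within each cusip)
yearstuff$n_cum <- ave(yearstuff$n, yearstuff$cusip, FUN = cumsum)

# Fit only with at least 24 months; never use more than 60
yearstuff$fit    <- yearstuff$n_cum >= 24
yearstuff$n_used <- ifelse(yearstuff$fit, pmin(yearstuff$n_cum, 60L), 0L)
yearstuff
```

For this cusip, 1986 and 1987 are flagged as not fittable, and the later years would use 30, 42, and 53 months respectively.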
    

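    I don't think model fitting is a very natural task for dplyr, because it doesn't fit the group_by paradigm well. Instead, I use plyr::ddply() to drive things off yearstuff and pull the data I need for each cusip * fyear combination. If there isn't enough data, I decline to fit the model; if there's too much, I take only the most recent 60 months.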

    ## iterate over rows of yearstuff (for each cusip)
    models <- plyr::ddply(yearstuff, ~ cusip + fyear, function(y) {
      if(y$n_cum < 24) {
        c('(Intercept)' = NA_real_, ewretd = NA_real_, lagewretd = NA_real_)
      } else {
        my_dat <- tdata %>%
          filter(cusip == y$cusip, fyear <= y$fyear) %>%
          mutate(rn = row_number(desc(date)))
        lm(ret ~ ewretd + lagewretd, my_dat, subset = rn < 61) %>% coef
      }
    })
    models
    #   cusip fyear (Intercept)   ewretd  lagewretd
    # 1   101  1986          NA       NA         NA
    # 2   101  1987          NA       NA         NA
    # 3   101  1988 -0.01138861 1.614342 0.14885911
    # 4   101  1989  0.02467139 1.878295 0.00598857
    # 5   101  1990  0.02529068 1.900389 0.05766020
    

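    This leaves you with estimated coefficients to use as needed. I think this should extend to multiple cusips, but who knows? Also, this data set never has more than 60 months. You should definitely spot-check some of these results "by hand"!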
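
To get from these coefficients to the beta described in the question (the sum of the two slopes), something like the following sketch would do. The coefficient values are copied by hand from the `models` output above; the column name `intercept` replaces `(Intercept)` for convenience, and `beta` is the name used in the question. Base R only.

```r
# Coefficients copied from the models output above
models <- data.frame(
  cusip     = 101L,
  fyear     = 1986:1990,
  intercept = c(NA, NA, -0.01138861, 0.02467139, 0.02529068),
  ewretd    = c(NA, NA, 1.614342, 1.878295, 1.900389),
  lagewretd = c(NA, NA, 0.14885911, 0.00598857, 0.05766020)
)

# beta as defined in the question: sum of the two slope coefficients;
# years without a fitted model stay NA
models$beta <- models$ewretd + models$lagewretd
models[, c("cusip", "fyear", "beta")]
```

The per-year beta can then be attached to the monthly rows with `merge()` or `dplyr::left_join()` on `cusip` and `fyear`.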

    【Discussion】:

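    • Thanks... The issue I have with your first step is that a year lacking 24-60 months of prior observations should not be dropped outright, because later years may depend on those observations. That's why the check needs to happen before the regression is applied... If you want to take a crack at it, I've updated the post with the full data set. I'd appreciate it :)
    • By the first step, do you mean the creation of yearstuff or the if part inside ddply()? I think (at least) one of us is misunderstanding the other. I loop over the unique combinations of cusip and fyear and take all the data for that cusip from tdata up to and including fyear. If there is enough, I fit the model (but never use more than 60 months). It makes no sense for a year to be included or excluded globally. It depends on the context.
    • OK, I see what's happening now. With 1.1m obs, this will take a very long time. Is there any way to speed it up? That's why I was working in dplyr.
    • After running on the full data set for a few hours, I got this message: Error in do.ply(i) : task 98081 failed - "0 (non-NA) cases". Any idea? Thanks for your help.
    • Oh, I didn't realize how big the full data set was. You'll want to code very defensively. Write things to file as you go and/or put the lm call inside try, so a single error doesn't leave you with nothing. Then maybe you can wait for plyr to finish. But if this isn't a one-off, speed matters. Maybe you should look for a data.table solution? I wonder whether it can express your rolling use of the monthly data?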
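
The defensive approach suggested in the last comment could be sketched as follows. This is a minimal illustration, assuming a wrapper named `safe_coef` (hypothetical) and made-up toy data; it uses `tryCatch()` from base R, a close relative of `try`, to return `NA` coefficients when `lm()` fails.

```r
# Hypothetical wrapper: return the coefficients, or NAs if lm() fails
safe_coef <- function(d) {
  tryCatch(
    coef(lm(ret ~ ewretd + lagewretd, data = d)),
    error = function(e) {
      c("(Intercept)" = NA_real_, ewretd = NA_real_, lagewretd = NA_real_)
    }
  )
}

# Toy data (made up): one usable group and one with an all-missing response
set.seed(1)
good <- data.frame(ret = rnorm(30), ewretd = rnorm(30), lagewretd = rnorm(30))
bad  <- good
bad$ret <- NA_real_

safe_coef(good)  # three numeric coefficients
safe_coef(bad)   # NAs instead of the "0 (non-NA) cases" error
```

Because both branches return a named vector of the same shape, the wrapper could stand in for the `lm(...) %>% coef` call inside the `ddply()` function without changing the layout of the result.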