[Question title]: Forward fill a column's values by group after a specific value in another column
[Posted]: 2020-07-23 21:01:05
[Question description]:

I have a data table that looks like this:

> dt
   FundId Period FundAssets
1       a 200601          0
2       a 200602          0
3       a 200603          0
4       a 200604   40000000
5       a 200605   45000000
6       a 200606   48000000
7       a 200607   52000000
8       a 200608   55000000
9       a 200609   57000000
10      a 200610   49000000
11      a 200611   16000000
12      a 200612    1500000
13      b 200601          0
14      b 200602          0
15      b 200603          0
16      b 200604   58000000
17      b 200605   24000000
18      b 200606   16000000
19      b 200607   57000000
20      b 200608          0
21      b 200609          0
22      b 200610          0
23      b 200611          0
24      b 200612          0
25      c 200601   57000000
26      c 200602   65000000
27      c 200603   70000000
28      c 200604   70000000
29      c 200605   78000000
30      c 200606   43000000
31      c 200607   56000000
32      c 200608   33000000
33      c 200609   23000000
34      c 200610   21000000
35      c 200611   24000000
36      c 200612   23000000

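The real table has more columns and more Period values, but these are the columns that matter for this question. I am trying to create a new column that flags whether a fund exists and has reached 50,000,000 in assets.

My idea is to do this with two columns: check1 and check2. check1 marks the periods in which each fund has 50,000,000 or more in assets. I have this part working: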

dt[, check1 := FundAssets >= 50000000]

> dt
    FundId Period FundAssets check1
 1:      a 200601          0  FALSE
 2:      a 200602          0  FALSE
 3:      a 200603          0  FALSE
 4:      a 200604   40000000  FALSE
 5:      a 200605   45000000  FALSE
 6:      a 200606   48000000  FALSE
 7:      a 200607   52000000   TRUE
 8:      a 200608   55000000   TRUE
 9:      a 200609   57000000   TRUE
10:      a 200610   49000000  FALSE
11:      a 200611   16000000  FALSE
12:      a 200612    1500000  FALSE
13:      b 200601          0  FALSE
14:      b 200602          0  FALSE
15:      b 200603          0  FALSE
16:      b 200604   58000000   TRUE
17:      b 200605   24000000  FALSE
18:      b 200606   16000000  FALSE
19:      b 200607   57000000   TRUE
20:      b 200608          0  FALSE
21:      b 200609          0  FALSE
22:      b 200610          0  FALSE
23:      b 200611          0  FALSE
24:      b 200612          0  FALSE
25:      c 200601   57000000   TRUE
26:      c 200602   65000000   TRUE
27:      c 200603   70000000   TRUE
28:      c 200604   70000000   TRUE
29:      c 200605   78000000   TRUE
30:      c 200606   43000000  FALSE
31:      c 200607   56000000   TRUE
32:      c 200608   33000000  FALSE
33:      c 200609   23000000  FALSE
34:      c 200610   21000000  FALSE
35:      c 200611   24000000  FALSE
36:      c 200612   23000000  FALSE

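check2 should be TRUE from the first TRUE in check1 onward, for as long as FundAssets > 0. However, I am having trouble filling the remaining TRUE values into that column. Essentially, the final dt should look like this: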

> dt
    FundId Period FundAssets check1 check2
 1:      a 200601          0  FALSE  FALSE
 2:      a 200602          0  FALSE  FALSE
 3:      a 200603          0  FALSE  FALSE
 4:      a 200604   40000000  FALSE  FALSE
 5:      a 200605   45000000  FALSE  FALSE
 6:      a 200606   48000000  FALSE  FALSE
 7:      a 200607   52000000   TRUE   TRUE
 8:      a 200608   55000000   TRUE   TRUE
 9:      a 200609   57000000   TRUE   TRUE
10:      a 200610   49000000  FALSE   TRUE
11:      a 200611   16000000  FALSE   TRUE
12:      a 200612    1500000  FALSE   TRUE
13:      b 200601          0  FALSE  FALSE
14:      b 200602          0  FALSE  FALSE
15:      b 200603          0  FALSE  FALSE
16:      b 200604   58000000   TRUE   TRUE
17:      b 200605   24000000  FALSE   TRUE
18:      b 200606   16000000  FALSE   TRUE
19:      b 200607   57000000   TRUE   TRUE
20:      b 200608          0  FALSE  FALSE
21:      b 200609          0  FALSE  FALSE
22:      b 200610          0  FALSE  FALSE
23:      b 200611          0  FALSE  FALSE
24:      b 200612          0  FALSE  FALSE
25:      c 200601   57000000   TRUE   TRUE
26:      c 200602   65000000   TRUE   TRUE
27:      c 200603   70000000   TRUE   TRUE
28:      c 200604   70000000   TRUE   TRUE
29:      c 200605   78000000   TRUE   TRUE
30:      c 200606   43000000  FALSE   TRUE
31:      c 200607   56000000   TRUE   TRUE
32:      c 200608   33000000  FALSE   TRUE
33:      c 200609   23000000  FALSE   TRUE
34:      c 200610   21000000  FALSE   TRUE
35:      c 200611   24000000  FALSE   TRUE
36:      c 200612   23000000  FALSE   TRUE

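That way, I can tell whether a fund exists and has reached $50,000,000 in assets at some point in its history by checking whether check1 or check2 is TRUE in any given period.

It would also be fine to fill the rest of check1 with TRUE and eliminate the need for check2. I have looked at forward-fill functions, but they seem to work only on NA values. Answers using data.table are preferred.

[Question discussion]:

    Tags: r data.table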


    [Solution 1]:

    One option is cummax, combined with a logical index in i:

    library(data.table)
    setDT(dt)[, check1 := FundAssets >= 50000000 # flag periods at or above the threshold
           ][, check2 := FALSE][ # initialise check2 as FALSE
          FundAssets != 0, # restrict i to rows with non-zero assets
            # grouped by FundId, cummax carries the first TRUE of check1 forward; convert back to logical
             check2 := as.logical(cummax(check1)), FundId][]
    #   FundId Period FundAssets check1 check2
    # 1:      a 200601          0  FALSE  FALSE
    # 2:      a 200602          0  FALSE  FALSE
    # 3:      a 200603          0  FALSE  FALSE
    # 4:      a 200604   40000000  FALSE  FALSE
    # 5:      a 200605   45000000  FALSE  FALSE
    # 6:      a 200606   48000000  FALSE  FALSE
    # 7:      a 200607   52000000   TRUE   TRUE
    # 8:      a 200608   55000000   TRUE   TRUE
    # 9:      a 200609   57000000   TRUE   TRUE
    #10:      a 200610   49000000  FALSE   TRUE
    #11:      a 200611   16000000  FALSE   TRUE
    #12:      a 200612    1500000  FALSE   TRUE
    #13:      b 200601          0  FALSE  FALSE
    #14:      b 200602          0  FALSE  FALSE
    #15:      b 200603          0  FALSE  FALSE
    #16:      b 200604   58000000   TRUE   TRUE
    #17:      b 200605   24000000  FALSE   TRUE
    #18:      b 200606   16000000  FALSE   TRUE
    #19:      b 200607   57000000   TRUE   TRUE
    #20:      b 200608          0  FALSE  FALSE
    #21:      b 200609          0  FALSE  FALSE
    #22:      b 200610          0  FALSE  FALSE
    #23:      b 200611          0  FALSE  FALSE
    #24:      b 200612          0  FALSE  FALSE
    #25:      c 200601   57000000   TRUE   TRUE
    #26:      c 200602   65000000   TRUE   TRUE
    #27:      c 200603   70000000   TRUE   TRUE
    #28:      c 200604   70000000   TRUE   TRUE
    #29:      c 200605   78000000   TRUE   TRUE
    #30:      c 200606   43000000  FALSE   TRUE
    #31:      c 200607   56000000   TRUE   TRUE
    #32:      c 200608   33000000  FALSE   TRUE
    #33:      c 200609   23000000  FALSE   TRUE
    #34:      c 200610   21000000  FALSE   TRUE
    #35:      c 200611   24000000  FALSE   TRUE
    #36:      c 200612   23000000  FALSE   TRUE
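
    As a minimal sketch of why cummax works here, and of the single-column variant the question mentions as acceptable: the table toy and the column name flag below are made up for illustration and are not part of the original answer. cummax on a logical vector coerces it to integer, so the running maximum stays at 1 once the first TRUE has been seen.

```r
library(data.table)

# Hypothetical toy data, not the asker's table
toy <- data.table(
  FundId     = c("x", "x", "x", "x", "x", "y", "y", "y"),
  FundAssets = c(0, 60e6, 10e6, 0, 5e6, 20e6, 30e6, 0)
)

# Running maximum of a logical vector: coerced to integer,
# it switches to 1 at the first TRUE and never drops back
running <- cummax(c(FALSE, TRUE, FALSE, FALSE))

# Single-column variant: TRUE once the fund has reached the threshold
# in the current or an earlier period, but only while assets are non-zero
toy[, flag := as.logical(cummax(FundAssets >= 50e6)) & FundAssets > 0,
    by = FundId]
print(toy)
```

    This assumes the rows are already ordered by Period within each FundId; if they are not, sorting first (for example with setorder(toy, FundId, Period), given a Period column) would be needed for the running maximum to follow time.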
    

    Data

    dt <- structure(list(FundId = c("a", "a", "a", "a", "a", "a", "a", 
    "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b", "b", "b", 
    "b", "b", "b", "b", "c", "c", "c", "c", "c", "c", "c", "c", "c", 
    "c", "c", "c"), Period = c(200601L, 200602L, 200603L, 200604L, 
    200605L, 200606L, 200607L, 200608L, 200609L, 200610L, 200611L, 
    200612L, 200601L, 200602L, 200603L, 200604L, 200605L, 200606L, 
    200607L, 200608L, 200609L, 200610L, 200611L, 200612L, 200601L, 
    200602L, 200603L, 200604L, 200605L, 200606L, 200607L, 200608L, 
    200609L, 200610L, 200611L, 200612L), FundAssets = c(0L, 0L, 0L, 
    40000000L, 45000000L, 48000000L, 52000000L, 55000000L, 57000000L, 
    49000000L, 16000000L, 1500000L, 0L, 0L, 0L, 58000000L, 24000000L, 
    16000000L, 57000000L, 0L, 0L, 0L, 0L, 0L, 57000000L, 65000000L, 
    70000000L, 70000000L, 78000000L, 43000000L, 56000000L, 33000000L, 
    23000000L, 21000000L, 24000000L, 23000000L)), class = "data.frame", 
    row.names = c("1", 
    "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12", "13", 
    "14", "15", "16", "17", "18", "19", "20", "21", "22", "23", "24", 
    "25", "26", "27", "28", "29", "30", "31", "32", "33", "34", "35", 
    "36"))
    

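    [Discussion]:

    • Thanks, this worked for me and created the column I was looking for! However, I have now run into a problem. When I perform a new operation on the data.table, all the values in this column change back to FALSE. I need to count the rows where this column is TRUE and another column is FALSE, but the result is 0 because all the check2 values end up FALSE. The count I am using is dt[otherColumn==FALSE & check2==TRUE, .N, by=Period]. Is there any way to lock this column or otherwise fix this?
    • @Wahoo Could you post this as a new question?
    • I ended up making a copy of dt, which solved the problem. Thanks for your help!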
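
    The thread does not show which operation reset check2, so the actual cause is unknown. The sketch below only illustrates the general data.table behaviour that makes copy() relevant: := modifies a table by reference, and plain assignment with <- creates a second name for the same table rather than a new one. The names orig, alias, and snap are hypothetical.

```r
library(data.table)

orig <- data.table(id = 1:3, check2 = c(TRUE, FALSE, TRUE))

alias <- orig        # plain assignment: both names refer to the same table
snap  <- copy(orig)  # copy() creates an independent table

# := updates the existing column by reference, so orig changes as well
alias[, check2 := FALSE]

print(orig$check2)  # affected by the update made through alias
print(snap$check2)  # unaffected, because snap is a separate copy
```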