[Question title]: Using data.table to identify all event occurrences, picking only the first occurrence in each consecutive run
[Posted]: 2016-09-06 02:13:03
[Question description]:

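I am trying to identify every occurrence of an event and, when the event repeats consecutively, keep only the first one. I can flag the event and add a count, but I cannot get the count to reset after the event changes.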

My data has about 1M rows across 30-odd IDs. I have included only one ID here. The table has ID, date-time, and status columns.

Status is the event and can take several values: A, B, C, ... The event I care about is B.

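I want to add three columns:

Occurrence_B - a flag that is 1 when the event is B

Count_B - a running count of consecutive rows with event = B, reset whenever the event changes

Include_B - a flag showing whether the row is the first occurrence in a run ("new") or a continuation ("continued")

I will then subset the data on Include_B == 'new' to pick the first match in each run.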

ID  Date              Status  Occurrence_B  Count_B  Include_B
A   7/28/15 12:00 AM  A       0             0        0
A   7/28/15 12:30 AM  A       0             0        0
A   7/30/15 12:00 AM  B       1             1        new
A   7/31/15 12:00 AM  B       1             2        continued
A   7/31/15 11:00 AM  B       1             3        continued
A   8/2/15 10:00 AM   B       0             0        0
A   8/3/15 12:00 AM   C       0             0        0
A   8/4/15 12:00 AM   B       1             1        new
A   8/5/15 12:00 AM   B       1             2        continued
A   8/6/15 12:00 AM   A       1             0        continued
A   8/7/15 12:00 AM   B       1             1        new

My sample code:

d1[, Occurrence_B:=Status %in% c('B')+0L]

d1[, Count_B := cumsum(Occurrence_B), by=.(ID,Status)]

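The problem is that I don't know how to reset Count_B once the event changes. I am still looking into it, but I am new to data.table, so any help is much appreciated.

Let me know if you have any questions.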

SK
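To see why the count in the attempt above never resets, here is a minimal sketch on a hypothetical single-ID sample that mirrors the Status sequence in the question (the column names `Count_by_status` and `Run` are illustrative, not from the original). Grouping by `.(ID, Status)` puts all B rows into one group regardless of position, whereas `rleid()` gives each consecutive run its own id.

```r
library(data.table)

# Hypothetical single-ID sample mirroring the Status sequence in the question
d1 <- data.table(
  ID     = "A",
  Status = c("A", "A", "B", "B", "B", "B", "C", "B", "B", "A", "B")
)

d1[, Occurrence_B := as.integer(Status == "B")]

# Grouping by (ID, Status) collects every B row into a single group,
# so the cumulative sum keeps growing across separate runs of B
d1[, Count_by_status := cumsum(Occurrence_B), by = .(ID, Status)]

# rleid() starts a new id each time the value changes, which identifies the runs
d1[, Run := rleid(Status)]

print(d1)
```

Grouping by a run id instead of by Status is what lets a cumulative count restart at each new run.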

[Question comments]:

    Tags: r data.table cumsum


    [Solution 1]:

    You could try something like this:

    # Create the Occurrence_B column and initialize Include_B as NA
    (d1[, `:=` (Occurrence_B = as.integer(Status == "B"), Include_B = NA_character_)]
    
      # Compute Count_B, using rleid(Occurrence_B) as the grouping variable so that
      # consecutive identical values fall into the same group
      [, Count_B := cumsum(Occurrence_B), by = rleid(Occurrence_B)]
    
      # Update Include_B in place based on Count_B: Count_B == 1 marks the first row of a
      # run ("new"), Count_B > 1 marks a continuation, and all other rows stay NA
      [Count_B == 1, Include_B := "new"][Count_B > 1, Include_B := "continued"][])
    
    # ID                Date Status Occurrence_B Count_B Include_B
    # 1:  A 7/28/15 12:00 AM      A            0       0        NA
    # 2:  A 7/28/15 12:30 AM      A            0       0        NA
    # 3:  A 7/30/15 12:00 AM      B            1       1       new
    # 4:  A 7/31/15 12:00 AM      B            1       2 continued
    # 5:  A 7/31/15 11:00 AM      B            1       3 continued
    # 6:  A  8/2/15 10:00 AM      B            1       4 continued
    # 7:  A  8/3/15 12:00 AM      C            0       0        NA
    # 8:  A  8/4/15 12:00 AM      B            1       1       new
    # 9:  A  8/5/15 12:00 AM      B            1       2 continued
    #10:  A  8/6/15 12:00 AM      A            0       0        NA
    #11:  A  8/7/15 12:00 AM      B            1       1       new
    
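With several IDs, as in the asker's full data, a run of B at the end of one ID could merge with a run of B at the start of the next if the grouping uses `rleid(Occurrence_B)` alone. The following is a minimal sketch, assuming a hypothetical two-ID sample (ID "Z" and the names `dt` and `first_B` are made up for illustration), that adds `ID` to the grouping and then takes the subset the question asks for. It uses `fifelse()`, which requires a reasonably recent data.table version.

```r
library(data.table)

# Hypothetical two-ID sample: ID "A" ends with B and ID "Z" starts with B
dt <- data.table(
  ID     = c("A", "A", "A", "Z", "Z", "Z"),
  Status = c("A", "B", "B", "B", "B", "C")
)

dt[, Occurrence_B := as.integer(Status == "B")]

# Grouping by ID as well as the run id keeps a run from spilling across IDs
dt[, Count_B := cumsum(Occurrence_B), by = .(ID, rleid(Occurrence_B))]

# Label the first row of each run of B and its continuations; other rows stay NA
dt[, Include_B := fifelse(Count_B == 1L, "new",
                  fifelse(Count_B > 1L, "continued", NA_character_))]

# Keep only the first row of every run of B
first_B <- dt[Include_B == "new"]
print(first_B)
```

Rows where Include_B is NA are dropped by the subset, because data.table treats NA in a logical `i` as FALSE.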

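    [Comments]:

    • You can also get Count_B with DT[, rowid(rleid(Occurrence_B))*Occurrence_B]
    • Thanks, it works :). I need to look into what rleid does, and I also learned the trick of chaining statements off a single reference to the data.table. One more thing I need to do: add a column that flags all records falling within a window of 3 days before and 3 days after each "new" value in the Include_B column.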
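The two comments above can be sketched together. This is one possible approach, not something given in the thread: it computes Count_B with the `rowid(rleid())` variant, then flags rows within 3 days of each "new" row using a non-equi update join. The sample data, the ISO timestamps, and the names `dt`, `windows`, `win_start`, `win_end`, and `In_window` are all assumptions for illustration; the question's dates are in "7/28/15 12:00 AM" style and would need parsing first.

```r
library(data.table)

# Hypothetical sample with ISO timestamps in UTC
dt <- data.table(
  ID     = "A",
  Date   = as.POSIXct(c("2015-07-25 00:00:00", "2015-07-28 00:00:00",
                        "2015-07-30 00:00:00", "2015-07-31 00:00:00",
                        "2015-08-05 00:00:00", "2015-08-10 00:00:00"), tz = "UTC"),
  Status = c("A", "A", "B", "B", "C", "A")
)
dt[, Occurrence_B := as.integer(Status == "B")]

# Variant from the comment: position within each run, zeroed for non-B rows
dt[, Count_B := rowid(rleid(Occurrence_B)) * Occurrence_B, by = ID]
dt[, Include_B := fifelse(Count_B == 1L, "new",
                  fifelse(Count_B > 1L, "continued", NA_character_))]

# One window per "new" row: 3 days either side (POSIXct arithmetic is in seconds)
three_days <- 3 * 24 * 60 * 60
windows <- dt[Include_B == "new",
              .(ID, win_start = Date - three_days, win_end = Date + three_days)]

# Non-equi update join: flag rows whose Date falls inside a window of the same ID
dt[, In_window := FALSE]
dt[windows, on = .(ID, Date >= win_start, Date <= win_end), In_window := TRUE]

print(dt)
```

The window bounds are inclusive here; whether rows exactly 3 days away should count is a choice the comment does not specify.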