【Question title】: R: check and count strings in a vector with group_by, considering the order of appearance of the strings
【Posted】: 2021-02-15 02:17:18
【Question】:

The data is in the following format, and I have to group it by date. For convenience, the dates are shown as plain numbers.

Msg <- c("Errors","Errors", "Start","Stop","Start","Stop","Errors","Errors","Start","Stop",
         "Stop" ,"Start","Errors","Start","Stop","Start" ,"Stop" ,
         "Errors", "Start","Errors","Stop", "Start", "LostControl","LostControl", "Errors",
         "Failed", "Stop", "Start","Failed","Stop","Stop","Start","Stop","Start","Error","Start",
         "Failed", "Stop")
Date <- c(11,11,11,11,11,11,11,12,12,12,12,12,12,14,14,14,14, 19,19,19,19,
        20,20,20,20,20,20,21,21,21,21,22,22,22,22,22,22,22)
data <- data.frame(Msg, Date)

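I need to count the number of failures in each START-STOP cycle, summarised by date.
The data has three types of messages. Errors and Failed are the two failure messages, whereas LostControl is not a failure. The condition is that a Failed message only counts if there is no LostControl message in that START-STOP cycle: it is a failure if it is preceded only by Errors. Also, if only "Errors" messages are found in a cycle, that is not counted as a failure either.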

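Edit: In the Msg vector, if two Starts or two Stops are found in a row, the START-STOP cycle runs from the outermost Start to the outermost Stop. A Start that is not followed by a Stop is ignored.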
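The cycle rule in the edit above can be illustrated with a small base R sketch. It is not taken from the answer below: `cycle_bounds` is a hypothetical helper name, and the approach (collapsing repeated anchors with `rle`) is just one assumed way to express "outermost Start to outermost Stop".

```r
# Hypothetical helper (not from the original post): find the first and last
# position of every complete Start-Stop cycle in a message vector.
cycle_bounds <- function(x, bgn = "Start", end = "Stop") {
  x <- as.character(x)
  anchor_idx <- which(x %in% c(bgn, end))    # positions of Start/Stop messages
  runs <- rle(x[anchor_idx])                 # collapse repeated Starts / Stops
  run_last <- cumsum(runs$lengths)           # last anchor of each run
  run_first <- run_last - runs$lengths + 1L  # first anchor of each run
  # A cycle is a run of Starts immediately followed by a run of Stops.
  # A trailing Start run is compared against NA and dropped by which().
  k <- which(runs$values == bgn & c(runs$values[-1L], NA) == end)
  data.frame(
    from = anchor_idx[run_first[k]],     # outermost (first) Start
    to   = anchor_idx[run_last[k + 1L]]  # outermost (last) Stop
  )
}

msgs <- c("Errors", "Start", "Start", "Failed", "Stop", "Stop", "Start", "Errors")
bounds <- cycle_bounds(msgs)
print(bounds)
```

For this vector the only complete cycle spans positions 2 to 6; the final Start at position 7 has no Stop after it and is ignored.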

Edit: one row was added: (Msg = Stop, Date = 20).

【Question comments】:

    Tags: r dataframe group-by count summarize


    【Solution 1】:

    We can adapt the function I wrote yesterday in your other post.

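    between_valid_anchors <- function(x, bgn = "Start", end = "Stop") {
      # Keep only the anchor messages (Start/Stop) and remember their positions.
      are_anchors <- x %in% c(bgn, end)
      xid <- seq_along(x)
      id <- xid[are_anchors]
      x <- x[are_anchors]
      # A cycle opens at a Start that is the first anchor or directly follows a Stop.
      start_pos <- id[which(x == bgn & c("", head(x, -1L)) %in% c("", end))]
      # A cycle closes at a Stop that is the last anchor or directly precedes a Start.
      stop_pos <- id[which(x == end & c(tail(x, -1L), "") %in% c("", bgn))]
      # Without at least one opening Start and one closing Stop, nothing is inside a cycle.
      if (length(start_pos) < 1L || length(stop_pos) < 1L)
        return(logical(length(xid)))
      # TRUE for every position covered by a start:stop range.
      xid %in% unlist(mapply(`:`, start_pos, stop_pos))
    }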
    

    Then simply:

    library(dplyr)
    
    data %>% 
      group_by(Date) %>% 
      filter(between_valid_anchors(Msg)) %>% 
      summarise(Msg = sum(Msg %in% c("Err", "Errors", "Failed")))
    

    Output

    `summarise()` ungrouping output (override with `.groups` argument)
    # A tibble: 6 x 2
       Date   Msg
      <dbl> <int>
    1    11     0
    2    12     0
    3    14     0
    4    19     1
    5    21     1
    6    22     2
    

    Update

    You can add one more filter to keep only the messages of interest (i.e. Start, Stop, Failed, LostControl). Then just sum every Msg == "Failed" whose preceding message, lag(Msg), is not "LostControl".

    library(dplyr)
    
    data %>% 
      group_by(Date) %>% 
      filter(between_valid_anchors(Msg)) %>% 
      filter(Msg %in% c("Start", "Stop", "Failed", "LostControl")) %>% 
      summarise(Msg = sum(Msg == "Failed" & lag(Msg, default = "") != "LostControl"))
    

    Output

    `summarise()` ungrouping output (override with `.groups` argument)
    # A tibble: 7 x 2
       Date   Msg
      <dbl> <int>
    1    11     0
    2    12     0
    3    14     0
    4    19     0
    5    20     0
    6    21     1
    7    22     1
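To see why the extra filter matters for the `lag` comparison, here is a minimal base R sketch of the same two steps applied to the messages of Date 20 and Date 21 from the question. `lag_chr` and `count_failed_lag` are hypothetical names introduced only for this illustration, and the sketch assumes the messages already lie inside a valid Start-Stop cycle.

```r
# Base R stand-in for dplyr::lag(x, default = ""): shift the vector right by one.
lag_chr <- function(x) c("", head(x, -1L))

# Hypothetical helper mirroring the two steps of the update for one day's
# messages that already lie inside a valid Start-Stop cycle.
count_failed_lag <- function(msg) {
  # Step 1: drop everything except the messages of interest (Errors is removed).
  kept <- msg[msg %in% c("Start", "Stop", "Failed", "LostControl")]
  # Step 2: count Failed messages whose previous kept message is not LostControl.
  sum(kept == "Failed" & lag_chr(kept) != "LostControl")
}

day20 <- c("Start", "LostControl", "LostControl", "Errors", "Failed", "Stop")
day21 <- c("Start", "Failed", "Stop", "Stop")

n20 <- count_failed_lag(day20)  # Errors is dropped, so Failed directly follows LostControl
n21 <- count_failed_lag(day21)  # Failed directly follows Start
print(c(day20 = n20, day21 = n21))
```

This gives 0 for Date 20 and 1 for Date 21, matching the table above. Note that only the immediately preceding kept message is compared, so in a sequence such as Start, LostControl, Failed, Failed, Stop the second Failed would still be counted.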
    

    【Comments】:

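    • Thanks :) I made a slight change to the final summarise, though, because I only want to count Failed. So I modified it to `%>% summarise(Msg = sum(Msg %in% c("Failed")))`. Thanks again.
    • One more small issue. If a Failed message appears after a LostControl message, it should not be counted as a failure. In other words, the order of appearance matters. Is there a way to take this into account? I added a row to the data frame to show this case (Date 20, last row added).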
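One possible way to make the order of appearance explicit, as asked in the last comment, is to flag everything that comes after the first LostControl in a cycle. This is only a sketch under stated assumptions: `count_failed_before_lost` is a hypothetical helper, not the answerer's code, and it expects the messages of a single Start-Stop cycle that has already been isolated.

```r
# Hypothetical helper (an assumption, not the answerer's code): count Failed
# messages in ONE Start-Stop cycle, ignoring any Failed that comes after a
# LostControl anywhere earlier in that cycle.
count_failed_before_lost <- function(cycle_msgs) {
  # TRUE from the first LostControl onwards, FALSE before it.
  lost_seen <- cumsum(cycle_msgs == "LostControl") > 0
  sum(cycle_msgs == "Failed" & !lost_seen)
}

n_after  <- count_failed_before_lost(c("Start", "LostControl", "Errors", "Failed", "Stop"))
n_before <- count_failed_before_lost(c("Start", "Failed", "LostControl", "Stop"))
n_twice  <- count_failed_before_lost(c("Start", "LostControl", "Failed", "Failed", "Stop"))
print(c(after = n_after, before = n_before, twice = n_twice))
```

A Failed that precedes LostControl is counted (1), while any Failed after it is not (0), regardless of how many other messages sit in between.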