Comparing consecutive rows and selecting rows where the subsequent row is a specific value: answers

【Question title】: Comparing consecutive rows and select rows where are subsequent is a specific value
【Posted】: 2016-06-26 04:54:07
【Question description】:

I have a data frame like the following:

structure(list(HospNum_Id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L), VisitDate = c("13/02/03", "13/04/05", "13/05/12", 
"13/12/06", "13/04/12", "13/05/13", "13/06/14", "13/04/15", "03/04/15", 
"04/05/16", "04/06/16"), EVENT = c("EMR", "RFA", "nothing", "nothing", 
"EMR", "nothing", "EMR", "EMR", "RFA", "EMR", "nothing")), .Names = c("HospNum_Id", 
"VisitDate", "EVENT"), class = "data.frame", row.names = c(NA, 
-11L))

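I want to select only the rows where EVENT in the current row is "EMR" and, for each HospNum_Id, the row immediately before it (in ascending date order) is "nothing".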

My desired output is:

 HospNum_Id VisitDate EVENT
    2   13/12/06    nothing
    2   13/04/12    EMR
    2   13/05/13    nothing
    2   13/06/14    EMR

But my current output is:

  HospNum_Id VisitDate EVENT
       (int)     (chr) (chr)
1          2  13/04/12   EMR
2          2  13/06/14   EMR
3          2  13/04/15   EMR

Currently I have the following code, but I think it is letting me down because I am using first in the filter rather than something that means "before the row that has EMR in the EVENT":

Upstaging<-Therap %>% 
  arrange(HospNum_Id, as.Date(Therap$VisitDate, '%d/%m/%y')) %>% 
  group_by(HospNum_Id) %>% 
  filter(first(EVENT == "nothing") & EVENT == "EMR")
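
A minimal sketch of what first() evaluates to here, assuming the data frame above is stored as Therap and dplyr is installed. The column name first_is_nothing is made up for illustration. first() returns a single value per group, so the condition asks whether the first visit of each HospNum_Id is "nothing", not whether the row before each EMR row is:

```r
library(dplyr)

# The data frame from the question, stored as Therap (name taken from the code above)
Therap <- data.frame(
  HospNum_Id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  VisitDate = c("13/02/03", "13/04/05", "13/05/12", "13/12/06", "13/04/12",
                "13/05/13", "13/06/14", "13/04/15", "03/04/15", "04/05/16",
                "04/06/16"),
  EVENT = c("EMR", "RFA", "nothing", "nothing", "EMR", "nothing", "EMR",
            "EMR", "RFA", "EMR", "nothing"),
  stringsAsFactors = FALSE
)

# One logical value per group: is the earliest visit of this patient "nothing"?
flags <- Therap %>%
  arrange(HospNum_Id, as.Date(VisitDate, format = "%d/%m/%y")) %>%
  group_by(HospNum_Id) %>%
  summarise(first_is_nothing = first(EVENT == "nothing"))

print(flags)
```

Only HospNum_Id 2 starts with "nothing", so the filter keeps every EMR row of that patient, which is consistent with the three rows in the current output.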

【Question discussion】:

Tags: r

【Solution 1】:

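We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)). Grouped by 'HospNum_Id', we get the index ('i1') where 'EVENT' is "EMR" and the previous value is "nothing". We use that index to also get the index of the preceding element ('i1-1'), sort the two together, and get the row index (.I). With this, we subset the rows.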

library(data.table)
v1 <- setDT(df1)[,  {i1 <- which(EVENT == "EMR" & shift(EVENT)=="nothing")
              .I[sort(c(i1, i1-1))] } , by = HospNum_Id]$V1
df1[v1]
#   HospNum_Id VisitDate   EVENT
#1:          2  13/12/06 nothing
#2:          2  13/04/12     EMR
#3:          2  13/05/13 nothing
#4:          2  13/06/14     EMR

Or a similar approach using dplyr.

library(dplyr)
df1 %>%
    group_by(HospNum_Id) %>% 
    mutate(ind = EVENT=="nothing" & lead(EVENT)=="EMR") %>% 
    slice(sort(c(which(ind),which(ind)+1))) %>% 
    select(-ind)
#   HospNum_Id VisitDate   EVENT   
#      <int>     <chr>   <chr>
#1          2  13/12/06 nothing
#2          2  13/04/12     EMR
#3          2  13/05/13 nothing
#4          2  13/06/14     EMR
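
The same idea can also be sketched as a single filter() with lag() and lead(), without a helper column or slice(). This is an illustrative variant rather than the answer's own code. It assumes VisitDate is in day/month/two-digit-year form, and the temporary column visit is a name introduced here only for sorting:

```r
library(dplyr)

df1 <- data.frame(
  HospNum_Id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  VisitDate = c("13/02/03", "13/04/05", "13/05/12", "13/12/06", "13/04/12",
                "13/05/13", "13/06/14", "13/04/15", "03/04/15", "04/05/16",
                "04/06/16"),
  EVENT = c("EMR", "RFA", "nothing", "nothing", "EMR", "nothing", "EMR",
            "EMR", "RFA", "EMR", "nothing"),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  # Parse the dates so each patient's visits can be sorted chronologically
  mutate(visit = as.Date(VisitDate, format = "%d/%m/%y")) %>%
  arrange(HospNum_Id, visit) %>%
  group_by(HospNum_Id) %>%
  filter(
    # an EMR row whose previous row (same patient) is "nothing" ...
    (EVENT == "EMR" & lag(EVENT, default = "") == "nothing") |
      # ... or a "nothing" row whose next row (same patient) is EMR
      (EVENT == "nothing" & lead(EVENT, default = "") == "EMR")
  ) %>%
  ungroup() %>%
  select(-visit)

print(result)
```

The default = "" arguments stop the first and last row of each group from producing NA comparisons, and ungroup() is called before the temporary column is dropped.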

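【Discussion】:

OK. I would rather use dplyr as I understand it better. The last line in the dplyr example throws the error Error: corrupt 'grouped_df', contains 4 rows, and 11 rows in groups, but without %>% select(-ind) I seem to get what I want. What does the last line add?
@SebastianZeki Do you mean the slice line or the select line? I did not get any error. The select line removes the 'ind' column.
OK. I meant the select line, but I think I will leave it out. As a final addendum: what if I want the result whenever "nothing" occurs anywhere before the EMR for a HospNum_Id (rather than only in the row immediately before)? How would I modify this? If I need to post it as a separate question, no problem.
@SebastianZeki The error could be due to a version difference or the structure of the dataset. I think it would be better to ask that as a separate question.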

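【Solution 2】:

The desired result can be obtained using base operations alone.

Step 1. Load the data.

Step 2. Sort the data frame by date in ascending order.

Step 3. Select the rows with EVENT == "EMR" into one data frame, and create a second data frame containing the preceding rows.

Step 4. Remove duplicates and sort by date.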

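a <- df1  # the loaded data frame (assumed here to be df1, as in Solution 1)
# Sort by visit date; the years have two digits, so the format is %y, not %Y
a <- a[order(as.Date(a$VisitDate, format = "%d/%m/%y")), , drop = FALSE]
# Reverse the row order
revdf <- a[rev(rownames(a)), ]
# Rows whose EVENT is "EMR"
b <- revdf[which(revdf$EVENT == "EMR"), ]
# Rows that sit just above each EMR row in the reversed frame
c <- revdf[which(revdf$EVENT == "EMR") - 1, ]
d <- rbind(b, c)
# Drop rows that were picked up twice
e <- d[!duplicated(d), ]
# Sort by date again, then reverse
f <- e[order(as.Date(e$VisitDate, format = "%d/%m/%y")), , drop = FALSE]
revdf1 <- f[rev(rownames(f)), ]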

Output:

  >revdf1
        HospNum_Id  VisitDate  EVENT
   11          3  04/06/16 nothing
   10          3  04/05/16     EMR
   8           2  13/04/15     EMR
   9           3  03/04/15     RFA
   7           2  13/06/14     EMR
   3           1  13/05/12 nothing
   5           2  13/04/12     EMR
   2           1  13/04/05     RFA
   1           1  13/02/03     EMR
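
For comparison, here is a base-R sketch that makes the comparison within each HospNum_Id, assuming the goal is the four-row output requested in the question. The names prev_event, hit and result are introduced here for illustration only:

```r
df1 <- data.frame(
  HospNum_Id = c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  VisitDate = c("13/02/03", "13/04/05", "13/05/12", "13/12/06", "13/04/12",
                "13/05/13", "13/06/14", "13/04/15", "03/04/15", "04/05/16",
                "04/06/16"),
  EVENT = c("EMR", "RFA", "nothing", "nothing", "EMR", "nothing", "EMR",
            "EMR", "RFA", "EMR", "nothing"),
  stringsAsFactors = FALSE
)

# Sort by patient, then by visit date (two-digit years, hence %y)
df1 <- df1[order(df1$HospNum_Id, as.Date(df1$VisitDate, format = "%d/%m/%y")), ]

# EVENT of the previous visit of the same patient; NA for a patient's first visit
prev_event <- ave(df1$EVENT, df1$HospNum_Id,
                  FUN = function(x) c(NA, head(x, -1)))

# Positions of EMR rows whose previous visit (same patient) was "nothing"
hit <- which(df1$EVENT == "EMR" & !is.na(prev_event) & prev_event == "nothing")

# Keep each matching EMR row together with the row just before it
result <- df1[sort(unique(c(hit - 1, hit))), ]
print(result)
```

Because the rows are sorted by HospNum_Id first, position hit - 1 always belongs to the same patient as position hit.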

【Discussion】: