[Question title]: Filter rows based on all previous row data in another column
[Posted]: 2021-08-01 02:33:19
[Question description]:

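I have a data table that I would like to filter using conditions that look at all previous rows. If a value in New_ID.1 appears in an earlier row than the same ID in the New_ID column, the later row (the one whose New_ID equals that earlier New_ID.1) should be deleted. For example, I would delete the row with New_ID 581 (row 3) because 581 appears as New_ID.1 in row 1. However, I do not want to delete row 6 (New_ID 551), because row 3, whose New_ID.1 is 551, will already have been removed. Essentially, I think I need to loop through the rows, create a new filtered table at each step, and repeat the process?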

orig_df<- structure(list(New_ID = c(557L, 588L, 581L, 580L, 591L, 551L, 
300L, 112L), New_ID.1 = c(581L, 591L, 551L, 300L, 112L, 584L, 
416L, 115L), distance = c(3339.15537217173, 3432.33715484179, 
5268.69104753613, 5296.72042763528, 5271.94917463488, 5258.66546295312, 
5286.99982045171, 5277.81914818968), X.x = c(903604.940384474, 
819515.728302034, 903663.550206032, 866828.860223065, 819525.350044447, 
903720.790105847, 866881.654186025, 819585.173276271), Y.x = c(1027706.41509243, 
1026880.34660449, 1024367.77412815, 1023962.99139374, 1023448.02293581, 
1019099.39402149, 1018666.53407908, 1018176.41319296), X.y = c(903663.550206032, 
819525.350044447, 903720.790105847, 866881.654186025, 819585.173276271, 
903801.327345876, 866919.184271939, 819630.672367509), Y.y = c(1024367.77412815, 
1023448.02293581, 1019099.39402149, 1018666.53407908, 1018176.41319296, 
1013841.34531459, 1013379.66746509, 1012898.79016799), Y_filter = c(3338.64096427278, 
3432.32366867992, 5268.38010666054, 5296.45731465891, 5271.60974284587, 
5258.04870690871, 5286.86661398865, 5277.62302497006), X_filter = c(58.609821557533, 
9.62174241337925, 57.2398998149438, 52.7939629601315, 59.8232318238588, 
80.5372400298947, 37.5300859131385, 45.4990912381327), row.number = 1:8), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))

The final result would keep rows 1, 2, 4, 6 and 8 of the original data.

output_table<-structure(list(New_ID = c(557L, 588L, 580L, 551L, 112L), New_ID.1 = c(581L, 
591L, 300L, 584L, 115L), distance = c(3339.15537217173, 3432.33715484179, 
5296.72042763528, 5258.66546295312, 5277.81914818968), X.x = c(903604.940384474, 
819515.728302034, 866828.860223065, 903720.790105847, 819585.173276271
), Y.x = c(1027706.41509243, 1026880.34660449, 1023962.99139374, 
1019099.39402149, 1018176.41319296), X.y = c(903663.550206032, 
819525.350044447, 866881.654186025, 903801.327345876, 819630.672367509
), Y.y = c(1024367.77412815, 1023448.02293581, 1018666.53407908, 
1013841.34531459, 1012898.79016799), Y_filter = c(3338.64096427278, 
3432.32366867992, 5296.45731465891, 5258.04870690871, 5277.62302497006
), X_filter = c(58.609821557533, 9.62174241337925, 52.7939629601315, 
80.5372400298947, 45.4990912381327), row.number = c(1L, 2L, 4L, 
6L, 8L)), row.names = c(NA, -5L), class = c("tbl_df", "tbl", 
"data.frame"))

Below is a simpler version of the problem that may help.

Original data
A|B
C|D
B|E
E|F

Updated data table
A|B
C|D
E|F

[Question comments]:

    Tags: r loops filter dplyr


    [Solution 1]:

    I think it is enough to iterate over the rows while keeping track of the IDs you have already encountered?

    # Convert the tibble to a plain data.frame so [i, 'col'] returns a scalar
    orig_df <- as.data.frame(orig_df)
    included_rows <- rep(FALSE, nrow(orig_df))
    seen_ids <- c()
    # seq_len() is safe even when orig_df has zero rows (1:0 would not be)
    for(i in seq_len(nrow(orig_df))){
        # Skip row if we have seen either ID already
        if(orig_df[i, 'New_ID']   %in% seen_ids) next
        if(orig_df[i, 'New_ID.1'] %in% seen_ids) next
        # If both ids are new, we save them as seen and include the entry
        seen_ids <- c(seen_ids, orig_df[i, 'New_ID'], orig_df[i, 'New_ID.1'])
        included_rows[i] <- TRUE
    }
    # Keep only the rows that were marked for inclusion
    filtered_df <- orig_df[included_rows, ]
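
The loop above treats an ID as seen if it appeared in either column of a kept row. A minimal sketch of a variant that follows the question's rule more literally, tracking only the New_ID.1 values of rows that were kept, might look like this. The function name `filter_chained_rows` and the toy column names `from` and `to` are hypothetical and not part of the original post.

```r
# Hypothetical helper: keep a row unless its `id_col` value already appeared
# in `next_col` of an earlier row that was itself kept.
filter_chained_rows <- function(df, id_col, next_col) {
  keep <- logical(nrow(df))          # all FALSE to start
  seen_next <- df[[next_col]][0]     # empty vector of the matching type
  for (i in seq_len(nrow(df))) {
    # A dropped row never contributes its next_col value to seen_next
    if (df[[id_col]][i] %in% seen_next) next
    seen_next <- c(seen_next, df[[next_col]][i])
    keep[i] <- TRUE
  }
  df[keep, , drop = FALSE]
}

# The simpler example from the question (column names are assumed)
simple_df <- data.frame(
  from = c("A", "C", "B", "E"),
  to   = c("B", "D", "E", "F"),
  stringsAsFactors = FALSE
)
simple_result <- filter_chained_rows(simple_df, "from", "to")
print(simple_result)

# Only the two ID columns of orig_df, copied from the question
ids_df <- data.frame(
  New_ID   = c(557L, 588L, 581L, 580L, 591L, 551L, 300L, 112L),
  New_ID.1 = c(581L, 591L, 551L, 300L, 112L, 584L, 416L, 115L)
)
ids_result <- filter_chained_rows(ids_df, "New_ID", "New_ID.1")
print(rownames(ids_result))
```

On the simpler example this keeps A|B, C|D and E|F, and on the ID columns of `orig_df` it keeps rows 1, 2, 4, 6 and 8, the same rows the question asks for. The two approaches agree on this data, but they could differ on data where a New_ID.1 value repeats or where a New_ID matches an earlier New_ID, so which rule is wanted is worth checking against the real data.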
    
