【问题标题】:Removing list components based on a criteria根据条件删除列表组件
【发布时间】:2021-12-14 02:30:42
【问题描述】:

我有两个基于ID 的数据集split(),产生两个列表:julyjuly2。在其中一个列表中,ID D 只有两个组件,而另一个则有 3 个组件。目标是创建一个函数来查找这些违规行为,基于列表之一没有特定数字(在本例中为 3)这一事实,并从两个列表中删除 ID。有没有有效的方法来做到这一点?

例如,每个列表都有 4 个IDs (A,B,C,D)。对于每个 ID,我在列表中为 7 月份的特定 10 天间隔创建一个数据框(例如,[[1]]ID A 的 7 月的前 10 天,[[2]]ID A 和[[3]] 是 7 月的第二个 10 天,ID 是 7 月的第三个 10 天,对于 BCD 再次开始。) july2 列表,第三个 10 天间隔被模拟为不存在,我想要一种方法将 IDjulyjuly2 列表中删除,因为它缺少最后一个间隔。

library(dplyr)
library(lubridate)

ID <-  rep(c("A","B","C", "D"), 5000)
date <-  rep_len(seq(dmy("01-01-2010"), dmy("31-01-2011"), by = "days"), 500)
x <-  runif(length(date), min = 60000, max = 80000)
y <-  runif(length(date), min = 800000, max = 900000)

ID2 <- rep(c("A", "B", "C", "D"), 5000)
date2 <-  rep_len(seq(dmy("01-01-2010"), dmy("21-01-2011"), by = "days"), 500)
x2 <-  runif(length(date2), min = 60000, max = 80000)
y2 <-  runif(length(date2), min = 800000, max = 900000)

df <- data.frame(date = date, 
                 x = x,
                 y =y,
                 ID)

df2 <- data.frame(date = date2, 
                  x = x2,
                  y =y2,
                  ID2)

df$jDate <- julian(as.Date(df$date), origin = as.Date("1970-01-01"))
df$Month <- month(df$date)
df2$jDate <- julian(as.Date(df2$date), origin = as.Date("1970-01-01"))
df2$Month <- month(df2$date)

july <- df %>%
  # Creates a new column assigning the first day in the 10-day interval in which
  # the date falls under (e.g., 01-03-2021 would be in the first 10-day interval
  # so the `floor_date` assigned to it would be 01-01-2021)
  mutate(new = floor_date(date, "10 days")) %>%
  # For any months that has 31 days, the 31st day would normally be assigned its 
  # own interval. The code below takes the 31st day and joins it with the 
  # previous interval. 
  group_by(ID) %>% 
  mutate(new = if_else(day(new) == 31, new - days(10), new)) %>% 
  group_by(new, .add = TRUE) %>%
  # Filter the data by the season based on the `season_categ` column
  filter(Month == "7") %>% 
  group_split()

july2 <- df2 %>%
  # Creates a new column assigning the first day in the 10-day interval in which
  # the date falls under (e.g., 01-03-2021 would be in the first 10-day interval
  # so the `floor_date` assigned to it would be 01-01-2021)
  mutate(new = floor_date(date, "10 days")) %>%
  # For any months that has 31 days, the 31st day would normally be assigned its 
  # own interval. The code below takes the 31st day and joins it with the 
  # previous interval. 
  group_by(ID2) %>% 
  mutate(new = if_else(day(new) == 31, new - days(10), new)) %>% 
  group_by(new, .add = TRUE) %>%
  # Filter the data by the season based on the `season_categ` column
  filter(Month == "7") %>% 
  group_split()

july2 <- july2[-12]

names(july) <- sapply(july, function(x) paste(x$ID[1]))
names(july2) <- sapply(july2, function(x) paste(x$ID2[1]))

【问题讨论】:

  • 这个问题与另一个问题有点不同,因为两个列表都包含相同的IDs,我想根据特定标准浏览一个列表并删除@ 987654346@ 来自两个列表。
  • 区别在于july2对于ID D只有两个列表元素,而july对于ID有三个列表元素。

标签: r list dplyr tidyverse


【解决方案1】:

如果您只想查找并删除计数不正确的 ID,这可能很容易。

library(tibble)

remove_component <- function(list1, list2) {
  
  bad_id <- full_join(
    enframe(table(names(list1)), "ID", "list1"),
    enframe(table(names(list2)), "ID", "list2"),
    by = "ID") %>% 
    filter(list1 != list2) %>% 
    distinct(ID) %>% 
    pull()
  
  list(new_list1 = list1[which(!((names(list1) %in% bad_id)))],
       new_list2 = list2[which(!((names(list2) %in% bad_id)))])

}

remove_component(july, july2)

我认为您制作这样的虚假数据是有原因的。否则,我不会拆分列表。如果您将数据堆叠起来,这可能会更容易、更精确。

july_full <- map_dfr(july, I) # binding rows, no idea why bind_rows is failing for me
july2_full <- map_dfr(july2, I)

# find ID and data periods that are not in both
missing_elements <- anti_join(
  distinct(july_full, ID, new),
  distinct(july2_full, ID2, new),
  by = c(ID = "ID2", "new")
)  
  
# remove that ID from both data sets
july_full %>% 
  anti_join(missing_elements, by = "ID")

july2_full %>% 
  anti_join(missing_elements, by = "ID")

【讨论】:

    猜你喜欢
    • 2011-11-29
    • 2022-12-23
    • 2022-01-20
    • 2018-10-12
    • 1970-01-01
    • 1970-01-01
    • 2016-12-25
    • 2017-05-09
    • 1970-01-01
    相关资源
    最近更新 更多