【Posted】:2021-03-31 12:34:52
【Question】:
I have a problem with some data and am looking for your help. Here is a subset of a larger dataset containing thousands of vessels:
subset <- tibble::tribble(
~cfr, ~vessel_name, ~reg_port, ~event_start_date, ~event_end_date, ~length, ~tonnage, ~power, ~gear, ~gear_cat, ~gear_type,
"FRA000859026", "PATOLISA", "ST", "1995-04-10", "1996-09-06", 7.25, 3.02, 110, "Set gillnets (anchored)", "Entangling nets", "Passive gears",
"FRA000859026", "PATOLISA", "ST", "1996-09-07", "1996-12-31", 7.25, 3.02, 85, "Set gillnets (anchored)", "Entangling nets", "Passive gears",
"FRA000859026", "PATOLISA", "ST", "1997-01-01", "1999-12-01", 7.25, 3.02, 85, "Set gillnets (anchored)", "Entangling nets", "Passive gears",
"FRA000859026", "PATOLISA", "ST", "1999-12-02", "2000-02-03", 7.25, 3.02, 85, "Set gillnets (anchored)", "Entangling nets", "Passive gears",
"FRA000859026", "PATOLISA", "ST", "2000-02-04", "2001-06-10", 7.25, 3.02, 110, "Set gillnets (anchored)", "Entangling nets", "Passive gears",
"FRA000859026", "PATOLISA", "ST", "2001-06-11", "2001-07-23", 7.25, 3.02, 110, "Set gillnets (anchored)", "Entangling nets", "Passive gears",
"FRA000859026", "PATOLISA", "ST", "2001-07-24", "2002-12-31", 7.25, 3.02, 110, "Set gillnets (anchored)", "Entangling nets", "Passive gears",
"FRA000859026", "PATOLISA", "ST", "2003-01-01", "2004-03-10", 7.25, 3.02, 110, "Set gillnets (anchored)", "Entangling nets", "Passive gears"
)
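I would like to simplify these data so that there is a single time interval for each group of similar cfr, vessel_name, reg_port, length, tonnage, power, gear, gear_cat and gear_type. My expected result looks like this: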
tibble::tribble(
~cfr, ~vessel_name, ~reg_port, ~event_start_date, ~event_end_date, ~length, ~tonnage, ~power, ~gear, ~gear_cat, ~gear_type,
"FRA000859026", "PATOLISA", "ST", "1995-04-10", "1996-09-06", 7.25, 3.02, 110, "Set gillnets (anchored)", "Entangling nets", "Passive gears",
"FRA000859026", "PATOLISA", "ST", "1996-09-07", "2000-02-03", 7.25, 3.02, 85, "Set gillnets (anchored)", "Entangling nets", "Passive gears",
"FRA000859026", "PATOLISA", "ST", "2000-02-04", "2004-03-10", 7.25, 3.02, 110, "Set gillnets (anchored)", "Entangling nets", "Passive gears"
)
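However, whatever I try, my result always combines all the records with power = 110, even though there is an interval with power = 85 between them.
In particular, I tried a couple of things that did not work as expected:
1. group_by() and mutate()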
library(dplyr)

subset %>%
group_by(cfr, vessel_name, reg_port, length, tonnage, power, gear, gear_cat, gear_type) %>%
mutate(event_start_date = min(event_start_date), #Find oldest date for group
event_end_date = max(event_end_date)) %>% #Find most recent date for group
ungroup() %>%
distinct()
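==> All records with power = 110 are treated as similar, which creates two silly overlapping records: one with power = 110 from 1995-04-10 to 2004-03-10 and another with power = 85 from 1996-09-07 to 2000-02-03.
2. cur_group_id()
So I thought it would make sense to create a consecutive id for each group, so that I could run a similar mutate() but this time group by the id. I tried cur_group_id(), but the result is the same: all records with power = 110 are treated as similar, even though I want to keep the two time periods separate.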
subset %>%
group_by(cfr, vessel_name, reg_port, length, tonnage, power, gear, gear_cat, gear_type) %>%
mutate(id = cur_group_id()) %>%
ungroup() %>%
group_by(id) %>%
mutate(event_start_date = min(event_start_date), #Find oldest date for group
event_end_date = max(event_end_date)) %>% #Find most recent date for group
ungroup() %>%
distinct()
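Both attempts behave the same way because group_by() and cur_group_id() group rows by value, not by position, so the two separate runs of power = 110 end up in one group. A minimal sketch in base R, using only the power column from the subset above, contrasts a value-based id with a run-based id (value_id and run_id are illustrative names, not part of the original code):

```r
# power values from the subset above, in row order
power <- c(110, 85, 85, 85, 110, 110, 110, 110)

# Value-based id: equal values share an id wherever they occur,
# analogous to what grouping by power does.
value_id <- match(power, unique(power))

# Run-based id: the id increases each time the value differs
# from the previous row, so separate runs stay separate.
run_id <- cumsum(c(TRUE, power[-1] != power[-length(power)]))

print(value_id)  # 1 2 2 2 1 1 1 1
print(run_id)    # 1 2 2 2 3 3 3 3
```

Only the run-based id distinguishes the first power = 110 period from the later one.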
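How can I make sure this does not happen and get the expected output, i.e. intervals merged by group while taking changes over time into account? It is probably simple, but I cannot wrap my head around it...
Thanks!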
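One possible way to get the expected output is to number consecutive runs of identical attributes within each vessel and summarise per run. The following is a minimal sketch, not a definitive solution. It assumes rows can be sorted by event_start_date within each cfr and that adjacent rows with identical attributes should always be merged, whether or not there is a gap between their dates. The data is a reduced copy of the subset above (only cfr, the dates, tonnage and power), and vessels, key, run_id and merged are illustrative names:

```r
library(dplyr)

# Reduced copy of the subset above (fewer columns, same rows); illustrative only.
vessels <- tibble::tribble(
  ~cfr,           ~event_start_date, ~event_end_date, ~tonnage, ~power,
  "FRA000859026", "1995-04-10",      "1996-09-06",    3.02,     110,
  "FRA000859026", "1996-09-07",      "1996-12-31",    3.02,      85,
  "FRA000859026", "1997-01-01",      "1999-12-01",    3.02,      85,
  "FRA000859026", "1999-12-02",      "2000-02-03",    3.02,      85,
  "FRA000859026", "2000-02-04",      "2001-06-10",    3.02,     110,
  "FRA000859026", "2001-06-11",      "2001-07-23",    3.02,     110,
  "FRA000859026", "2001-07-24",      "2002-12-31",    3.02,     110,
  "FRA000859026", "2003-01-01",      "2004-03-10",    3.02,     110
) %>%
  mutate(across(c(event_start_date, event_end_date), as.Date))

merged <- vessels %>%
  arrange(cfr, event_start_date) %>%
  group_by(cfr) %>%
  # key combines the attributes that must stay constant within a run;
  # with the full data, paste all attribute columns here.
  # run_id increases whenever key differs from the previous row.
  mutate(key = paste(tonnage, power, sep = "|"),
         run_id = cumsum(key != lag(key, default = first(key)))) %>%
  group_by(cfr, tonnage, power, run_id) %>%
  summarise(event_start_date = min(event_start_date),  # earliest start in the run
            event_end_date   = max(event_end_date),    # latest end in the run
            .groups = "drop") %>%
  arrange(cfr, event_start_date) %>%
  select(cfr, event_start_date, event_end_date, tonnage, power)

print(merged)
```

On this reduced data the result has three rows: power = 110 from 1995-04-10 to 1996-09-06, power = 85 from 1996-09-07 to 2000-02-03, and power = 110 from 2000-02-04 to 2004-03-10. As alternatives to the pasted key, data.table::rleid() or, in dplyr 1.1.0 and later, consecutive_id() build the same kind of run id directly from several columns.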
【Comments】:
-
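In your subset data frame, rows 2 to 4 all have power = 85, yet their start and end dates are consecutive. May I ask whether your original data frame contains any rows covering the same interval but with power equal to 110? Otherwise the break is unavoidable.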
-
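No, each row corresponds to a change in one of the columns, so intervals for the same vessel but with different parameters should not be treated as similar...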
Tags: r dplyr data.table