R: Loop through a set of values in one dataframe to update a second dataframe

【Question title】: R: Loop through a set of values in one dataframe update a second dataframe
【Posted】: 2020-06-11 21:53:47
【Question】:

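Updated with a more realistic example; this time with duplicates added in interp_b.

I am trying to populate a field in one dataframe (interp_b) with values from a second dataframe (bait). For each row of interp_b, I want to look at obs_datetime and determine the last time that plot-station-year was baited before obs_datetime. This will later be used to calculate the time since baiting for each obs_datetime. The baiting times are in the bait_datetime column of the bait dataframe. The result should go into a field named latestbait_datetime in the interp_b dataframe.

I am picturing an iterative process in which interp_b's latestbait_datetime is recalculated repeatedly until the last row of the bait dataframe is reached. The for loop I tried evidently runs through the rows and performs the specified calculation, but I cannot get the output in the form I want: it produces separate output for each pass of the loop instead of overwriting and updating the interp_b dataframe.

Here is some code to build the two dataframes, interp_b and bait (please excuse the crudeness):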

# interp_b dataframe----

interp_b <- structure(list(plot_station_year = c("Cow_C2_2019", "RidingStable_C3_2018", 
"RidingStable_C3_2018", "Raf_C1_2018", "Metcalfe_C2_2019"), obs_datetime = structure(c(1559487600, 
1544954400, 1541084400, 1515160800, 1567756800), class = c("POSIXct", 
"POSIXt"), tzone = "UTC"), latestbait_datetime = structure(c(NA_real_, 
NA_real_, NA_real_, NA_real_, NA_real_), class = c("POSIXct", 
"POSIXt"))), class = c("spec_tbl_df", "tbl_df", "tbl", "data.frame"
), row.names = c(NA, -5L))

# bait dataframe----

bait <- structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019", 
"RidingStable_C3_2018", "Raf_C1_2018"), bait_datetime = structure(c(1557500400, 
1559746800, 1543676400, 1491318000), class = c("POSIXct", "POSIXt"
), tzone = "UTC")), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -4L), spec = structure(list(
    cols = list(plot_station_year = structure(list(), class = c("collector_character", 
    "collector")), bait_datetime = structure(list(format = "%d-%m-%Y %H:%M"), class = c("collector_datetime", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1), class = "col_spec"))

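The desired result should look like this

Below are my two attempts. The first yields a dataframe containing only the final pass of the loop; the second yields a dataframe containing the results of every pass (as you would expect from binding rows).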

library(tidyverse)

#attempt #1----
for (i in 1:nrow(bait)) {

  print(paste("row =",i))

  interpbait <- interp_b %>% 
    mutate(latestbait_datetime = if_else((plot_station_year == bait$plot_station_year[i] & (obs_datetime >= bait$bait_datetime[i] & (is.na(latestbait_datetime) | latestbait_datetime < bait$bait_datetime[i]))), bait$bait_datetime[i], latestbait_datetime))

}


#attempt #2----
resultb <- data.frame()

for (i in 1:nrow(bait)) { 

  print(paste("row =",i))

  interpbait2 <- interp_b %>% 
    mutate(latestbait_datetime = if_else((plot_station_year == bait$plot_station_year[i] & (obs_datetime >= bait$bait_datetime[i] & (is.na(latestbait_datetime) | latestbait_datetime < bait$bait_datetime[i]))), bait$bait_datetime[i], latestbait_datetime))

  resultb <- bind_rows(resultb, interpbait2)

  print(resultb)
}
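Attempt #1 keeps only the final pass because every iteration starts again from the unmodified interp_b, so earlier updates are discarded. A minimal sketch of the accumulate-in-place idea follows, assuming the column names above. It is written in base R so it runs without extra packages; the object name result is hypothetical, and the data are rebuilt with data.frame() only to keep the block self-contained.

```r
# Helper (hypothetical name): seconds since the epoch -> POSIXct in UTC
to_utc <- function(secs) as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

interp_b <- data.frame(
  plot_station_year = c("Cow_C2_2019", "RidingStable_C3_2018",
                        "RidingStable_C3_2018", "Raf_C1_2018", "Metcalfe_C2_2019"),
  obs_datetime = to_utc(c(1559487600, 1544954400, 1541084400,
                          1515160800, 1567756800)),
  stringsAsFactors = FALSE
)
interp_b$latestbait_datetime <- to_utc(rep(NA_real_, nrow(interp_b)))

bait <- data.frame(
  plot_station_year = c("Cow_C2_2019", "Cow_C2_2019",
                        "RidingStable_C3_2018", "Raf_C1_2018"),
  bait_datetime = to_utc(c(1557500400, 1559746800, 1543676400, 1491318000)),
  stringsAsFactors = FALSE
)

# Start from a copy and keep updating that same copy on every pass
result <- interp_b
for (i in seq_len(nrow(bait))) {
  # Rows for the same plot-station-year, observed at or after this bait time,
  # whose current latest bait is missing or older than this bait time
  hit <- result$plot_station_year == bait$plot_station_year[i] &
    result$obs_datetime >= bait$bait_datetime[i] &
    (is.na(result$latestbait_datetime) |
       result$latestbait_datetime < bait$bait_datetime[i])
  result$latestbait_datetime[hit] <- bait$bait_datetime[i]
}

print(result)
```

The condition is the same one used in the attempts above (including >=, so a bait at exactly obs_datetime counts); only the object being updated changes.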

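Any help would be greatly appreciated.

【Comments】:

Akrun, I'm new to StackOverflow and may have accidentally deleted your valuable comment when I updated the post. Sorry about that.

Tags: r loops dataframe for-loop iteration

【Solution 1】: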

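I'm not sure how long this will take, but here is a tidyverse solution. For each row in interp_b, we filter the bait dataframe to the matching plot_station_year and keep only the bait datetimes earlier than that row's obs_datetime. We then arrange the filtered bait data by datetime in descending order (so that the most recent date comes first) and slice the first row, leaving only the most recent date. Finally we "pull" the datetime out of the dataframe and write it into the corresponding row of interp_b.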

library(tidyverse)
library(progress) # for progress bar

# create progress bar to update, so that you can estimate the amount of time it will take to finish the entire loop
pb <- progress_bar$new(total = nrow(interp_b))

for (i in 1:nrow(interp_b)) {

  last_time_baited <- bait %>% 
    #filter bait dataframe to appropriate plot, station, year based on
    # the row in interp_b
    filter(plot_station_year == interp_b$plot_station_year[i],
           # ensure all datetimes are less than that row in interp_b
           bait_datetime < interp_b$obs_datetime[i]) %>% 
    # arrange by datetime (most recent datetimes first)
    arrange(desc(bait_datetime)) %>% 
    # take the top row - this will be the most recent date-time that
    # the plot-station was baited
    slice(1) %>% 
    # "pull" that value out of the dataframe so you have a value, 
    # not a tibble
    pull(bait_datetime)

  # update the row in interp_b with the date_time baited
  interp_b$latestbait_datetime[i] <- last_time_baited

  pb$tick() # print progress
}

The resulting table (interp_b) matches your expected output:

# A tibble: 5 x 3
  plot_station_year    obs_datetime        latestbait_datetime
  <chr>                <dttm>              <dttm>             
1 Cow_C2_2019          2019-06-02 15:00:00 2019-05-10 11:00:00
2 RidingStable_C3_2018 2018-12-16 10:00:00 2018-12-01 10:00:00
3 RidingStable_C3_2018 2018-11-01 15:00:00 NA                 
4 Raf_C1_2018          2018-01-05 14:00:00 2017-04-04 11:00:00
5 Metcalfe_C2_2019     2019-09-06 08:00:00 NA
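The per-row lookup described above can also be written as a small function that makes the no-match case (rows 3 and 5) explicit and then derives the time since baiting that the question mentions as the eventual goal. This is a sketch in base R under the same column names; latest_bait_before, to_utc and hours_since_bait are hypothetical names, not part of the original answer.

```r
# Helper (hypothetical name): seconds since the epoch -> POSIXct in UTC
to_utc <- function(secs) as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

interp_b <- data.frame(
  plot_station_year = c("Cow_C2_2019", "RidingStable_C3_2018",
                        "RidingStable_C3_2018", "Raf_C1_2018", "Metcalfe_C2_2019"),
  obs_datetime = to_utc(c(1559487600, 1544954400, 1541084400,
                          1515160800, 1567756800)),
  stringsAsFactors = FALSE
)
bait <- data.frame(
  plot_station_year = c("Cow_C2_2019", "Cow_C2_2019",
                        "RidingStable_C3_2018", "Raf_C1_2018"),
  bait_datetime = to_utc(c(1557500400, 1559746800, 1543676400, 1491318000)),
  stringsAsFactors = FALSE
)

# Most recent bait time (in seconds) strictly before obs_time for one plot,
# or NA when that plot was never baited before the observation
latest_bait_before <- function(plot, obs_time) {
  earlier <- bait$bait_datetime[bait$plot_station_year == plot &
                                  bait$bait_datetime < obs_time]
  if (length(earlier) == 0) return(NA_real_)
  as.numeric(max(earlier))
}

latest_secs <- vapply(seq_len(nrow(interp_b)), function(i) {
  latest_bait_before(interp_b$plot_station_year[i], interp_b$obs_datetime[i])
}, numeric(1))

interp_b$latestbait_datetime <- to_utc(latest_secs)

# Time since baiting, in hours; NA where no earlier bait exists
interp_b$hours_since_bait <- as.numeric(
  difftime(interp_b$obs_datetime, interp_b$latestbait_datetime, units = "hours")
)

print(interp_b)
```

Because the column is created with tz = "UTC", the bait times print in UTC rather than in the session's local time zone.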

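【Comments】:

Hi Nova, it seems to run smoothly on a small subsample; I'll try a larger sample tomorrow. Thanks!

【Solution 2】:

You can use data.table to perform an outer join and then, for each plot_station_year, select the highest bait_datetime below obs_datetime.

Edit: I edited the answer to reflect the possibility that interp2 may contain multiple obs_datetime values for a given plot_station_year. To preserve them, we index them and include the index in the filtering step.

A potential improvement for large files (untested) might be to merge using roll rather than doing an outer merge followed by filtering.

That version is shown at the end of the reproducible example: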

library(data.table)

interp2 <- structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019", "RidingStable_C3_2018", 
    "Raf_C1_2018", "Metcalfe_C2_2019"), obs_datetime = structure(c(1559487600, 1559487300,
        1544954400, 1515160800, 1567756800), class = c("POSIXct", "POSIXt"
        ), tzone = "UTC"), latestbait_datetime = structure(c(NA_real_, 
            NA_real_, NA_real_, NA_real_, NA_real_), class = c("POSIXct", "POSIXt"))), class = c("spec_tbl_df", 
                "tbl_df", "tbl", "data.frame"), row.names = c(NA, -5L))

bait2 <- structure(list(plot_station_year = c("Cow_C2_2019", "Cow_C2_2019",  "Cow_C2_2019",
    "RidingStable_C3_2018", "Raf_C1_2018"), bait_datetime = structure(c(1557500400, 
        1496674800, 1576674800, 1543676400, 1491318000), class = c("POSIXct", "POSIXt"
        ), tzone = "UTC")), class = c("spec_tbl_df", "tbl_df", "tbl", 
            "data.frame"), row.names = c(NA, -5L), spec = structure(list(
                cols = list(plot_station_year = structure(list(), class = c("collector_character", 
                    "collector")), bait_datetime = structure(list(format = "%d-%m-%Y %H:%M"), class = c("collector_datetime", 
                        "collector"))), default = structure(list(), class = c("collector_guess", 
                            "collector")), skip = 1), class = "col_spec"))


# add index idx by plot_station_year, remove empty column, set keys
setDT(interp2)[, "latestbait_datetime" := NULL][, idx := 1:.N, by=plot_station_year]
setkeyv(interp2, c("plot_station_year", "idx", "obs_datetime"))

# same for bait2: set as data.table, set keys
setDT(bait2, key=c("plot_station_year", "bait_datetime"))

## option 1: merge files, then filter
# outer join on interp2 and bait2 on first column (and order by bait_datetime)
expected_out <- merge(interp2, bait2, by="plot_station_year", all=TRUE)

# set keys for sorting
setkey(expected_out, plot_station_year, idx, bait_datetime)

# select highest bait_datetime below obs_datetime by plot_station_year and idx
expected_out <- expected_out[is.na(bait_datetime) | bait_datetime < obs_datetime][,
    tail(.SD, 1), by=.(plot_station_year, idx)]

# rename and sort columns
setnames(expected_out, old="bait_datetime", new="latestbait_datetime")
setorder(expected_out, -latestbait_datetime, idx, na.last = TRUE)[]
#>       plot_station_year idx        obs_datetime latestbait_datetime
#> 1:          Cow_C2_2019   1 2019-06-02 15:00:00 2019-05-10 15:00:00
#> 2:          Cow_C2_2019   2 2019-06-02 14:55:00 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018   1 2018-12-16 10:00:00 2018-12-01 15:00:00
#> 4:          Raf_C1_2018   1 2018-01-05 14:00:00 2017-04-04 15:00:00
#> 5:     Metcalfe_C2_2019   1 2019-09-06 08:00:00                <NA>


## option 2 (might use less memory): rolling join

bait2[, latestbait_datetime := bait_datetime]
out_alt <- bait2[interp2, .(plot_station_year, obs_datetime, idx, latestbait_datetime), 
    on=c("plot_station_year", "bait_datetime==obs_datetime"), roll=Inf]

# order
setorder(out_alt, -latestbait_datetime, idx, na.last = TRUE)[]
#>       plot_station_year        obs_datetime idx latestbait_datetime
#> 1:          Cow_C2_2019 2019-06-02 15:00:00   1 2019-05-10 15:00:00
#> 2:          Cow_C2_2019 2019-06-02 14:55:00   2 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018 2018-12-16 10:00:00   1 2018-12-01 15:00:00
#> 4:          Raf_C1_2018 2018-01-05 14:00:00   1 2017-04-04 15:00:00
#> 5:     Metcalfe_C2_2019 2019-09-06 08:00:00   1                <NA>
setcolorder(out_alt, c(1,3,2,4))[]
#>       plot_station_year idx        obs_datetime latestbait_datetime
#> 1:          Cow_C2_2019   1 2019-06-02 15:00:00 2019-05-10 15:00:00
#> 2:          Cow_C2_2019   2 2019-06-02 14:55:00 2019-05-10 15:00:00
#> 3: RidingStable_C3_2018   1 2018-12-16 10:00:00 2018-12-01 15:00:00
#> 4:          Raf_C1_2018   1 2018-01-05 14:00:00 2017-04-04 15:00:00
#> 5:     Metcalfe_C2_2019   1 2019-09-06 08:00:00                <NA>

## test that both options give the same result:

identical(expected_out, out_alt)
#> [1] TRUE
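The "last bait before each observation" lookup that the rolling join performs can also be sketched in base R with findInterval(). This is an analogue for illustration, not the data.table implementation: it uses left.open = TRUE so that only bait times strictly before obs_datetime count (matching option 1), whereas a rolling join also accepts an exact match on the join column. The names obs, baits, bait_by_plot and to_utc are hypothetical.

```r
# Helper (hypothetical name): seconds since the epoch -> POSIXct in UTC
to_utc <- function(secs) as.POSIXct(secs, origin = "1970-01-01", tz = "UTC")

obs <- data.frame(
  plot_station_year = c("Cow_C2_2019", "Cow_C2_2019", "RidingStable_C3_2018",
                        "Raf_C1_2018", "Metcalfe_C2_2019"),
  obs_datetime = to_utc(c(1559487600, 1559487300, 1544954400,
                          1515160800, 1567756800)),
  stringsAsFactors = FALSE
)
baits <- data.frame(
  plot_station_year = c("Cow_C2_2019", "Cow_C2_2019", "Cow_C2_2019",
                        "RidingStable_C3_2018", "Raf_C1_2018"),
  bait_datetime = to_utc(c(1557500400, 1496674800, 1576674800,
                           1543676400, 1491318000)),
  stringsAsFactors = FALSE
)

# Sorted bait times (in seconds) for each plot_station_year
bait_by_plot <- lapply(
  split(as.numeric(baits$bait_datetime), baits$plot_station_year),
  sort
)

latest <- rep(NA_real_, nrow(obs))
for (p in unique(obs$plot_station_year)) {
  times <- bait_by_plot[[p]]
  if (is.null(times)) next  # plot never baited: leave NA
  rows <- which(obs$plot_station_year == p)
  # Number of bait times strictly before each observation;
  # 0 means no earlier bait exists for that observation
  pos <- findInterval(as.numeric(obs$obs_datetime[rows]), times,
                      left.open = TRUE)
  found <- pos > 0
  latest[rows[found]] <- times[pos[found]]
}
obs$latestbait_datetime <- to_utc(latest)

print(obs)
```

Each plot's bait times are sorted once and every observation is then located by binary search, which is the same reason a rolling join can be cheaper than an outer merge followed by filtering.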

【讨论】：

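Thanks, user12728748. I will take a look at your solution. In the real database, interp_b has 3.5 million records/rows, and each plot has multiple obs_datetime values, so I am not looking for the maximum but for the value that is closest to, but before, obs_datetime. This is a database of trail-camera observations, and the second data table (bait) lists when each station was baited. In the end I will obtain a time-since-bait value for every observation.
I have also updated the dataframe names in the example to match my code.
If "multiple obs_datetimes per plot" means that plot_station_year is not unique in interp2 and you want to keep them all, you need to add an index and include it when selecting the value that is closest to but before obs_datetime. You should provide an example illustrating this; then I can edit the answer to reflect it.
Well, if you want to filter for the highest value below obs_datetime and keep the stations with NA values, just change expected_out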