【Question title】: R For Loop and If-else data.table
【Posted】: 2023-04-09 09:48:02
【Question】:

I'm stuck on a for loop I'm trying to write. The sample dataset is below:

ex <- structure(list(person_id = c("79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
"79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
"8b6ea77b-e694-48fb-a9e9-ca8bf1accc65"), prs_nat_key = c("8240588160001", 
"8240588160001", "8240588160001", "8240588160001", "106705689", 
"106705689", "106705689", "106705689"), serv_from_dt = structure(c(18262, 
18262, 18262, 18262, 18278, 18278, 18278, 18278), class = "Date"), 
    serv_to_dt = structure(c(18262, 18262, 18262, 18265, 18282, 
    18282, 18299, 18299), class = "Date"), new_pos = c("IP", 
    "IP", "IP", "IP", "IP", "IP", "IP", "IP"), days_diff = c(0, 
    0, 0, 3, 4, 4, 21, 21)), row.names = c(NA, -8L), class = c("data.table", 
"data.frame"))

I'm trying to create a new column called start_date. It will be derived from each person_id's serv_from_dt and serv_to_dt dates. So far, my approach is as follows:

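For each person_id, find the unique serv_from_dt values where the difference in days between serv_from_dt and serv_to_dt is greater than 0 (call this diff_date); then, row by row, check whether serv_frm_dt >= the MAX unique diff_date for that person_id, and serv_to_dt …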

 values=ex[,.(uniqueN(sort(unique(serv_to_dt[ex$days_diff>0]), TRUE))), person_id]
    n = as.numeric(values[,1])
    m = as.numeric(values[,2])

for (i in m){
  ex[,`:=`(min_start = fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[1] & 
                             serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[1]), 
                           sort(unique(serv_from_dt[ex$days_diff>0]))[1], fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[i] & 
                                                                                     serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[i]), 
                                                                                  sort(unique(serv_from_dt[ex$days_diff>0]))[i], serv_from_dt)),
           max_end = fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[1] & 
                                  serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[1]), 
                               sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[1], fifelse((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[i] & 
                                                                                         serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[i]), 
                                                                                      sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[i], serv_from_dt))), prs_nat_key]
}

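The code above does exactly what I want, but I don't know how to scale it to a larger dataset with multiple person_ids and multiple day_diffs. I'd like it to work so that if the serv_frm/serv_to_dts do not fall within the max unique diff_date, it moves on to the next unique diff_date. In this case both person_ids have only 1 unique diff_date (so m = 1), but I want the code to still hold when m > 1. I also tried doing this in base R, but I keep getting errors: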

for(j in 1:m){

    
    ex[, min_start := if((serv_to_dt<= sort(unique(serv_to_dt[ex$days_diff>0]), TRUE)[j] & 
                          serv_from_dt>= sort(unique(serv_from_dt[ex$days_diff>0]))[j])) sort(unique(serv_from_dt[ex$days_diff>0]))[j]]
  j = j+ 1
  
}

Any help would be greatly appreciated.

【Comments】:

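  • Please check your "working code": I get Error: 'list' object cannot be coerced to type 'double', because you reference values[,1], which returns a data.table from a data.table, not a vector; on top of that, values[,1] cannot be as.numeric-ified, since it appears to be a GUID.
  • FYI, whenever you have nested fifelse statements, I strongly suggest looking at fcase: it's easier to read and easier to maintain.
  • @r2evans Thanks for the suggestion! I didn't even know about fcase, I'll definitely look into it!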
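
To illustrate the fcase suggestion above, here is a minimal sketch on made-up data (the table `dt` and the labels are hypothetical, not from the question). It assumes data.table 1.13.0 or later, where fcase is available, and shows that a two-level nested fifelse and a flat fcase give the same labels:

```r
library(data.table)

# Hypothetical toy data, only to compare the two styles.
dt <- data.table(days_diff = c(0, 3, 21))

# Nested fifelse: each extra condition adds another level of nesting.
dt[, label_nested := fifelse(days_diff == 0, "same day",
                             fifelse(days_diff <= 7, "short", "long"))]

# fcase: condition/value pairs are evaluated in order; `default` covers the rest.
dt[, label_fcase := fcase(days_diff == 0, "same day",
                          days_diff <= 7, "short",
                          default = "long")]

print(dt)
```

With more than two or three conditions, the fcase form stays flat, which is the readability gain the comment refers to.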

Tags: r dataframe for-loop if-statement data.table


【Solution 1】:

I'm not sure what your final result should be, but this looks overly complicated. For example, the date_period table you create can be built like this:

ex[, .(unique_start = first(serv_from_dt), unique_end = last(serv_to_dt)), by = c("prs_nat_key", "serv_from_dt")]

#      prs_nat_key serv_from_dt unique_start unique_end
# 1: 8240588160001   2020-01-01   2020-01-01 2020-01-04
# 2: 8240588160001   2020-01-14   2020-01-14 2020-01-17
# 3:     106705689   2020-01-17   2020-01-17 2020-02-07

You then seem to join this back onto the original table. If that is what you want, this is all you need on the original table you posted:

ex[, `:=` (start_date = first(serv_from_dt), end_date = last(serv_to_dt)), by = c("prs_nat_key", "serv_from_dt")]

#                                person_id   prs_nat_key serv_from_dt serv_to_dt new_pos days_diff start_date   end_date
#  1: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-01      IP         0 2020-01-01 2020-01-04
#  2: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-01      IP         0 2020-01-01 2020-01-04
#  3: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-01      IP         0 2020-01-01 2020-01-04
#  4: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-01 2020-01-04      IP         3 2020-01-01 2020-01-04
#  5: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-14      IP         0 2020-01-14 2020-01-17
#  6: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-17      IP         3 2020-01-14 2020-01-17
#  7: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-17      IP         3 2020-01-14 2020-01-17
#  8: 79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8 8240588160001   2020-01-14 2020-01-17      IP         3 2020-01-14 2020-01-17
#  9: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-01-21      IP         4 2020-01-17 2020-02-07
# 10: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-01-21      IP         4 2020-01-17 2020-02-07
# 11: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-02-07      IP        21 2020-01-17 2020-02-07
# 12: 8b6ea77b-e694-48fb-a9e9-ca8bf1accc65     106705689   2020-01-17 2020-02-07      IP        21 2020-01-17 2020-02-07

【Comments】:

    【Solution 2】:

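    My end goal is to create two new columns called min_start and max_end. I realized I can do a join instead of writing ifelse statements. Here are my steps, using a slightly larger sample dataset: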

    ex <- structure(list(person_id = c("79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
    "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
    "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
    "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", 
    "79d8c6ee-62f4-4a09-a31e-a3d1a48d79a8", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
    "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65", 
    "8b6ea77b-e694-48fb-a9e9-ca8bf1accc65"), prs_nat_key = c("8240588160001", 
    "8240588160001", "8240588160001", "8240588160001", "8240588160001", 
    "8240588160001", "8240588160001", "8240588160001", "106705689", 
    "106705689", "106705689", "106705689"), serv_from_dt = structure(c(18262, 
    18262, 18262, 18262, 18275, 18275, 18275, 18275, 18278, 18278, 
    18278, 18278), class = "Date"), serv_to_dt = structure(c(18262, 
    18262, 18262, 18265, 18275, 18278, 18278, 18278, 18282, 18282, 
    18299, 18299), class = "Date"), new_pos = c("IP", "IP", "IP", 
    "IP", "IP", "IP", "IP", "IP", "IP", "IP", "IP", "IP"), days_diff = c(0, 
    0, 0, 3, 0, 3, 3, 3, 4, 4, 21, 21)), row.names = c(NA, -12L), class = c("data.table", 
    "data.frame"))
    

    Create a new data frame with only the unique start/end dates for each person:

    # %<>% comes from magrittr and distinct() from dplyr, so both must be attached
    library(magrittr)
    library(dplyr)
    date_period <- ex[, .(unique_start = unique(serv_from_dt[days_diff>0]),
                          unique_end = unique(serv_to_dt[days_diff>0])), prs_nat_key][order(prs_nat_key,unique_start,-unique_end),]
    
    date_period %<>% distinct(prs_nat_key, unique_start, .keep_all = TRUE) %>% setDT()
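
The de-duplication step can also be written with data.table alone, which avoids the extra packages. The following is only a sketch on hypothetical data (the table `periods` and its values are made up): it sorts so that the latest end date comes first within each start date, then keeps the first row per key and start date, which is what the distinct() call with .keep_all = TRUE does.

```r
library(data.table)

# Hypothetical candidate periods; key "A" has two end dates for the same start.
periods <- data.table(
  prs_nat_key  = c("A", "A", "A", "B"),
  unique_start = as.Date(c("2020-01-01", "2020-01-01", "2020-01-14", "2020-01-17")),
  unique_end   = as.Date(c("2020-01-02", "2020-01-04", "2020-01-17", "2020-02-07"))
)

# Sort by key and start date, with the latest end date first within each start.
setorder(periods, prs_nat_key, unique_start, -unique_end)

# Keep the first row for each (key, start) pair, i.e. the one with the latest end.
periods <- unique(periods, by = c("prs_nat_key", "unique_start"))

print(periods)
```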
    

    Do a left join on this condition: if date_period$prs_nat_key = ex$prs_nat_key & ex$serv_from_dt >= date_period$unique_start & ex$serv_from_dt = date_period$unique_start & ex$serv_to_dt …

    ex[, c("start_date", "end_date") := 
                 date_period[ex, # join
                     .(unique_start, unique_end),
                     on = .(unique_start < serv_from_dt,
                            unique_start < serv_to_dt,
                            unique_end > serv_to_dt,
                            unique_end > serv_from_dt,
                            prs_nat_key = prs_nat_key)]]
    

    I found this approach in this question --> Conditional join in data.table?
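
For readers who want to try the conditional join on a small case, here is a hedged sketch written as a data.table update join. It is not the exact code above: the tables `ex_demo` and `period_demo` are hypothetical, and the inclusive bounds (>= and <=) are an assumption about the intended condition. Strict inequalities would not match rows whose dates equal a period boundary, which is the case for every row in this toy data.

```r
library(data.table)

# Hypothetical service rows (a reduced stand-in for `ex`).
ex_demo <- data.table(
  prs_nat_key  = c("A", "A", "A", "B", "B"),
  serv_from_dt = as.Date(c("2020-01-01", "2020-01-01", "2020-01-14",
                           "2020-01-17", "2020-01-17")),
  serv_to_dt   = as.Date(c("2020-01-01", "2020-01-04", "2020-01-17",
                           "2020-01-21", "2020-02-07"))
)

# Hypothetical periods (a stand-in for `date_period`), one row per episode.
period_demo <- data.table(
  prs_nat_key  = c("A", "A", "B"),
  unique_start = as.Date(c("2020-01-01", "2020-01-14", "2020-01-17")),
  unique_end   = as.Date(c("2020-01-04", "2020-01-17", "2020-02-07"))
)

# Update join: for every row of ex_demo that lies inside a period of the same
# key, copy that period's bounds. In `on`, the left-hand names are columns of
# ex_demo and the right-hand names are columns of period_demo; the i. prefix
# reads values from period_demo.
ex_demo[period_demo,
        on = .(prs_nat_key,
               serv_from_dt >= unique_start,
               serv_to_dt   <= unique_end),
        `:=`(start_date = i.unique_start, end_date = i.unique_end)]

print(ex_demo)
```

Rows that fall inside no period keep NA in both new columns. If periods overlap, a row can match more than one of them and the last matching period wins, so overlapping periods would need to be merged first.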

    【Comments】:

    • Can you post the final output table you want as the result?