【Question Title】: R: Simplifying long ifelse statement
【Posted】: 2018-05-04 16:19:19
【Question】:

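I'm trying to create new variables from a procedure code variable with 2500+ values in a medical dataset, to pull out the antibiotics, their doses, and their routes. I've been able to do it with an ifelse statement, but it's time-consuming, and mistakes are hard to find and correct. Is there a streamlined way to do this? Unfortunately, the codes are not organized in any logical way.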

vet <-mutate(vet, ab = ifelse(ProcedureCode=="6160"|ProcedureCode=="2028"|ProcedureCode=="6121"|ProcedureCode=="6130"|ProcedureCode=="6131"|ProcedureCode=="6132"|ProcedureCode=="6133" |ProcedureCode=="6134"|ProcedureCode=="6135"|ProcedureCode=="6136"|ProcedureCode=="6090" |ProcedureCode=="6137"|ProcedureCode=="6138"|ProcedureCode=="6139" |ProcedureCode=="6140" |ProcedureCode=="6510"|ProcedureCode=="680D" |ProcedureCode=="633E"|ProcedureCode=="661J"|ProcedureCode=="627I" |ProcedureCode=="6198"|ProcedureCode=="6199"|ProcedureCode=="6200" |ProcedureCode=="6201" |ProcedureCode=="6202"|ProcedureCode=="622G" |ProcedureCode=="697C" |ProcedureCode=="698C" |ProcedureCode=="6204"|ProcedureCode=="6775"| ProcedureCode=="6229" |ProcedureCode=="6207" |ProcedureCode=="6203" |ProcedureCode=="6205" |ProcedureCode=="6206" |ProcedureCode=="6212" |ProcedureCode=="6213" |ProcedureCode=="6214" |ProcedureCode=="6215" |ProcedureCode=="6216" |ProcedureCode=="6219" |ProcedureCode=="692C" |ProcedureCode=="643C" |ProcedureCode=="601E" |ProcedureCode=="629G" |ProcedureCode=="6234" |ProcedureCode=="6235" |ProcedureCode=="6236" |ProcedureCode=="6237" |ProcedureCode=="6238" |ProcedureCode=="615J" |ProcedureCode=="6242" |ProcedureCode=="6243" |ProcedureCode=="6244" |ProcedureCode=="6245" |ProcedureCode=="1193" |ProcedureCode=="652G" |ProcedureCode=="657G" |ProcedureCode=="697B"|ProcedureCode=="6336" |ProcedureCode=="6337" |ProcedureCode=="6338" |ProcedureCode=="6152" |ProcedureCode=="603C" |ProcedureCode=="655B" |ProcedureCode=="6357" |ProcedureCode=="6358" |ProcedureCode=="6399" |ProcedureCode=="666B" |ProcedureCode=="695D" |ProcedureCode=="699C" |ProcedureCode=="6365" |ProcedureCode=="6366" |ProcedureCode=="696F" |ProcedureCode=="6497" |ProcedureCode=="6613" |ProcedureCode=="6508" |ProcedureCode=="6509" |ProcedureCode=="617I" |ProcedureCode=="6506" |ProcedureCode=="2029" |ProcedureCode=="6538" |ProcedureCode=="671J" |ProcedureCode=="633H" |ProcedureCode=="621G" |ProcedureCode=="680J" 
|ProcedureCode=="672G" |ProcedureCode=="673G" |ProcedureCode=="6559" |ProcedureCode=="6652" |ProcedureCode=="6593" |ProcedureCode=="651C" |ProcedureCode=="633B" |ProcedureCode=="659E" |ProcedureCode=="676D" |ProcedureCode=="678D" |ProcedureCode=="620B" |ProcedureCode=="6562" |ProcedureCode=="6564" |ProcedureCode=="6585" |ProcedureCode=="6766" |ProcedureCode=="6595" |ProcedureCode=="6607" |ProcedureCode=="6608" |ProcedureCode=="627B" |ProcedureCode=="6653" |ProcedureCode=="6654" |ProcedureCode=="6655"|ProcedureCode=="6732" |ProcedureCode=="6733" |ProcedureCode=="6734"|ProcedureCode=="6735" |ProcedureCode=="6795"|ProcedureCode=="6745" |ProcedureCode=="6746" |ProcedureCode=="6748" |ProcedureCode=="6758" |ProcedureCode=="697E" |ProcedureCode=="6761" |ProcedureCode=="6032" |ProcedureCode=="6747" |ProcedureCode=="6749" |ProcedureCode=="668A" |ProcedureCode=="648A" |ProcedureCode=="649A" |ProcedureCode=="6765" |ProcedureCode=="6768" |ProcedureCode=="6771" |ProcedureCode=="637B"|ProcedureCode=="6894", 1,0))

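A further problem is that I need to create multiple groups (e.g., antibiotic [yes/no], dose, route), and I feel like I'm missing a better way that doesn't involve cutting and pasting the variable name and adding quotes each time. Is there a way to build a data frame and use ifelse to assign 1 to any code in that data frame and 0 to all others?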

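Sorry if this is a duplicate; I'm relatively new to R and am having trouble finding the vocabulary to search for what I need. I've looked around (e.g., "Nested ifelse statement") but haven't found what I need.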

【Comments】:

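  • You can use ifelse(ProcedureCode %in% your_list_of_numbers, 1, 0). This is not a general replacement for ifelse, though. You could also look at dplyr::case_when or base R switch.
  • Or, same logic, but avoiding ifelse entirely: mutate(vet, ab = as.integer(ProcedureCode %in% your_list_of_numbers))
  • @s.c. Has your question been answered? If yes, please "accept" the solution; if not, please clarify the question. meta.stackexchange.com/questions/5234/…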
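The %in% suggestion from the comments can be sketched in base R as follows. This is only an illustration under stated assumptions: the toy rows, the group names `ab` and `iv`, and which codes belong to which group are all made up here; only the column name ProcedureCode and a few code strings come from the question.

```r
# Toy stand-in for the real data; only the column name comes from the question.
vet <- data.frame(
  ProcedureCode = c("6160", "2028", "9999", "680D"),
  stringsAsFactors = FALSE
)

# Hypothetical code groups: one character vector per flag column to create.
# The group membership shown here is invented for illustration.
code_groups <- list(
  ab = c("6160", "2028", "6121", "680D"),
  iv = c("2028")
)

# One membership test per group replaces the long chain of == comparisons.
for (flag in names(code_groups)) {
  vet[[flag]] <- as.integer(vet$ProcedureCode %in% code_groups[[flag]])
}

print(vet)
```

Keeping the codes in named vectors means a new group is one more list entry rather than another hand-written ifelse.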

Tags: r if-statement simplify


【Solution 1】:

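Two alternatives, both using a merge/join. One advantage of this approach is that it is much easier to maintain: you have well-structured, manageable tables of procedures instead of (possibly very long) lines of code full of ifelse statements. The comments suggesting %in% also reduce this problem, though there you would be maintaining vectors rather than frames.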

Fake data:

library(dplyr)
library(tidyr)
vet <- tibble(ProcedureCode = c('6160', '2028', '2029'))  # tibble() replaces the deprecated data_frame()
  1. One frame per procedure type. This is manageable, but it can get annoying if you have many different types. Repeat the left_join for each type.

    abs <- tibble(ab = TRUE, ProcedureCode = c('6160', '2028'))
    antis <- tibble(antibiotic = TRUE, ProcedureCode = c('2029'))
    vet %>%
      left_join(abs, by = "ProcedureCode") %>%
      left_join(antis, by = "ProcedureCode") %>%
      # unmatched rows are NA after the joins; turn TRUE/NA into TRUE/FALSE
      mutate(across(c(ab, antibiotic), ~ !is.na(.x)))
    # # A tibble: 3 × 3
    #   ProcedureCode    ab antibiotic
    #           <chr> <lgl>      <lgl>
    # 1          6160  TRUE      FALSE
    # 2          2028  TRUE      FALSE
    # 3          2029 FALSE       TRUE
    

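    The ab=TRUE column (and its counterparts) exists so that there is a column to merge in. Rows with no match get an NA, which is why the !is.na() step is needed to turn T,NA,T into T,F,T.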

    You could even use vectors of procedure codes, such as:

    vet %>%
      left_join(tibble(ab=TRUE, ProcedureCode=vector_of_abs), by = "ProcedureCode") %>%
      ...
    

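    though this only really helps if you already have the codes as vectors; otherwise it comes down to whichever form is easier for you to maintain.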

  2. One frame holding all procedures and their types. This needs only one frame and a single left_join.

    procedures <- tibble::tribble(
      ~ProcedureCode, ~procedure,
      '6160'        , 'ab',
      '2028'        , 'ab',
      '2029'        , 'antibiotic'
    )
    left_join(vet, procedures, by = "ProcedureCode")
    # # A tibble: 3 × 2
    #   ProcedureCode  procedure
    #           <chr>      <chr>
    # 1          6160         ab
    # 2          2028         ab
    # 3          2029 antibiotic
    

    You can keep it like this (if storing it that way makes sense) or spread it so it looks like the others:

    left_join(vet, procedures, by = "ProcedureCode") %>%
      mutate(ignore=TRUE) %>%
      spread(procedure, ignore) %>%
      mutate(across(c(ab, antibiotic), ~ !is.na(.x)))  # funs() is deprecated; NA becomes FALSE
    # # A tibble: 3 × 3
    #   ProcedureCode    ab antibiotic
    #           <chr> <lgl>      <lgl>
    # 1          2028  TRUE      FALSE
    # 2          2029 FALSE       TRUE
    # 3          6160  TRUE      FALSE
    

    (The row order is different here, but the data is the same.)

(I used logicals; it's easy to convert them to 1s and 0s, perhaps with mutate(ab=1L*ab) or mutate(ab=as.integer(ab)).)
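For the dose and route columns the question asks about, the single-table idea can be extended by giving the lookup table one column per attribute. The sketch below is an illustration under assumptions, not the answerer's code: the `drug`, `dose_mg`, and `route` columns and every value in them are invented placeholders, and base R match() stands in for the left_join so that no packages are needed.

```r
# Toy data; "9999" is a made-up code with no entry in the lookup table.
vet <- data.frame(
  ProcedureCode = c("6160", "2028", "9999"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per code, one column per attribute.
# All drug names, doses, and routes below are placeholders, not real mappings.
lookup <- data.frame(
  ProcedureCode = c("6160", "2028"),
  drug          = c("drug_a", "drug_b"),
  dose_mg       = c(250, 1000),
  route         = c("oral", "iv"),
  stringsAsFactors = FALSE
)

# match() returns the lookup row for each code, or NA when there is none.
idx <- match(vet$ProcedureCode, lookup$ProcedureCode)

# Pull every attribute column across in one step; unmatched rows become NA.
vet <- cbind(vet, lookup[idx, c("drug", "dose_mg", "route")])
rownames(vet) <- NULL

# The yes/no antibiotic flag falls out of whether a match was found.
vet$ab <- as.integer(!is.na(idx))

print(vet)
```

A table like this could also be kept in a spreadsheet or CSV file and read in, so the codes never have to be typed into R code at all.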

【Comments】:

    【Solution 2】:

    A base R approach, as a simple option:

    # my dummy data
    df1 <- data.frame("v1" = c(LETTERS[1:10]), "v2" = rep(NA, 10))
    
    # step 1, fill the column with 0 (the else part of your code)
    df1[,'v2'] <- 0
    
    # step 2, create a vector containing ids you want to change
    change_vec <- c("A", "C", "D", "F")
    
    # step 3, use %in% to index and replace with 1
    df1[,'v2'][df1[,'v1'] %in% change_vec] <- 1
    

    In most cases this is sufficient, but be aware of the risk of using an index vector that contains numeric (floating-point) values.

    https://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f
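The risk described in that FAQ entry can be seen in a short sketch. The variable names are chosen here for illustration; the point is that %in% compares doubles exactly, so a value that prints as 0.3 may not match 0.3.

```r
# 0.1 + 0.2 is not stored as exactly 0.3 in floating point.
x <- 0.1 + 0.2

exact_equal    <- x == 0.3                    # exact comparison fails
in_result      <- x %in% 0.3                  # %in% is also exact, so it fails
tolerant_equal <- isTRUE(all.equal(x, 0.3))   # comparison with a tolerance succeeds

# Integer and character codes do not have this problem.
int_ok <- 3L %in% c(1L, 2L, 3L)
chr_ok <- "6160" %in% c("6160", "2028")

print(c(exact_equal, in_result, tolerant_equal, int_ok, chr_ok))
# prints: FALSE FALSE  TRUE  TRUE  TRUE
```

Since the procedure codes in the question include values such as "680D", they are character data, and matching them with %in% is exact.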

    【Comments】:

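    • This is generally fine, though I caution against using %in% with numeric values, per R FAQ 7.31. I suggest integer or (as the OP has) character for your $v1.
    • Thanks for linking to the FAQ; I was wondering why integers are better than numerics.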