【Question title】: str_split_fixed in an if/else statement: unexpected results
【Posted】: 2019-09-25 18:41:03
【Question】:

I have data in a data frame in the following form:

structure(list(O2Range = c("112 MAX", "16/19", "16/190", "12 MAX", 
NA, NA, NA, NA, NA, NA, NA, "16/20", "18/22", NA, "16/20", NA, 
"11/13", NA, "16/190", NA)), row.names = c(NA, -20L), class = c("tbl_df", 
"tbl", "data.frame"))

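As you can see, the low and high oxygen readings are separated by "/" within the column, but sometimes the value is given as a number followed by "MAX" (e.g. 112 MAX).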

I am trying to split this column into two new columns as follows:

library(tidyverse)
data$O2High <- if (str_detect(data$O2Range, "/")) {str_split_fixed(data$O2Range, fixed("/"), 2)[, 2]
} else {str_split_fixed(data$O2Range, fixed(" "), 2)[, 2]}
data$O2Low <- if (str_detect(data$O2Range, "/")) {str_split_fixed(data$O2Range, fixed("/"), 2)[, 1]
        } else {str_split_fixed(data$O2Range, fixed(" "), 2)[, 1]}

However, the result is not what I expected:

structure(list(O2High = c("MAX", "", "", "MAX", "", "", "", "", 
"", "", "", "", "", "", "", "", "", "", "", ""), O2Low = c("112", 
"16/19", "16/190", "12", "", "", "", "", "", "", "", "16/20", 
"18/22", "", "16/20", "", "11/13", "", "16/190", "")), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

Something seems to be wrong with my if/else statement, but I can't work out what. Any ideas?

Expected output:

structure(list(O2High = list("112", "19", "190", "12", NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_, NA_character_, "20", "22", NA_character_, 
    "20", NA_character_, "13", NA_character_, "190", NA_character_), 
    O2Low = list("MAX", "16", "16", "MAX", NA_character_, 
        NA_character_, NA_character_, NA_character_, NA_character_, 
        NA_character_, NA_character_, "16", "18", NA_character_, 
        "16", NA_character_, "11", NA_character_, "16", NA_character_)), row.names = c(NA, 
-20L), class = c("tbl_df", "tbl", "data.frame"))

Thanks, Chris

【Comments】:

  • Can you show the expected output?

Tags: r if-statement tidyverse stringr


【Solution 1】:

Not sure how you want to handle MAX, but...

library(dplyr)    # provides %>% and mutate()
library(stringi)
as.data.frame(data) %>% 
     mutate(o2High = stri_extract_all_regex(O2Range, "(?<=/)[0-9]+"),
            o2Low = stri_extract_all_regex(O2Range, "[0-9]+(?=\\/)"))

   O2Range o2High o2Low
1  112 MAX     NA    NA
2    16/19     19    16
3   16/190    190    16
4   12 MAX     NA    NA
5     <NA>     NA    NA
6     <NA>     NA    NA
7     <NA>     NA    NA
8     <NA>     NA    NA
9     <NA>     NA    NA
10    <NA>     NA    NA
11    <NA>     NA    NA
12   16/20     20    16
13   18/22     22    18
14    <NA>     NA    NA
15   16/20     20    16
16    <NA>     NA    NA
17   11/13     13    11
18    <NA>     NA    NA
19  16/190    190    16
20    <NA>     NA    NA

# Variant that also captures the number in front of " MAX" as the high value
as.data.frame(data) %>% 
    mutate(
        o2High = stri_extract_all_regex(O2Range, "(?<=/)[0-9]+|[0-9]+(?=\\sMAX)"),
        o2Low = stri_extract_all_regex(O2Range, "[0-9]+(?=\\/)")
    )
   O2Range o2High o2Low
1  112 MAX    112    NA
2    16/19     19    16
3   16/190    190    16
4   12 MAX     12    NA
5     <NA>     NA    NA
6     <NA>     NA    NA
7     <NA>     NA    NA
8     <NA>     NA    NA
9     <NA>     NA    NA
10    <NA>     NA    NA
11    <NA>     NA    NA
12   16/20     20    16
13   18/22     22    18
14    <NA>     NA    NA
15   16/20     20    16
16    <NA>     NA    NA
17   11/13     13    11
18    <NA>     NA    NA
19  16/190    190    16
20    <NA>     NA    NA
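
On why the original if/else misbehaves: `if` expects a single TRUE or FALSE, not one value per row. Older R versions used only the first element of the condition (with a warning), so every row went through the branch chosen by row 1, here the split on the space; since R 4.2.0 a condition longer than one is an error. Below is a minimal sketch of a vectorised alternative with `ifelse()`, assuming the number before MAX should count as the high reading and the low reading should then be NA. The names `x`, `has_slash`, `by_slash`, `by_space`, `o2_low` and `o2_high` are illustrative and not from the question.

```r
library(stringr)

# A few of the values from the question (illustrative sample)
x <- c("112 MAX", "16/19", NA, "12 MAX", "16/190")

# One TRUE/FALSE/NA per element, unlike the single value if () needs
has_slash <- str_detect(x, fixed("/"))

# Split every element both ways; ifelse() then picks per element
by_slash <- str_split_fixed(x, fixed("/"), 2)
by_space <- str_split_fixed(x, fixed(" "), 2)

# Assumption: "<number> MAX" rows have no low reading
o2_low  <- ifelse(has_slash, by_slash[, 1], NA_character_)
# Assumption: the number in front of MAX is the high reading
o2_high <- ifelse(has_slash, by_slash[, 2], by_space[, 1])
```

Because `has_slash` is NA for missing inputs, `ifelse()` returns NA for those rows without looking at either split, so the result does not depend on how `str_split_fixed()` treats NA.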

【Comments】:

    【Solution 2】:

    Using base R, you can do this:

    # Capture groups fill the prototype columns in order:
    # group 1 (the number before "/") is the low value, group 2 the high value
    prot <- data.frame(low=numeric(),high=numeric())
    cbind(df, strcapture("(?:(\\d+)/)?(\\d+)(?: MAX|$)", df$O2Range, prot))
    
       O2Range low high
    1  112 MAX  NA  112
    2    16/19  16   19
    3   16/190  16  190
    4   12 MAX  NA   12
    5     <NA>  NA   NA
    6     <NA>  NA   NA
    7     <NA>  NA   NA
    8     <NA>  NA   NA
    9     <NA>  NA   NA
    10    <NA>  NA   NA
    11    <NA>  NA   NA
    12   16/20  16   20
    13   18/22  18   22
    14    <NA>  NA   NA
    15   16/20  16   20
    16    <NA>  NA   NA
    17   11/13  11   13
    18    <NA>  NA   NA
    19  16/190  16  190
    20    <NA>  NA   NA
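
To illustrate how `strcapture()` assigns columns: the capture groups fill the prototype's columns in order, so the column names in the prototype decide which number ends up labelled low or high. A small sketch on three sample values, assuming the value before the slash is the low reading; `proto` and `res` are illustrative names.

```r
# Prototype: the column order must follow the capture-group order
proto <- data.frame(low = numeric(), high = numeric())

# Group 1 = optional number before "/",
# group 2 = number right before " MAX" or the end of the string
res <- strcapture("(?:(\\d+)/)?(\\d+)(?: MAX|$)",
                  c("16/19", "112 MAX", NA),
                  proto)
res
```

A group that does not take part in the match comes back as an empty string, which becomes NA when converted to numeric; that is why the "MAX" rows have NA in the first column.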
    

    【Comments】:

      【Solution 3】:

      We can use

      library(dplyr)
      library(stringr)
      library(tidyr)
      data %>% 
        separate(O2Range, into = c("O2Low", "O2High"), sep="/", remove = FALSE) %>%
        mutate(O2Low = str_remove(O2Low, "\\d+\\s+(?=MAX)"),
               O2High = case_when(str_detect(O2Range, "MAX") ~ 
                     str_extract(O2Range, "\\d+") , TRUE ~ O2High)) %>%
        select(-O2Range)
      # A tibble: 20 x 2
      #   O2Low O2High
      #   <chr> <chr> 
      # 1 MAX   112   
      # 2 16    19    
      # 3 16    190   
      # 4 MAX   12    
      # 5 <NA>  <NA>  
      # 6 <NA>  <NA>  
      # 7 <NA>  <NA>  
      # 8 <NA>  <NA>  
      # 9 <NA>  <NA>  
      #10 <NA>  <NA>  
      #11 <NA>  <NA>  
      #12 16    20    
      #13 18    22    
      #14 <NA>  <NA>  
      #15 16    20    
      #16 <NA>  <NA>  
      #17 11    13    
      #18 <NA>  <NA>  
      #19 16    190   
      #20 <NA>  <NA>  
      

      【Comments】:

      • Thank you. To clarify: the "Low" column should hold only a single value (i.e. 16, with 19 in "High", for the second row).