[Question title]: R: Extract largest number from character string with mixed digits and letters
[Posted]: 2021-03-15 04:22:41
[Question description]:

Preferably, I am looking for a dplyr solution.

I have

> str(p)
'data.frame':   25 obs. of  1 variable:
 $ intram_size: chr  "5" "4,7 x 6,6 mm" "4x6x7 mm" "5" ...

> head(p)
   intram_size
1            5
2 4,7 x 6,6 mm
3     4x6x7 mm
4            5
5         4x11
6          1x4

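p$intram_size denotes the two-dimensional measurement of a tumor. I need to extract the largest number, i.e. the largest diameter measured. One complication is that a comma (,) has been used as the decimal separator.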

Expected output

> head(p)
   intram_size       new
1            5         5
2 4,7 x 6,6 mm       6.6
3     4x6x7 mm         7 
4            5         5
5         4x11        11 
6          1x4         4

Sample data

p <- structure(list(intram_size = c("5", "4,7 x 6,6 mm", "4x6x7 mm", 
"5", "4x11", "1x4", "7x10", "8", "3", "7", "7x4x3", "10x5", "8", 
"7", "11", "7", "10", "5", "13", "5", "3,5", "10", "2,5", "7", 
"11 x 6 x 4")), row.names = c(NA, 25L), class = "data.frame")

[Question discussion]:

    Tags: r string dataframe dplyr character


    [Solution 1]:
    1. Replace the commas with dots.
    2. Extract all the numbers from the string.
    3. Convert them to numeric and return the maximum.
    library(tidyverse)
    
    p %>%
      mutate(intram_size = str_replace_all(intram_size, ',', '.'), 
             new = str_extract_all(intram_size, '\\d+(\\.\\d+)?'), 
             new = map_dbl(new, ~max(as.numeric(.x))))
    
    #    intram_size  new
    #1             5  5.0
    #2  4.7 x 6.6 mm  6.6
    #3      4x6x7 mm  7.0
    #4             5  5.0
    #5          4x11 11.0
    #6           1x4  4.0
    #7          7x10 10.0
    #8             8  8.0
    #9             3  3.0
    #10            7  7.0
    #11        7x4x3  7.0
    #12         10x5 10.0
    #13            8  8.0
    #14            7  7.0
    #15           11 11.0
    #16            7  7.0
    #17           10 10.0
    #18            5  5.0
    #19           13 13.0
    #20            5  5.0
    #21          3.5  3.5
    #22           10 10.0
    #23          2.5  2.5
    #24            7  7.0
    #25   11 x 6 x 4 11.0
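
The same three steps can also be sketched in base R, without loading any packages. This is only an illustrative variant, not part of the original answer: the helper name `max_number` is made up here, and it assumes every number is written as digits with an optional decimal part.

```r
# A few values taken from the question's sample data
x <- c("5", "4,7 x 6,6 mm", "4x6x7 mm", "4x11", "3,5", "11 x 6 x 4")

# Hypothetical helper (the name does not come from the original answers)
max_number <- function(s) {
  # 1. use a dot as the decimal separator
  s <- gsub(",", ".", s, fixed = TRUE)
  # 2. extract every number, with an optional decimal part
  nums <- regmatches(s, gregexpr("[0-9]+([.][0-9]+)?", s))
  # 3. convert to numeric and keep the largest; NA when nothing matched
  vapply(nums,
         function(v) if (length(v)) max(as.numeric(v)) else NA_real_,
         numeric(1))
}

max_number(x)
```

Because the helper is vectorised over its input, it could also be used inside a dplyr pipeline, e.g. `p %>% mutate(new = max_number(intram_size))`.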
    

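    [Discussion]:

    • Thanks @Ronak. You make it look so easy. Much appreciated.
    [Solution 2]:

    Using dplyr (to add and modify columns) and stringr (to extract patterns), the process could look like this: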

    # sample data
    p <- structure(list(intram_size = c("5", "4,7 x 6,6 mm", "4x6x7 mm", 
                                        "5", "4x11", "1x4", "7x10", "8", "3", "7", "7x4x3", "10x5", "8", 
                                        "7", "11", "7", "10", "5", "13", "5", "3,5", "10", "2,5", "7", 
                                        "11 x 6 x 4")), row.names = c(NA, 25L), class = "data.frame")
    library(dplyr)
    library(stringr)
    mod <- p %>% 
      # replace decimal separator
      mutate(intram_size = str_replace_all(intram_size, ",", "."),
             # extract numbers
             split = str_extract_all(intram_size, "[0-9\\.]+")) %>% 
      rowwise() %>% 
      # convert to right data type
      mutate(num = list(as.numeric(split)),
             # find maximum
             max = max(num, na.rm = TRUE))
    
    head(mod)
    #> # A tibble: 6 x 4
    #> # Rowwise: 
    #>   intram_size  split     num         max
    #>   <chr>        <list>    <list>    <dbl>
    #> 1 5            <chr [1]> <dbl [1]>   5  
    #> 2 4.7 x 6.6 mm <chr [2]> <dbl [2]>   6.6
    #> 3 4x6x7 mm     <chr [3]> <dbl [3]>   7  
    #> 4 5            <chr [1]> <dbl [1]>   5  
    #> 5 4x11         <chr [2]> <dbl [2]>  11  
    #> 6 1x4          <chr [2]> <dbl [2]>   4
    

    Created on 2020-12-03 by the reprex package (v0.3.0)
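
The two solutions use slightly different extraction patterns, and the difference only shows up on inputs that contain a stray period. The sketch below uses a made-up string that is not in the question's data, and base R's `regmatches()`/`gregexpr()` in place of `str_extract_all()`. The pattern `[0-9.]+` is assumed here to behave like the character class used in this solution, and the stricter pattern like the one in solution 1.

```r
# Hypothetical input with a stray period (not from the question's data)
s <- "approx. 4.7 x 6.6 mm"

# Character-class pattern: any run of digits and dots, so the lone "." matches too
loose <- regmatches(s, gregexpr("[0-9.]+", s))[[1]]

# Digits with an optional decimal part: the lone "." is skipped
strict <- regmatches(s, gregexpr("[0-9]+([.][0-9]+)?", s))[[1]]

loose   # contains ".", "4.7" and "6.6"
strict  # contains "4.7" and "6.6"

# as.numeric(".") is NA (with a warning), which is where na.rm = TRUE helps
max(suppressWarnings(as.numeric(loose)), na.rm = TRUE)
max(as.numeric(strict))
```

Both calls return 6.6 for this string. For the sample data in the question the two patterns extract the same numbers.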

    [Discussion]:

    • @Ronak beat me to it, but I guess this is a slightly longer solution that needs 2 packages instead of 3, so I'll leave it here :)