【Question title】: Replace characters in a string based on positions from another variable in R
【Posted】: 2021-10-08 02:31:42
【Question description】:

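I have the following data frame xo. For each row, I want to go through the positions listed in positions_of_Ns_to_remove and remove the character at each one. In the example, the resulting new variable should be the sequence with all the Rs removed. In this case I cannot search for the character itself; the removal has to be based on the character's position.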

p <- data.frame(locus = c("1","2","3"), positions_of_Ns_to_remove = c("12,17,43,100","30,60,61,62",NA))
x <- data.frame(locus = c("1","1","2","3"), sequence = c("xxxxxxxxxxxRxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxR","xxxxxxxxxxxRxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxR","xxxxxxxxxxxxxxxxxxxxxxxxxxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxRRRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx","xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"))
xo <- merge(x, p, by = "locus", all.x = TRUE)

> xo
  locus                                                                                             sequence positions_of_Ns_to_remove
1     1 xxxxxxxxxxxRxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxR              12,17,43,100
2     1 xxxxxxxxxxxRxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxR              12,17,43,100
3     2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxRRRxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx               30,60,61,62
4     3 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                      <NA>

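The following works when xo has only 1 row, but not when it has several rows. I would like to use tidyverse functions/pipes and avoid loops where possible.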

xo %>% dplyr::mutate(new_sequence = paste(
  replace(unlist(strsplit(sequence, "")),
          as.integer(unlist(strsplit(positions_of_Ns_to_remove, ","))),
          ""),
  collapse = ""))

What I want:

  locus                                                                                             new_sequence positions_of_Ns_to_remove
1     1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx              12,17,43,100
2     1 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx              12,17,43,100
3     2 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx               30,60,61,62
4     3 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx                      <NA>

【Question discussion】:

    Tags: r string tidyverse str-replace gsub


    【Solution 1】:

    You can build a custom function and apply it to your data:

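    library(dplyr)    # %>%, mutate(), group_by(), select(), ungroup()
    library(stringr)  # str_sub(), str_split()
    
    # Cuts the characters at positions n out of the string.
    remove_pos <- function(string, n) {
      # Work from the highest position down, so that removing a character
      # does not shift the positions that still have to be removed.
      # sort() also drops NA, so a row without positions is left unchanged.
      n <- sort(as.integer(n), decreasing = TRUE)
      len <- nchar(string)
      
      output <- string
      
      for (i in n) {
        # Keep everything before position i and everything after it.
        output <- paste0(
          str_sub(output, start = 1L, end = i - 1L),
          str_sub(output, start = i + 1L, end = len)
        )
      }
      
      return(output)
    }
    
    xo %>% 
      mutate(positions = str_split(positions_of_Ns_to_remove, ",")) %>% 
      # Group by locus AND row number so that every row is its own group;
      # otherwise unlist(positions) pools the positions of all rows of a locus.
      group_by(locus, n = row_number()) %>%
      mutate(
        new_seq = ifelse(!is.na(positions_of_Ns_to_remove), 
                         remove_pos(sequence, unlist(positions)), 
                         sequence)
      ) %>% 
      select(-positions) %>% 
      ungroup()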
    

    This returns

    # A tibble: 5 x 4
      locus sequence                                    positions_of_Ns_to~ new_seq                                  
      <chr> <chr>                                       <chr>               <chr>                                    
    1 1     xxxxxxxxxxxRxxxxRxxxxxxxxxxxxxxxxxxxxxxxxx~ 12,17,43,100        xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx~
    2 1     xxxxxxxxxxxRxxxxRxxxxxxxxxxxxxxxxxxxxxxxxx~ 12,17,43,100        xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx~
    3 2     xxxxxxxxxxxxxxxxxxxxxxxxxxxxxRxxxxxxxxxxxx~ 30,60,61,62         xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx~
    4 3     Rxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx~ 1                   xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx~
    5 4     xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx~ NA                  xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx~
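
Since the question asks to avoid explicit loops where possible, here is a minimal base R sketch of a per-row alternative. It is not part of the answer above: the helper name `drop_positions` and the small `demo` data frame are made up for illustration. The idea is to split one string into single characters and drop the listed positions with negative indexing, so the order of the positions does not matter.

```r
# Hypothetical helper (not from the answer above): remove the characters at
# the comma-separated, 1-based positions in `pos` from a single string `s`.
drop_positions <- function(s, pos) {
  if (is.na(pos)) return(s)                     # no positions listed: keep as is
  idx <- as.integer(strsplit(pos, ",")[[1]])    # "3,6" -> c(3L, 6L)
  chars <- strsplit(s, "")[[1]]                 # one element per character
  idx <- idx[idx >= 1 & idx <= length(chars)]   # ignore out-of-range positions
  if (length(idx) == 0) return(s)               # chars[-integer(0)] would be empty
  paste(chars[-idx], collapse = "")
}

# Made-up data in the same shape as xo, kept short so it is easy to check
demo <- data.frame(
  sequence = c("abRdeR", "abRdeR", "abcdef"),
  positions_of_Ns_to_remove = c("3,6", "6,3", NA),
  stringsAsFactors = FALSE
)

# mapply() calls the helper once per row, pairing each sequence with its positions
demo$new_sequence <- mapply(drop_positions,
                            demo$sequence,
                            demo$positions_of_Ns_to_remove,
                            USE.NAMES = FALSE)
demo$new_sequence
```

Inside a pipeline the same helper could presumably be called from `mutate()` with `purrr::map2_chr(sequence, positions_of_Ns_to_remove, drop_positions)`, which also processes one row at a time and so needs no grouping by row number.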
    

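    【Discussion】:

    • When I run this solution, I am left with only 93 characters in new_seq for locus 1. That is incorrect. There should be 100 - 4 = 96. Any ideas?
    • If I group by row number and locus, it seems to work correctly. Thanks!
    • @rt11 You are absolutely right. I missed this. I have added it to the answer. If you are satisfied, feel free to accept the answer
    • Did you also add n
    • Yes, they do not always end up in order, so I also added something similar to my version. I think this is all on the up and up. Thanks for your ideas!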
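
The 93 characters reported in the discussion are consistent with the positions of both locus 1 rows being pooled when the data is grouped by locus alone. The sketch below reproduces that arithmetic under stated assumptions: `remove_pos_base` is a base R stand-in for the answer's `remove_pos()` written only for this illustration, and the sequence is rebuilt from the question's description (100 characters with R at positions 12, 17, 43 and 100).

```r
# Base R stand-in for the answer's remove_pos(), for illustration only
remove_pos_base <- function(string, n) {
  n <- sort(as.integer(n), decreasing = TRUE)   # highest position first
  for (i in n) {
    string <- paste0(substr(string, 1, i - 1),
                     substr(string, i + 1, nchar(string)))
  }
  string
}

# Rebuild a locus 1 sequence: 100 characters with R at 12, 17, 43 and 100
chars <- rep("x", 100)
chars[c(12, 17, 43, 100)] <- "R"
seq1 <- paste(chars, collapse = "")

per_row <- c(12, 17, 43, 100)    # positions when each row is its own group
pooled  <- c(per_row, per_row)   # positions when both locus 1 rows share a group

nchar(remove_pos_base(seq1, per_row))   # 96
nchar(remove_pos_base(seq1, pooled))    # 93
```

With the pooled positions, 12, 17 and 43 are each removed twice (six characters), while the second removal at position 100 does nothing because the string is already only 99 characters long. That gives 100 - 7 = 93, which is why making every row its own group restores the expected 96.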