【Question title】: Repeat rows by pattern matching and removing pattern of the string
【Posted】: 2021-10-04 19:50:41
【Question】:

I have a table that looks like this:

Sequence  Modification                                       Modified.Sequence
ABCDEF    Acetyl (Protein N-term),Oxidation (M),Methyl (KR)  AB(Acetyl (Protein N-term))CD(Oxidation (M))EF(Methyl (KR))
ABCDEFGH  Oxidation (M)                                      ABCDEF(Oxidation (M))GH
DEFGH     Acetyl (Protein N-term), Methyl (KR)               ABC(Acetyl (Protein N-term))DEF(Methyl (KR))GH

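I need each row to contain only one modification. To get there, I have to repeat each sequence N times, where N is the number of modifications on that sequence, and remove every other modification from the modified sequence.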

This is the expected output:

Sequence  Modification             Modified.Sequence
ABCDEF    Acetyl (Protein N-term)  AB(Acetyl (Protein N-term))CDEF
ABCDEF    Oxidation (M)            ABCD(Oxidation (M))EF
ABCDEF    Methyl (KR)              ABCDEF(Methyl (KR))
ABCDEFGH  Oxidation (M)            ABCDEF(Oxidation (M))GH
DEFGH     Acetyl (Protein N-term)  ABC(Acetyl (Protein N-term))DEFGH
DEFGH     Methyl (KR)              ABCDEF(Methyl (KR))GH

df = data.frame(
        Sequence = c('ABCDEF','ABCDEFGH','DEFGH'),
        Modification = c('Acetyl (Protein N-term),Oxidation (M),Methyl (KR)','Oxidation (M)','Acetyl (Protein N-term), Methyl (KR)'),
        Modified.Sequence = c('AB(Acetyl (Protein N-term))CD(Oxidation (M))EF(Methyl (KR))','ABCDEF(Oxidation (M))GH',
        'ABC(Acetyl (Protein N-term))DEF(Methyl (KR))GH')
)

The real data can contain more modifications than the ones in this reprex.

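【Comments】:

  • I'm trying, but SO won't let me submit the edit...
  • @akrun Sort of solved. It throws an error if I don't put ``` ``` around the table.
  • The subtraction part is not clear. For the first part you can use library(tidyr);library(dplyr);df %>% separate_rows(Modification, Modified.Sequence, sep = ",\\s*|(?<=\\))(?=[A-Z]+\\()")
  • Thanks @akrun, is the subtraction part clear now?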

Tags: r gsub


【Solution 1】:

You can use the following solution:

library(dplyr)
library(tidyr)
library(purrr)
library(readr)

df %>%
  separate_rows(Modification, sep = ",\\s?") %>%
  rowwise() %>%
  mutate(Mod = parse_number(Modification), 
         Modified.Sequence = map2_chr(Mod, Modified.Sequence, ~ gsub(paste0("\\(Mod\\s+\\([^", .x, "]\\)\\)"), 
                                                 "", .y))) %>%
  select(!Mod)

# A tibble: 6 x 3
# Rowwise: 
  Sequence Modification Modified.Sequence
  <chr>    <chr>        <chr>            
1 ABCDEF   Mod (1)      AB(Mod (1))CDEF  
2 ABCDEF   Mod (3)      ABCD(Mod (3))EF  
3 ABCDEF   Mod (2)      ABCDEF(Mod (2))  
4 ABCDEFGH Mod (3)      ABCDEF(Mod3))GH  
5 DEFGH    Mod (1)      ABC(Mod (1))DEFGH
6 DEFGH    Mod (2)      ABCDEF(Mod (2))GH
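
To see why this pattern is tied to the original reprex, it may help to run it on a single string. The sketch below assumes the earlier version of the example data, in which the modifications were named `Mod (1)`, `Mod (2)` and `Mod (3)`; the input string is reconstructed from the output above and is an assumption, not something taken from the question.

```r
# Assumed input, reconstructed from the output shown above (hypothetical).
modified <- "AB(Mod (1))CD(Mod (3))EF(Mod (2))"
keep <- 1  # the number parse_number() would extract from "Mod (1)"

# Same pattern as in the answer: match "(Mod (x))" where x is any single
# character other than the number to keep, and delete the match.
pattern <- paste0("\\(Mod\\s+\\([^", keep, "]\\)\\)")
result <- gsub(pattern, "", modified)
print(result)
```

Because the pattern hard-codes the word `Mod` and a one-character identifier, it does not match names such as `Oxidation (M)`.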

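【Comments】:

  • Thanks a lot, @anoushiravan-r. However, your answer is very specific to the reproducible example, and I can't extrapolate it to my real dataset. I'll edit the example now.
  • Sure, just post more general sample data and I'll edit my answer as soon as I have time.
  • Thanks! Edit done :) @anoushiravan-r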
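
For the edited data with arbitrary modification names, one possible approach is sketched below in base R only, so that it runs without extra packages. It is not the answerer's method: `split_modifications()` is a hypothetical helper name, and the sketch assumes that each modification appears in `Modified.Sequence` exactly as its name wrapped in one pair of parentheses. Matching with `fixed = TRUE` avoids having to escape the parentheses.

```r
df <- data.frame(
  Sequence = c("ABCDEF", "ABCDEFGH", "DEFGH"),
  Modification = c("Acetyl (Protein N-term),Oxidation (M),Methyl (KR)",
                   "Oxidation (M)",
                   "Acetyl (Protein N-term), Methyl (KR)"),
  Modified.Sequence = c(
    "AB(Acetyl (Protein N-term))CD(Oxidation (M))EF(Methyl (KR))",
    "ABCDEF(Oxidation (M))GH",
    "ABC(Acetyl (Protein N-term))DEF(Methyl (KR))GH"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output row per (sequence, modification) pair.
split_modifications <- function(df) {
  # One character vector of modification names per original row.
  mods <- strsplit(df$Modification, ",\\s*")
  # Index of the original row that each output row comes from.
  origin <- rep(seq_len(nrow(df)), lengths(mods))
  out <- df[origin, ]
  out$Modification <- unlist(mods)
  for (i in seq_len(nrow(out))) {
    # Delete every other modification of the same original row,
    # matching "(name)" literally rather than as a regex.
    for (m in setdiff(mods[[origin[i]]], out$Modification[i])) {
      out$Modified.Sequence[i] <- gsub(paste0("(", m, ")"), "",
                                       out$Modified.Sequence[i], fixed = TRUE)
    }
  }
  rownames(out) <- NULL
  out
}

result <- split_modifications(df)
print(result)
```

Only the modifications listed for the same original row are removed, so the helper does not need a global list of modification names.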

【Solution 2】:

I found an answer to my own question, though there may be a more concise way:

library(dplyr)  # provides %>%

# One row per modification
df <- df %>% tidyr::separate_rows(Modification, sep = ',\\s?')

# All distinct modifications, each wrapped in parentheses
Modifications <- paste0('(', unique(df$Modification), ')')

for (ii in seq_len(nrow(df))) {

    # Modifications to remove: every modification except the one
    # that belongs to this row
    modsToRemove <- Modifications[! Modifications %in%
                     paste0('(', df$Modification[ii], ')')]

    # Escape the parentheses with backslashes so they match literally
    modsToRemove <- gsub(pattern = '\\(', replacement = '\\\\(', modsToRemove)
    modsToRemove <- gsub(pattern = '\\)', replacement = '\\\\)', modsToRemove)

    # Collapse the modifications to remove into a single regex alternation
    modsToRemove <- paste(unlist(modsToRemove), collapse = '|')

    # Remove every modification from the modified sequence except the one
    # listed in df$Modification for this row
    df$Modified.Sequence[ii] <- gsub(pattern = modsToRemove, '',
                                     x = df$Modified.Sequence[ii])
}
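
The loop above escapes only parentheses. If modification names in the real data can contain other regex metacharacters (for example `.`, `+` or `[`), a broader escape step may be safer. The following is a minimal sketch of that idea; `escape_regex()` is a hypothetical helper written for this example, not a function from base R or any package.

```r
# Hypothetical helper: prefix every regex metacharacter with a backslash.
escape_regex <- function(x) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", x)
}

mods <- c("Oxidation (M)", "Methyl (KR)")

# Wrap each name in literal parentheses, escape it, and join with "|".
pattern <- paste(escape_regex(paste0("(", mods, ")")), collapse = "|")

result <- gsub(pattern, "",
               "AB(Acetyl (Protein N-term))CD(Oxidation (M))EF(Methyl (KR))")
print(result)
```

An alternative that needs no escaping at all is to call `gsub()` once per modification with `fixed = TRUE`, at the cost of giving up the single combined pattern.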

【Comments】: