Repeat rows by pattern matching and removing pattern of the string: answers

【Question title】: Repeat rows by pattern matching and removing pattern of the string
【Posted】: 2021-10-04 19:50:41
【Question description】:

I have a table like this:

Sequence	Modification	Modified.Sequence
ABCDEF	Acetyl (Protein N-term),Oxidation (M),Methyl (KR)	AB(Acetyl (Protein N-term))CD(Oxidation (M))EF(Methyl (KR))
ABCDEFGH	Oxidation (M)	ABCDEF(Oxidation (M))GH
DEFGH	Acetyl (Protein N-term), Methyl (KR)	ABC(Acetyl (Protein N-term))DEF(Methyl (KR))GH

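I need each row to contain only one modification. To get there, I have to repeat each sequence N times, where N is the number of modifications of that sequence, and then remove all the other modifications from the modified sequence so that only the row's own modification remains.

This is the expected output: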

Sequence	Modification	Modified.Sequence
ABCDEF	Acetyl (Protein N-term)	AB(Acetyl (Protein N-term))CDEF
ABCDEF	Oxidation (M)	ABCD(Oxidation (M))EF
ABCDEF	Methyl (KR)	ABCDEF(Methyl (KR))
ABCDEFGH	Oxidation (M)	ABCDEF(Oxidation (M))GH
DEFGH	Acetyl (Protein N-term)	ABC(Acetyl (Protein N-term))DEFGH
DEFGH	Methyl (KR)	ABCDEF(Methyl (KR))GH

df = data.frame(
        Sequence = c('ABCDEF','ABCDEFGH','DEFGH'),
        Modification = c('Acetyl (Protein N-term),Oxidation (M),Methyl (KR)','Oxidation (M)','Acetyl (Protein N-term), Methyl (KR)'),
        Modified.Sequence = c('AB(Acetyl (Protein N-term))CD(Oxidation (M))EF(Methyl (KR))','ABCDEF(Oxidation (M))GH',
        'ABC(Acetyl (Protein N-term))DEF(Methyl (KR))GH')
)

The real data can contain more modifications than this reprex.
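
The first step described in the question (repeating each sequence N times, once per modification) can be sketched in base R. This is only a minimal sketch, assuming the modifications are always separated by a comma with optional whitespace. The names `mods`, `n_mods`, and `long` are introduced here for illustration, and the Modified.Sequence column is left out for brevity.

```r
# Sample data from the question (three peptides, comma-separated modifications).
df <- data.frame(
  Sequence = c("ABCDEF", "ABCDEFGH", "DEFGH"),
  Modification = c("Acetyl (Protein N-term),Oxidation (M),Methyl (KR)",
                   "Oxidation (M)",
                   "Acetyl (Protein N-term), Methyl (KR)"),
  stringsAsFactors = FALSE
)

# Split each Modification string on a comma plus optional whitespace.
mods <- strsplit(df$Modification, ",\\s*")

# N = number of modifications per sequence.
n_mods <- lengths(mods)

# Repeat each row N times, then put one modification on each repeated row.
long <- df[rep(seq_len(nrow(df)), times = n_mods), , drop = FALSE]
long$Modification <- unlist(mods)
rownames(long) <- NULL

print(long)
```

Any other column in the data frame is simply copied along with its row, so the remaining work is the removal of the unwanted tags from Modified.Sequence.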

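【Comments】:

I'm trying, but SO won't let me add the edit...
@akrun Sort of solved. It threw an error unless I put ``` ``` around the table.
The subtraction part is not clear. For the first part you can use library(tidyr);library(dplyr);df %>% separate_rows(Modification, Modified.Sequence, sep = ",\\s*|(?<=\\))(?=[A-Z]+\\()")
Thanks @akrun. Is the subtraction part clear now?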

Tags: r gsub

【Solution 1】:

You can use the following solution:

library(dplyr)
library(tidyr)
library(purrr)
library(readr)

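# Note: this targets the earlier version of the example, whose labels were
# "Mod (1)", "Mod (2)", ... (see the output below).
df %>%
  # One row per modification.
  separate_rows(Modification, sep = ",\\s?") %>%
  rowwise() %>%
  # Extract the label number, then delete every "(Mod (k))" tag whose
  # single-character k differs from that number.
  mutate(Mod = parse_number(Modification), 
         Modified.Sequence = map2_chr(Mod, Modified.Sequence, ~ gsub(paste0("\\(Mod\\s+\\([^", .x, "]\\)\\)"), 
                                                 "", .y))) %>%
  select(!Mod)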

# A tibble: 6 x 3
# Rowwise: 
  Sequence Modification Modified.Sequence
  <chr>    <chr>        <chr>            
1 ABCDEF   Mod (1)      AB(Mod (1))CDEF  
2 ABCDEF   Mod (3)      ABCD(Mod (3))EF  
3 ABCDEF   Mod (2)      ABCDEF(Mod (2))  
4 ABCDEFGH Mod (3)      ABCDEF(Mod3))GH  
5 DEFGH    Mod (1)      ABC(Mod (1))DEFGH
6 DEFGH    Mod (2)      ABCDEF(Mod (2))GH
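
To see what the pattern in this answer does, it can help to run it on a single string. The following is a minimal sketch using the "Mod (n)" labels from the output above; the string `y` with a two-digit label is an assumption made up for illustration and does not come from the original data.

```r
x <- "AB(Mod (1))CD(Mod (3))EF(Mod (2))"

# Keep label 1: the class [^1] matches any single character other than "1",
# so the (Mod (3)) and (Mod (2)) tags are deleted.
keep1 <- gsub("\\(Mod\\s+\\([^1]\\)\\)", "", x)
print(keep1)

# Assumed multi-character label, not from the original data:
# [^1] matches exactly one character, so "(Mod (12))" is not removed.
y <- "AB(Mod (1))CD(Mod (12))"
keep1_multi <- gsub("\\(Mod\\s+\\([^1]\\)\\)", "", y)
print(keep1_multi)
```

Because the negated class only covers one character inside the inner parentheses, this approach relies on single-character labels, which is one reason it does not carry over directly to free-text modification names.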

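【Comments】:

Thank you very much, @anoushiravan-r. However, your answer is very specific to the reproducible example, and I can't extrapolate it to my real dataset. I'll edit the example now.
Sure, just post more general sample data and I'll edit my answer as soon as I have time.
Thanks! Edit done :) @anoushiravan-r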

【Solution 2】:

I found an answer to my own question, though there may be a more concise way:

library(dplyr)  # for %>%

# One row per modification.
df <- df %>% tidyr::separate_rows(Modification, sep = ',\\s?')

for (ii in seq_len(nrow(df))) {

    # All distinct modifications, wrapped in parentheses as they appear
    # in Modified.Sequence.
    Modifications <- paste0('(', unique(df$Modification), ')')

    # Modifications to remove: every modification except the one that
    # belongs to this row.
    modsToRemove <- Modifications[! Modifications %in%
                     paste0('(', df$Modification[ii], ')')]

    # Escape the parentheses so the regex matches them literally.
    modsToRemove <- gsub(pattern = '\\(', replacement = '\\\\(', modsToRemove)
    modsToRemove <- gsub(pattern = '\\)', replacement = '\\\\)', modsToRemove)

    # Collapse the modifications to remove into a single alternation pattern.
    modsToRemove <- paste(modsToRemove, collapse = '|')

    # Remove every modification from the modified sequence except the one
    # listed in df$Modification for this row.
    df$Modified.Sequence[ii] <- gsub(pattern = modsToRemove, replacement = '',
                                     x = df$Modified.Sequence[ii])
}
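
One possible variation on the escaping step is to skip the regex entirely and match each tag as a literal string with `fixed = TRUE`. The sketch below is not part of the original answer; `keep_one_mod` is a hypothetical helper name, and it assumes every tag appears in the modified sequence exactly as the modification name wrapped in parentheses.

```r
# Hypothetical helper: keep only the tag for `keep` in `modified`,
# deleting the tags of all other known modifications.
keep_one_mod <- function(modified, keep, all_mods) {
  for (m in setdiff(all_mods, keep)) {
    # fixed = TRUE treats "(m)" as a literal string, so no escaping is needed.
    modified <- gsub(paste0("(", m, ")"), "", modified, fixed = TRUE)
  }
  modified
}

all_mods <- c("Acetyl (Protein N-term)", "Oxidation (M)", "Methyl (KR)")
x <- "AB(Acetyl (Protein N-term))CD(Oxidation (M))EF(Methyl (KR))"

result <- keep_one_mod(x, "Oxidation (M)", all_mods)
print(result)
```

Inside the loop above, the same idea would amount to calling such a helper with `df$Modified.Sequence[ii]`, `df$Modification[ii]`, and `unique(df$Modification)`.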

【Comments】: