已编辑完全修改了第三次,以更好地解决问题!顺便说一句,这里的关键解决方案(以三个辅助函数的形式)不需要Biostrings 包。
据我了解,短 DNA 查询序列需要与大量参考 DNA 序列进行匹配。这里的转折点是在参考 DNA 序列中搜索任意数量的 DNA 查询序列上的插入或缺失形式的变异。
Biostrings 包中的函数 vmatchPattern() 可以识别给定模式与一组参考序列中任意数量的不匹配的匹配。此外,vmatchPattern() 可以识别给定模式与可能的插入或删除(indel)的匹配。但是,与 matchPattern() 不同的是,vmatchPattern() 不能同时进行。
这里寻求的解决方案是生成 DNA 查询序列的变体,然后可以将其传递给搜索功能,例如 grepl() 或此处建议的 vmatchPattern()。
此处发布的修订解决方案包括三个功能。 makeDel() 将生成具有任意数量删除的短序列的所有可能变体。伴随函数makeIns() 将生成短序列的变体,其中插入指定为symbol 中的IUPAC 符号。 makeSub() 将使用symbol 中的 IUPAC 代码指定的碱基进行所需的替换。这种方法可以生成其他碱基的所有可能组合,允许字符串直接用于模式匹配函数,包括vmatchPaterrn。
如果要使用它,这可以确保包Biostrings 可用。此代码适用于 3.60 及更高版本的 R 版本。
if (!require("Biostrings")) {
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("Biostrings")
}
library(Biostrings)
现在是一些测试数据。将使用原始查询序列"GGCGACTG"
作为“查询”,200 到 400 个核苷酸之间的 1000 个随机序列将用作参考集。
seq1 <- "GGCGACTG"
set.seed(1234)
ref <- replicate(1e3,
sample(c("A", "C", "G", "T"), sample(200:400, 1), replace = TRUE),
simplify = FALSE)
ref <- sapply(ref, paste, collapse = "")
ref <- DNAStringSet(ref)
在继续解决方案之前,让我们先看看仅使用该模式可以找到什么。
# how times does the pattern occur?
table(vcountPattern(seq1, ref)) # just 3 times
> 0 1
> 997 3
# how many times allowing for one mismatch?
# once in 96 sequences and twice in three sequences
n <- vcountPattern(seq1, ref, max.mismatch = 1)
table(n)
> n
> 0 1 2
> 901 96 3
# examine the matched sequences
m <- vmatchPattern(seq1, ref, max.mismatch = 1) # find the patterns
sel <- which.max(n) # select the first one with 2 matches
Views(ref[[sel]], m[[sel]]) # examine the matches
> Views on a 384-letter DNAString subject
> subject: TCGCGTCGCACTTCTGCTAACACAGC...GCCCAGTCGACTGCTGCTCGGATTGC
> views:
> start end width
> [1] 104 111 8 [GGCGACCG]
> [2] 364 371 8 [GTCGACTG]
这里是生成变体的三个辅助函数。参数seq 可以是字符串,例如“GGGCGACTG”或DNAString 对象。参数n 是一个整数,它指定删除、插入或替换的上限。这些函数将创建具有 0、1、...、n 个删除、插入或替换的变体。如果n 设置为0,函数将返回原始序列。 makeIns() 和 makeSub() 的参数 symbol 应该是单个 IUPAC 字符,以指定将插入或替换哪些碱基。 “N”的默认值指定所有可能的碱基(“A”、“C”、“G”和“T”)。
makeDel() 使用combn() 标识删除位置。 makeIns() 和 makeSub() 的逻辑有点复杂,因为需要允许插入彼此相邻并且需要创建所有组合。这里我选择了 not 在查询序列的开头或结尾添加插入。
所有函数都返回一个适合在vmatchPattern() 或grep() 中使用的字符向量。
在 DNA 字符串中创建删除:
##
## makeDel - create 1:n deletions in a character string (DNA sequence)
## return a character vector of all possible variants
##
makeDel <- function(seq, n) {
# accept only a single value for 'seq'
cseq <- as.character(seq)
cseq <- unlist(strsplit(cseq[1], ""))
nseq <- length(cseq)
# simple argument checks
if (!is(n, "numeric")) stop("'n' must be an integer")
if (n == 0) return(paste(cseq, collapse = ""))
if (n >= nseq) stop("too many deletions for ", nseq, " letters")
# create all possible combinations to be dropped in 'index'
index <- lapply(seq_len(n), function(j) combn(nseq, j, simplify = FALSE))
index <- unlist(index, recursive = FALSE)
# drop base in each possible position and reassemble
ans <- lapply(index, function(idx) cseq[-idx])
ans <- sapply(ans, paste, collapse = "")
ans <- unique(ans) # remove duplicates
return(ans)
}
在 DNA 字符串中创建插入:
##
## makeIns - create 1:n insertions into DNA string (character vector)
## where each insertion is one of a given IUPAC-specified symbol
## return a character vector of all possible variants
##
makeIns <- function(seq, n, symbol = "N") {
# IUPAC codes for ambiguous bases
iupac <- c(N = "ACGT", A = "A", C = "C", G = "G", T = "T", M = "AC", R = "AG",
W = "AT", S = "CG", Y = "CT", K = "GT", V = "ACG", H = "ACT",
D = "AGT", B = "CGT")
# only accept single value for 'seq'
cseq <- as.character(seq)
cseq <- unlist(strsplit(cseq[1], ""))
nseq <- length(cseq)
# simple argument checks
if (!is(n, "numeric")) stop("'n' must be an integer")
symbol <- toupper(symbol)
if (nchar(symbol) != 1 | !symbol %in% names(iupac))
stop("'symbol' must be a single valid IUPAC symbol")
if (n == 0) return(paste(cseq, collapse = ""))
if (n >= nseq) stop("seems like too many insertions for ", nseq, " letters")
# which bases are to be inserted?
ACGT <- strsplit(iupac[symbol], "")[[1]]
# create all possible combinations of positions for the insertion
ipos <- seq_len(nseq - 1) # insert after this position
index <- lapply(1:n, function(i) do.call(expand.grid, rep(list(ipos), i)))
index <- lapply(index, function(v) split(v, seq_len(nrow(v))))
index <- unlist(index, recursive = FALSE)
index <- lapply(index, unlist)
index <- lapply(index, sort)
# place the required number of insertions after each position in index
res <- lapply(index, function(idx) {
tally <- rle(idx)$lengths
breaks <- diff(c(0, idx, nseq))
NN <- Map(base::rep, symbol, tally)
spl <- split(cseq, rep(seq_along(breaks), breaks))
sel <- head(seq_along(spl), -1)
spl[sel] <- Map(base::c, spl[sel], NN)
ans <- unlist(spl)
if (length(ACGT) > 1) { # replicate and replace with appropriate bases
sites <- grep(symbol, ans)
nsites <- length(sites)
nsymbol <- length(ACGT)
bases <- expand.grid(rep(list(ACGT), nsites), stringsAsFactors = FALSE)
bases <- as.matrix(bases)
nvars <- nrow(bases)
ans <- do.call(rbind, rep(list(ans), nvars))
ans[, sites] <- bases
ans <- split(ans, seq_len(nvars))
ans <- lapply(ans, paste, collapse = "")
}
else
ans <- paste(ans, collapse = "")
return(ans)
})
res <- unlist(res)
res <- unique(res)
return(res)
}
在 DNA 字符串中创建替换:
##
## makeSub - create an arbitrary number of substitutions in each 1:n positions
## with the IUPAC bases specified by 'symbol'
## return a character vector with all possible variants
##
makeSub <- function(seq, n, symbol = "N")
{
# IUPAC codes for ambiguous bases
iupac <- c(N = "ACGT", A = "A", C = "C", G = "G", T = "T", M = "AC", R = "AG",
W = "AT", S = "CG", Y = "CT", K = "GT", V = "ACG", H = "ACT",
D = "AGT", B = "CGT")
# accept only a single value for 'seq'
cseq <- as.character(seq)
cseq <- unlist(strsplit(cseq[1], ""))
nseq <- length(cseq)
# simple argument checks
if (!is(n, "numeric")) stop("'n' must be an integer")
symbol <- toupper(symbol)
if (nchar(symbol) != 1 | !symbol %in% names(iupac))
stop("'symbol' must be a single valid IUPAC symbol")
if (n == 0) return(paste(cseq, collapse = ""))
if (n > nseq) stop("too many substitutions for ", nseq, " bases")
# which bases are to be used for the substitution?
ACGT <- strsplit(iupac[symbol], "")[[1]]
# create all possible combinations of positions to be changed in 'index'
index <- lapply(seq_len(n), function(j) combn(nseq, j, simplify = FALSE))
index <- unlist(index, recursive = FALSE)
# for each numeric vector in index, create as many variants as
# alternative bases are needed, collect in 'ans'
ans <- lapply(index, function(idx) {
bases <- lapply(cseq[idx], function(v) setdiff(ACGT, v))
bases <- bases[sapply(bases, length) > 0] # defensive
bases <- expand.grid(bases, stringsAsFactors = FALSE)
bases <- as.matrix(bases)
nvars <- nrow(bases)
vars <- do.call(rbind, rep(list(cseq), nvars))
vars[ ,idx] <- bases
if (!is.null(vars))
return(split(vars, seq_len(nvars)))
})
ans <- unlist(ans, recursive = FALSE)
ans <- sapply(ans, paste, collapse = "")
ans <- unique(ans) # remove duplicates
return(ans)
}
输出示例:
makeDel(seq1, 0)
> [1] "GGCGACTG"
makeDel(seq1, 1)
> [1] "GCGACTG" "GGGACTG" "GGCACTG" "GGCGCTG" "GGCGATG" "GGCGACG" "GGCGACT"
makeDel(seq1, 2)
> [1] "GCGACTG" "GGGACTG" "GGCACTG" "GGCGCTG" "GGCGATG" "GGCGACG" "GGCGACT"
> [8] "CGACTG" "GGACTG" "GCACTG" "GCGCTG" "GCGATG" "GCGACG" "GCGACT"
> [15] "GGGCTG" "GGGATG" "GGGACG" "GGGACT" "GGCCTG" "GGCATG" "GGCACG"
> [22] "GGCACT" "GGCGTG" "GGCGCG" "GGCGCT" "GGCGAG" "GGCGAT" "GGCGAC"
makeIns(seq1, 1) # default form
> [1] "GAGCGACTG" "GCGCGACTG" "GGGCGACTG" "GTGCGACTG" "GGACGACTG" "GGCCGACTG"
> [7] "GGTCGACTG" "GGCAGACTG" "GGCGGACTG" "GGCTGACTG" "GGCGAACTG" "GGCGCACTG"
> [13] "GGCGTACTG" "GGCGACCTG" "GGCGAGCTG" "GGCGATCTG" "GGCGACATG" "GGCGACGTG"
> [19] "GGCGACTTG" "GGCGACTAG" "GGCGACTCG" "GGCGACTGG"
makeIns(seq1, 1, symbol = "Y") # inserting only "C" or "T"
> [1] "GCGCGACTG" "GTGCGACTG" "GGCCGACTG" "GGTCGACTG" "GGCTGACTG" "GGCGCACTG"
> [7] "GGCGTACTG" "GGCGACCTG" "GGCGATCTG" "GGCGACTTG" "GGCGACTCG"
makeSub("AAA", 1)
> [1] "CAA" "GAA" "TAA" "ACA" "AGA" "ATA" "AAC" "AAG" "AAT"
makeSub("AAA", 2)
> [1] "CAA" "GAA" "TAA" "ACA" "AGA" "ATA" "AAC" "AAG" "AAT" "CCA" "GCA" "TCA"
> [13] "CGA" "GGA" "TGA" "CTA" "GTA" "TTA" "CAC" "GAC" "TAC" "CAG" "GAG" "TAG"
> [25] "CAT" "GAT" "TAT" "ACC" "AGC" "ATC" "ACG" "AGG" "ATG" "ACT" "AGT" "ATT"
这些函数可以与vmatchPattern() 一起使用来创建变体和提取匹配。一种建议的方法是首先使用max.mismatch = 1 找到那些不匹配的序列。 下一步,使用vmatchPattern() 和fixed = FALSE 以及max.mismatch 的默认值0 查找有缺失和插入的序列。
或者,辅助函数生成的显式模式可以传递给并行运行的grep 进程!以下显示了vmatchPattern 的使用,但可能有理由使用不同的工具进行分析。请参阅有关此主题的 cmets。
# first, allow mismatches to the original pattern
# the result is a "ByPos_MIndex" object of length 1000
m1 <- vmatchPattern(seq1, ref, max.mismatch = 1) # as before...
n1 <- elementNROWS(m1) # counts the number of elements (matches)
which(n1 > 0) # which of the 1000 ref sequence had a match with 0 or 1 mismatches?
> [1] 14 71 73 76 79 88 90 108 126 129 138 141 150 160 163 179 180 195 200
> [20] 205 212 225 227 239 241 246 247 255 276 277 280 299 310 335 338 345 347 357
> [39] 359 369 378 383 387 390 391 404 409 410 414 418 469 472 479 488 499 509 523
> [58] 531 533 567 571 574 580 588 590 591 594 601 634 636 646 654 667 679 685 694
> [77] 696 713 717 732 734 737 749 750 761 762 783 815 853 854 857 903 929 943 959
> [96] 969 981 986 998
# Second search each of the patterns with lapply
# generates seven lists of objects, each of length 10000
pat2 <- makeDel(seq1, 1)
m2 <- lapply(pat2, function(pat) vmatchPattern(pat, ref))
# generates 22 lists of objects, each of length 10000
pat3 <- makeIns(seq1, 1)
m3 <- lapply(pat3, function(pat) vmatchPattern(pat, ref))
m2 和 m3 中的第二个和第三个结果是“ByPos_MIndex”对象的列表。下面的示例从m2 中提取匹配的数量,并以str() 的缩写形式显示这些匹配。列表中的每个值都标识了与相应模式至少有一次匹配的参考序列。
n2 <- lapply(m2, elementNROWS)
str(sapply(n2, function(n) which(n > 0)))
> List of 7
> $ : int [1:14] 14 138 179 335 369 391 567 679 713 734 ...
> $ : int [1:18] 138 200 240 298 310 343 510 594 598 599 ...
> $ : int [1:15] 21 26 45 60 260 497 541 600 607 642 ...
> $ : int [1:17] 27 54 120 121 123 132 210 242 244 257 ...
> $ : int [1:18] 15 33 110 126 154 419 528 539 546 606 ...
> $ : int [1:12] 24 77 79 139 525 588 601 679 770 850 ...
> $ : int [1:15] 179 345 378 414 469 571 574 580 591 713 ...
最后一个示例通过相同的机制检查 22 个“ByPos_MIndex”对象 (m3) 的第三个列表。它显示了 22 个变体中的一些不匹配,一些匹配一次,五个匹配两次。
n3 <- lapply(m3, elementNROWS) # extract all counts
names(n3) <- sprintf("pat_%02d", seq_along(n3)) # for convenience
str(lapply(n3, function(n) which(n > 0)))
> List of 22
> $ pat_01: int 679
> $ pat_02: int 391
> $ pat_03: int(0)
> $ pat_04: int 737
> $ pat_05: int(0)
> $ pat_06: int(0)
> $ pat_07: int 108
> $ pat_08: int 276
> $ pat_09: int 439
> $ pat_10: int [1:2] 764 773
> $ pat_11: int(0)
> $ pat_12: int [1:2] 22 820
> $ pat_13: int 795
> $ pat_14: int [1:2] 914 981
> $ pat_15: int(0)
> $ pat_16: int 112
> $ pat_17: int 884
> $ pat_18: int(0)
> $ pat_19: int [1:2] 345 378
> $ pat_20: int [1:2] 571 854
> $ pat_21: int 574
> $ pat_22: int(0)
不用说,为了提取序列信息,仍然需要进行大量的数据处理。这可以通过matchPattern 的帮助页面进行编码,并对help("lowlevel-matching", package = "Biostrings") 中描述的模式匹配逻辑有所了解。
尽管Biostrings 中的例程使用非常快速且非常节省内存的算法来处理大型序列。在其他情况下,乔似乎更快地找到原始搜索。总是有更多东西要学!