【Posted】: 2020-06-27 22:14:50
【Question】:
I need to create a list that contains one vector per gene. Each vector should be the result of applying func to that gene.
【Comments】:
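- What is genes_codon_count in the example? Perhaps you need sum(lengths(genes_codon_count))
- I assume "without any packages" means "only the base R distribution", i.e. the standard packages that are loaded into R by default.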
Tags: count
Hello, and welcome to Stack Overflow. Since you did not provide any data to test with, I made some dummy data, shown below. These are examples of dummy genes:
valid_codons <- c("aaa", "aac", "aag", "aat", "aca", "acc", "acg", "act",
"aga", "agc", "agg", "agt", "ata", "atc", "atg", "att", "caa", "cac",
"cag", "cat", "cca", "ccc", "ccg", "cct", "cga", "cgc", "cgg", "cgt",
"cta", "ctc", "ctg", "ctt", "gaa", "gac", "gag", "gat", "gca", "gcc",
"gcg", "gct", "gga", "ggc", "ggg", "ggt", "gta", "gtc", "gtg", "gtt",
"taa", "tac", "tag", "tat", "tca", "tcc", "tcg", "tct", "tga", "tgc",
"tgg", "tgt", "tta", "ttc", "ttg", "ttt")
genes <- replicate(3800, {
  paste0(sample(valid_codons, sample(5:20, 1), replace = TRUE), collapse = "")
})
print(head(genes, 3))
#> [1] "gggtacaaagtgcat"
#> [2] "cggaaaaccggggcgtgtccg"
#> [3] "ggaccactattactctcctcgggtatagatacccgaggt"
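Judging from your function, I assume each gene is stored as a character vector of single letters, so I built that structure like this: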
genes_chars <- strsplit(genes, "")
print(head(genes_chars, 2))
#> [[1]]
#> [1] "g" "g" "g" "t" "a" "c" "a" "a" "a" "g" "t" "g" "c" "a" "t"
#>
#> [[2]]
#> [1] "c" "g" "g" "a" "a" "a" "a" "c" "c" "g" "g" "g" "g" "c" "g" "t" "g" "t" "c"
#> [20] "c" "g"
Now to your actual question: I wrapped the codon_count() function you provided in an lapply() call to compute the results.
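codon_count <- function(gene) {
  # One counter per codon, named so it can be indexed by codon string
  answer <- rep(0, 64)
  names(answer) <- valid_codons
  # Step through the gene three characters at a time
  for (i in seq(from = 1, to = length(gene), by = 3)) {
    codon <- tolower(paste0(gene[i], gene[i + 1], gene[i + 2]))
    answer[codon] <- answer[codon] + 1
  }
  return(answer[valid_codons])
}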
result <- lapply(genes_chars, codon_count)
print(head(result, 2))
#> [[1]]
#> aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att caa cac cag cat
#> 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
#> cca ccc ccg cct cga cgc cgg cgt cta ctc ctg ctt gaa gac gag gat gca gcc gcg gct
#> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> gga ggc ggg ggt gta gtc gtg gtt taa tac tag tat tca tcc tcg tct tga tgc tgg tgt
#> 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
#> tta ttc ttg ttt
#> 0 0 0 0
#>
#> [[2]]
#> aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att caa cac cag cat
#> 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> cca ccc ccg cct cga cgc cgg cgt cta ctc ctg ctt gaa gac gag gat gca gcc gcg gct
#> 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0
#> gga ggc ggg ggt gta gtc gtg gtt taa tac tag tat tca tcc tcg tct tga tgc tgg tgt
#> 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
#> tta ttc ttg ttt
#> 0 0 0 0
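We can check that the dimensions are correct with length() and lengths(): there should be one element per gene, and each element should hold 64 counts.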
unique(lengths(result))
#> [1] 64
length(result)
#> [1] 3800
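Beyond the dimensions, one further sanity check is that the counts for each gene add up to its number of codons. Here is a minimal sketch on two toy genes, taken from the printed head of genes above. The names toy_genes, toy_result, codons_per_gene and expected are hypothetical, and valid_codons is rebuilt with expand.grid() only so that the block runs on its own.

```r
# Rebuild the 64 codons in the same alphabetical order as valid_codons.
# expand.grid() varies its first argument fastest, so the third base goes first.
bases <- c("a", "c", "g", "t")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
valid_codons <- paste0(grid$first, grid$second, grid$third)

# Same counting logic as the codon_count() function in this answer
codon_count <- function(gene) {
  answer <- rep(0, 64)
  names(answer) <- valid_codons
  for (i in seq(from = 1, to = length(gene), by = 3)) {
    codon <- tolower(paste0(gene[i], gene[i + 1], gene[i + 2]))
    answer[codon] <- answer[codon] + 1
  }
  answer[valid_codons]
}

# Two toy genes (assumption: lengths are multiples of three)
toy_genes <- c("gggtacaaagtgcat", "cggaaaaccggggcgtgtccg")
toy_result <- lapply(strsplit(toy_genes, ""), codon_count)

# Each gene's counts should sum to nchar(gene) / 3
codons_per_gene <- vapply(toy_result, sum, numeric(1))
expected <- nchar(toy_genes) / 3
stopifnot(all(codons_per_gene == expected))
codons_per_gene
#> [1] 5 7
```

The same check should carry over to the full data by swapping in genes and result for the toy objects.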
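However, I think the code below is more efficient. It splits each gene directly into codons and tabulates them, and it returns a matrix with one row per gene instead of a list.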
# Split character vectors into groups of three
# Based on https://stackoverflow.com/questions/11619616/how-to-split-a-string-into-substrings-of-a-given-length
splitgenes <- strsplit(genes, "(?<=.{3})", perl = TRUE)
result2 <- t(vapply(splitgenes, function(gene) {
  table(factor(gene, valid_codons))
}, numeric(length(valid_codons))))
# What are the result2 dimensions and content?
dim(result2)
#> [1] 3800 64
result2[1:5, 1:5]
#> aaa aac aag aat aca
#> [1,] 1 0 0 0 0
#> [2,] 1 0 0 0 0
#> [3,] 0 0 0 0 0
#> [4,] 0 0 0 0 0
#> [5,] 0 1 0 0 0
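If a list with one vector per gene is still needed, as the question asks, the rows of such a matrix can be turned back into a list. This is a minimal sketch on two toy genes. The names toy_genes, toy_matrix and toy_list are hypothetical, and valid_codons is rebuilt with expand.grid() only so that the block runs on its own.

```r
# Rebuild the 64 codons in the same alphabetical order as valid_codons
bases <- c("a", "c", "g", "t")
grid <- expand.grid(third = bases, second = bases, first = bases,
                    stringsAsFactors = FALSE)
valid_codons <- paste0(grid$first, grid$second, grid$third)

# Two toy genes, split into codons and tabulated as in the code above
toy_genes <- c("gggtacaaagtgcat", "cggaaaaccggggcgtgtccg")
split_toy <- strsplit(toy_genes, "(?<=.{3})", perl = TRUE)
toy_matrix <- t(vapply(split_toy, function(gene) {
  table(factor(gene, valid_codons))
}, numeric(length(valid_codons))))

# One count vector per gene, taken row by row from the matrix
toy_list <- lapply(seq_len(nrow(toy_matrix)), function(i) toy_matrix[i, ])

length(toy_list)
#> [1] 2
lengths(toy_list)
#> [1] 64 64
```

Whether the matrix or the list is more convenient depends on what happens next; the matrix is easier to subset by codon, while the list matches the structure requested in the question.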
Edit:
Here is the for-loop equivalent of the lapply() statement:
result <- list()
for (i in seq_along(genes_chars)) {
  result[[i]] <- codon_count(genes_chars[[i]])
}
Note, however, that this is less efficient than the lapply() version.
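If a for loop is preferred anyway, a common way to reduce the overhead is to allocate the result list at its final length up front instead of growing it from an empty list(). This is a minimal sketch, not a benchmark. The names toy_chars, result_loop and count_bases() are hypothetical, and count_bases() is only a small stand-in for codon_count() so that the block runs on its own.

```r
# Toy input: two genes as vectors of single characters
toy_chars <- strsplit(c("gggtacaaagtgcat", "cggaaaaccggggcgtgtccg"), "")

# Stand-in for codon_count(): any function applied per gene works here
count_bases <- function(gene) table(factor(gene, c("a", "c", "g", "t")))

# Allocate the list once, then fill it in place
result_loop <- vector("list", length(toy_chars))
for (i in seq_along(toy_chars)) {
  result_loop[[i]] <- count_bases(toy_chars[[i]])
}

# The loop and lapply() produce the same list
identical(result_loop, lapply(toy_chars, count_bases))
#> [1] TRUE
```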
【Comments】: