Efficient subsetting in R phyloseq ignoring missing parameters

【Question title】: Efficient subsetting in R phyloseq ignoring missing parameters
【Posted】: 2020-02-21 21:38:37
【Question】:

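I use phyloseq a lot in my work. My data sets usually contain several conditions or parameters that need to be analysed in the same way (e.g. the same plot for bacteria in summer or winter, and in Lake1 or Lake2), so I would like to use functions for this. I wrote a subsetting function that lets me combine several parameters by looping over them. The output is stored in a list for further analysis.

However, this seems clunky, so my first question is about improving the function.

1) Specifically, I would like to know

a) whether using multiple for loops to generate the subsets is a good idea;

b) whether the combination of for loops and lapply can be optimised; and

c) whether there is a better way to prevent an existing list from being silently appended with new iterations of the same objects. I implemented this check because I run the code many, many times while developing it.

Whether for loops are slower than apply is discussed here: lapply vs for loop - Performance R

I think phyloseq calls which internally, so the solution does not have to be phyloseq-specific.

2) My second question is how to handle the case where not all search parameters are present in all subsets. In the example below, if there were no Danish males, the combination of "danish" and "M" would break. I would like to avoid this, so that this example would yield only 3 subsets (danish x F, american x F, american x M) instead of 4. At the moment the function has to be adapted for each special subset, which defeats the purpose of writing it in the first place.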

library(phyloseq)
data(enterotype)
# reduce the size of the data set
phyloseq <- filter_taxa(enterotype, function (x) {sum(x > 0.001) >= 1}, prune = TRUE)

# arguments for the subsetting function
phyloseq_object <- phyloseq
Nationality <- c("american", "danish")
Gender <- c("F", "M")

# define a function to obtain sample subsets from the phyloseq object 
# per combination of parameters
get_sample_subsets <- function(phyloseq_object, nation, gender) {
  sample_subset <- sample_data(phyloseq_object)[ which(sample_data(phyloseq_object)$Nationality == nation &
    sample_data(phyloseq_object)$Gender == gender),]
  phyloseq_subset <- merge_phyloseq(tax_table(phyloseq_object),
    otu_table(phyloseq_object),
    #refseq(phyloseq_object),
    sample_subset)
  phyloseq_subset2 <- filter_taxa(phyloseq_subset, function (x) {sum(x > 0) >= 1 }, prune = TRUE)
  return(phyloseq_subset2)
}

# here we pass the arguments for subsetting over two for loops
# to create all possible combinations of the subset parameters etc.
# the subsets are stored within a list, which has to be empty before running the loops 
sample_subset_list <- list()
if(length(sample_subset_list) == 0) {
  for (nations in Nationality) {
    for (gender in Gender) {
      tmp <- get_sample_subsets(phyloseq_object = phyloseq_object,
        nation = nations, gender = gender)
      sample_subset_list[[paste(nations, gender, sep = "_")]] <- tmp
    }
  }
  print(sample_subset_list)
} else {
  print("list is not empty, abort to prevent appending...")
}

# You could now for example use the output to calculate ordinations for each subset (this data set has too few entries per subset for that)

# run an NMDS ordination on each sample subset and store the results in a list
ordination_nmds <- lapply(sample_subset_list, ordinate, method = "NMDS",
  distance = "bray", try = 100, autotransform = TRUE)

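【Question comments】:

Could you be more specific about the first part of your question? Which parts exactly are you looking to improve? The second part seems fairly simple, but it is not clear from your question what the first one means.
Have you tried using the split() function to get the expected list of subsets? rdocumentation.org/packages/base/versions/3.6.1/topics/split.
I think you are looking for a combination of split() and lapply(). The first lets you create the subsets easily and store them in a list. lapply() can replace your for() loops. This makes your code more robust, easier to debug and faster (in my opinion).
@Oliver: I tried to make the question about my function clearer. Let me know if this helps.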

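Tags: r function subset phyloseq

【Solution 1】:

Works for S3, but not for S4 (see comments)

Since I am not familiar with S4, I may delete this answer if something better comes up.

Following up on my comment, here is something that might help you. Let me know if you need a better solution or if it does not solve your problem.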

# I switched to mtcars because the "phyloseq" package requires a separate installation
ex_data <- mtcars

# This line might replace your "get_sample_subsets" function and the loop that checks for empty lists.
# You can change the elements inside list(...) to get the subsets you want; it is very flexible.
# Note drop = TRUE, which avoids "empty" elements.
sampled_data <- split(ex_data, list(ex_data$cyl, ex_data$vs), drop = TRUE)
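
To make the link to the question's second point explicit, here is a minimal base R sketch on a made-up sample table in which one combination (Danish males) does not occur. The data frame `samples` and its values are hypothetical and only illustrate how `drop = TRUE` skips the missing combination and how `lapply()` can stand in for the nested for loops.

```r
# Toy sample table (hypothetical data): there are no Danish males.
samples <- data.frame(
  id          = paste0("s", 1:6),
  Nationality = c("american", "american", "american", "danish", "danish", "american"),
  Gender      = c("F", "M", "F", "F", "F", "M"),
  stringsAsFactors = FALSE
)

# One list element per combination that actually occurs;
# drop = TRUE discards the empty "danish.M" group.
subsets <- split(samples, list(samples$Nationality, samples$Gender), drop = TRUE)

# lapply() replaces the nested for loops; here it only counts rows per subset.
subset_sizes <- lapply(subsets, nrow)

print(names(subsets))
```

Because the whole list is created in a single assignment, re-running the code overwrites `subsets` instead of appending to it, so a guard against a non-empty list should not be needed with this approach.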

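【Comments】:

The split() approach looks promising, but objects of class phyloseq are more complicated to access; do you know how to adapt it? I think phyloseq should be available for R 3.6 via source('http://bioconductor.org/biocLite.R') biocLite('phyloseq')?
@Paul It is an S4 object [adv-r.had.co.nz/OO-essentials.html#s4]. Here is a related answer: [stackoverflow.com/questions/10961842/…
Specifically for phyloseq: [joey711.github.io/phyloseq/preprocess.html]
My bad, I was not aware of this S4 object issue. I will try to find a better solution using the edited question and your comments :)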
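
One way the split() idea might carry over to phyloseq is to split the sample names taken from the metadata table, rather than the S4 object itself, and then prune the object once per group. The sketch below is an untested assumption along those lines, not a solution posted in the thread. It uses the enterotype data and the Nationality and Gender columns from the question; `meta`, `groups` and `subset_list` are illustrative names.

```r
library(phyloseq)
data(enterotype)

# Sample metadata as a plain data.frame, so that base R split() applies.
meta <- as(sample_data(enterotype), "data.frame")

# Sample names grouped by the combinations that actually occur;
# split() leaves out samples with NA in either column.
groups <- split(rownames(meta), list(meta$Nationality, meta$Gender), drop = TRUE)

# Prune the phyloseq object to each group, then drop taxa absent from that group.
subset_list <- lapply(groups, function(ids) {
  ps <- prune_samples(ids, enterotype)
  prune_taxa(taxa_sums(ps) > 0, ps)
})
```

The resulting list could then be passed to lapply() with ordinate, as in the question. Combinations with no samples never enter `groups`, so no special-casing per subset should be required.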