Snakemake: using a rule in a loop

【Question title】: Snakemake using a rule in a loop
【Posted】: 2019-05-23 11:17:54
【Question】:

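I am trying to use a Snakemake rule in a loop, so that the rule takes the output of the previous iteration as its input. Is that possible, and if so, how can I do it?

Here is my example.

Set up the test data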

mkdir -p test
echo "SampleA" > test/SampleA.txt
echo "SampleB" > test/SampleB.txt

Snakefile

SAMPLES = ["SampleA", "SampleB"]

rule all:
    input:
        # Output of the final loop
        expand("loop3/{sample}.txt", sample = SAMPLES)


#### LOOP ####
for i in list(range(1, 4)):
    # Setup prefix for input
    if i == 1:
        prefix = "test"
    else:
        prefix = "loop%s" % str(i-1)

    # Setup prefix for output
    opref =  "loop%s" % str(i)

    # Rule
    rule loop_rule:
        input:
            prefix+"/{sample}.txt"
        output:
            prefix+"/{sample}.txt"
            #expand("loop{i}/{sample}.txt", i = i, sample = wildcards.sample)
        params:
            add=prefix
        shell:
            "awk '{{print $0, {params.add}}}' {input} > {output}"

Trying to run the example produces the error: CreateRuleException in line 26 of /Users/fabiangrammes/Desktop/Projects/snake_loop/Snakefile: The name loop_rule is already used by another rule. If anyone sees a way to make this work, that would be great!

Thanks!

【Comments】:

Tags: python shell snakemake

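【Solution 1】:

I think this is a good opportunity to use recursive programming. Rather than explicitly including a conditional for every iteration, write a single rule that transitions from iteration (n-1) to n. So, roughly something like this: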

SAMPLES = ["SampleA", "SampleB"]

rule all:
    input:
        expand("loop3/{sample}.txt", sample=SAMPLES)

def recurse_sample(wcs):
    n = int(wcs.n)
    if n == 1:
        return "test/%s.txt" % wcs.sample
    elif n > 1:
        return "loop%d/%s.txt" % (n-1, wcs.sample)
    else:
        raise ValueError("loop numbers must be 1 or greater: received %s" % wcs.n)

rule loop_n:
    input: recurse_sample
    output: "loop{n}/{sample}.txt"
    wildcard_constraints:
        sample="[^/]+",
        n="[0-9]+"
    shell:
        """
        awk -v loop='loop{wildcards.n}' '{{print $0, loop}}' {input} > {output}
        """

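As @RussHyde said, you need to be proactive about making sure no infinite loop is triggered. To that end, we make sure every case is covered in recurse_sample, and we use wildcard_constraints to make sure the matching is exact.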
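To see why the recursion terminates, it can help to trace it outside Snakemake. The following is a minimal plain-Python sketch, not how Snakemake resolves dependencies internally: the regular expression mirrors the output pattern and constraints of loop_n, recurse_sample is restated on plain values, and trace_chain is a hypothetical helper introduced only for this illustration.

```python
import re

# Mirrors the output pattern "loop{n}/{sample}.txt" with the constraints
# n="[0-9]+" and sample="[^/]+" from rule loop_n.
LOOP_OUTPUT = re.compile(r"loop(?P<n>[0-9]+)/(?P<sample>[^/]+)\.txt")


def recurse_sample(n, sample):
    """Same branching as recurse_sample in the Snakefile, on plain values."""
    if n == 1:
        return "test/%s.txt" % sample
    elif n > 1:
        return "loop%d/%s.txt" % (n - 1, sample)
    raise ValueError("loop numbers must be 1 or greater: received %s" % n)


def trace_chain(target):
    """Hypothetical helper: list the files needed, from the target back to the source."""
    chain = [target]
    match = LOOP_OUTPUT.fullmatch(chain[-1])
    while match:
        chain.append(recurse_sample(int(match.group("n")), match.group("sample")))
        match = LOOP_OUTPUT.fullmatch(chain[-1])
    return chain


print(trace_chain("loop3/SampleA.txt"))
# ['loop3/SampleA.txt', 'loop2/SampleA.txt', 'loop1/SampleA.txt', 'test/SampleA.txt']
```

The chain stops at test/SampleA.txt because that path no longer matches the loop pattern, and a request for loop0 raises an error instead of recursing further.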

【Discussion】:

Oh, I hadn't seen wildcard_constraints before; I have always encoded them inside the braces. This is really helpful.
Thanks merv, really elegant!

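【Solution 2】:

My understanding is that your rules are converted to Python code before they are run, and that all the raw Python code present in your Snakefile is run sequentially during this process. Think of it as your Snakemake rules being evaluated as Python functions.

But there is a constraint: any rule can only be evaluated as a function once.

You can use if/else expressions and evaluate a rule differently (once) based on config values and so on, but you can't evaluate a rule multiple times.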
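As a rough illustration of that constraint, here is a toy model in plain Python. It is not Snakemake's actual implementation: RuleRegistry and add_rule are made-up names, and the only thing taken from the question is that a second registration under the same name is rejected with the message quoted there.

```python
class RuleRegistry:
    """Toy stand-in for a workflow's table of rules; not Snakemake's real code."""

    def __init__(self):
        self.rules = {}

    def add_rule(self, name, input_pattern, output_pattern):
        # A rule name may be registered only once.
        if name in self.rules:
            raise ValueError("The name %s is already used by another rule" % name)
        self.rules[name] = (input_pattern, output_pattern)


registry = RuleRegistry()
errors = []
for i in range(1, 4):
    # Same prefix logic as the loop in the question.
    prefix = "test" if i == 1 else "loop%d" % (i - 1)
    opref = "loop%d" % i
    try:
        registry.add_rule("loop_rule", prefix + "/{sample}.txt", opref + "/{sample}.txt")
    except ValueError as exc:
        errors.append(str(exc))

print(sorted(registry.rules))  # only the first registration survives
print(len(errors))             # the second and third attempts are rejected
```

In this model the first pass through the loop defines loop_rule and every later pass fails, which matches the shape of the error reported in the question.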

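I'm not sure how to rewrite your Snakefile to achieve what you want. Is there a real example you could give where a looping construct seems to be required?

--- Edit

For a fixed number of iterations, you can use an input function to run the rule multiple times. (I would caution against doing this, though; be very careful to rule out infinite loops.)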

SAMPLES = ["SampleA", "SampleB"]

rule all:
    input:
        # Output of the final loop
        expand("loop3/{sample}.txt", sample = SAMPLES)

def looper_input(wildcards):
    # could be written more cleanly with a dictionary
    if wildcards["prefix"] == "loop0":
        input = "test/{}.txt".format(wildcards["sample"])
    elif wildcards["prefix"] == "loop1":
        input = "loop0/{}.txt".format(wildcards["sample"])
    ...
    return input


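rule looper:
    input:
        looper_input
    output:
        "{prefix}/{sample}.txt"
    params:
        # "{prefix}" is filled in from the wildcard; a bare `prefix` is undefined here
        add="{prefix}"
    shell:
        # pass the value with -v so awk treats it as a string, not an unset variable
        "awk -v add='{params.add}' '{{print $0, add}}' {input} > {output}"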

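The comment in looper_input mentions that a dictionary would be cleaner. A minimal sketch of that idea follows. The answer only spells out the loop0 and loop1 cases, so the loop2 and loop3 entries are an assumption extrapolated from that pattern, and a plain dict stands in for Snakemake's wildcards object so the function can be tried on its own.

```python
# Assumed mapping from an output prefix to the prefix its input comes from.
# Only loop0 and loop1 appear in the answer; the other entries are extrapolated.
PREVIOUS_PREFIX = {
    "loop0": "test",
    "loop1": "loop0",
    "loop2": "loop1",
    "loop3": "loop2",
}


def looper_input(wildcards):
    """Return the input path for a given output prefix and sample."""
    prefix = wildcards["prefix"]
    if prefix not in PREVIOUS_PREFIX:
        # Fail loudly on an unknown prefix rather than returning nothing.
        raise ValueError("no input defined for prefix %r" % prefix)
    return "{}/{}.txt".format(PREVIOUS_PREFIX[prefix], wildcards["sample"])


# A plain dict stands in for the wildcards object here.
print(looper_input({"prefix": "loop1", "sample": "SampleA"}))  # loop0/SampleA.txt
```

Keeping the mapping finite is what bounds the number of iterations: any prefix outside the table raises an error instead of producing another input to resolve.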
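【Discussion】:

Thanks for the input, Russ. My real-world example is the iterative estimation of SNP effects, so I have to iterate. Does anyone know whether rule names can be assigned via a function? That could be a solution for me.
Is there a reason the loop can't be defined inside the rule's run/shell?
Maybe it can, I'm not sure. In practice I have 4-5 separate rules. I'll try it tonight; I'm about to head out.
Is it a fixed number of iterations, or a matter of iterating until convergence? In the former case, you can use an input function to allow iteration. I'll try to add some code.
Thanks for the effort, Russ! I'll accept merv's answer since it's a bit more elegant, but I really appreciate your help.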