Snakemake: Avoid removing output files before executing the shell command

[Question title]: Snakemake: Avoid removing output files before executing the shell command
[Posted]: 2018-03-20 03:25:30
[Question]:

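Is it possible to keep Snakemake from removing the output files defined in a rule before the shell command runs? I found this behavior described here: http://snakemake.readthedocs.io/en/stable/project_info/faq.html#can-the-output-of-a-rule-be-a-symlink

What I want to do is define a rule for a list of input files and a list of output files (an N:M relationship). The rule should be triggered if any of the input files has changed. The Python script called in the shell command then creates only those outputs that do not exist yet or whose content has changed compared with the existing file (that is, change detection is implemented in the Python script). I hoped a rule like the following would solve this, but because the output.jsons files are deleted before the Python script runs, all of them are recreated with a new timestamp, not just the ones that actually changed.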

rule jsons:
    "Create transformation files out of landmark correspondences."
    input:
        matchfiles = ["matching/%04i-%04i.h5" % (SECTIONS[i], SECTIONS[i+1]) for i in range(len(SECTIONS)-1)]
    output:
        jsons = ["transformation/{section}_transformation.json".format(section=s) for s in SECTIONS]
    shell:
        "python create_transformation_jsons.py --matchfiles {input.matchfiles} --outfiles {output.jsons}"

If deleting output files cannot be avoided in Snakemake, does anyone know how to map this workflow onto Snakemake rules without updating all output files?
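The change detection described above, where the script rewrites a file only when it is missing or its content differs, might look roughly like the sketch below. The function name write_json_if_changed and the choice of comparing the serialized JSON text are assumptions made for illustration; the actual logic inside create_transformation_jsons.py is not shown in the question.

```python
import json
import os


def write_json_if_changed(path, data):
    """Write data as JSON to path only if the file is missing or differs.

    Hypothetical helper: returns True if the file was (re)written and
    False if it was left untouched, so its timestamp stays the same.
    """
    new_text = json.dumps(data, indent=2, sort_keys=True)
    if os.path.exists(path):
        with open(path, "r") as fh:
            if fh.read() == new_text:
                return False  # identical content: keep the old file as is
    # Create the parent directory if needed, then write the new content.
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as fh:
        fh.write(new_text)
    return True
```

A helper like this only preserves timestamps if the existing files are still present when the script runs, which is exactly what Snakemake's removal of declared outputs prevents.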

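Update:

I tried to solve this by changing the Snakemake source code. I removed the line self.remove_existing_output() in jobs.py so that output files are not deleted before a rule is executed. In addition, I added the argument no_touch=True to the call to self.dag.check_and_touch_output() in executors.handle_job_success. This works well in that the output files are now neither deleted nor touched when the rule runs. However, the downstream rules that take the json files as input are still triggered for every json file (even unchanged ones), because Snakemake recognizes that the json files were declared as outputs and therefore assumes they must have changed. So I think avoiding the deletion of output files does not solve my problem; perhaps a workaround, if one exists, is the only way...

Update 2:

I also tried to find a workaround without changing the Snakemake source code, by changing the output path of the jsons rule defined above to transformation/tmp/... and adding the following rule: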

def cmp_jsons(wildcards):
    section = int(wildcards.section)
    # compare json for given section in transformation/ with json in transformation/tmp/
    # return [] if json did not change
    # return path to tmp json filename if json has changed
    ...  # (comparison logic omitted)

rule copy:
    input:
        json_tmp = cmp_jsons
    output:
        jsonfile = "transformation/B21_{section,\d+}_affine_transformation.json"
    shell:
        "cp {input.json_tmp} {output.jsonfile}"

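But because input functions are evaluated before the workflow starts, the tmp jsons either do not exist yet or have not yet been updated by the jsons rule, so the comparison is incorrect.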
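For reference, the comparison that the comments in cmp_jsons describe could be sketched as follows. The function name changed_tmp_json and its two path arguments are hypothetical, and a byte-for-byte comparison is assumed. The sketch covers only the comparison itself; it does not remove the timing problem just described, since an input function would still run before the tmp files are regenerated.

```python
import filecmp
import os


def changed_tmp_json(final_path, tmp_path):
    """Return [tmp_path] if the tmp JSON should replace final_path, else [].

    Hypothetical helper mirroring the pseudocode comments in cmp_jsons.
    """
    if not os.path.exists(tmp_path):
        return []  # nothing was generated for this section
    if os.path.exists(final_path) and filecmp.cmp(final_path, tmp_path, shallow=False):
        return []  # byte-identical content: no update needed
    return [tmp_path]  # final file is missing or its content differs
```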

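[Comments]:

Tags: python snakemake

[Solution 1]:

This is a bit involved, but I think it should work seamlessly for you.

The solution involves calling snakemake twice, but you can wrap that in a shell script. In the first call, snakemake runs with --dryrun to determine which jsons will be updated; in the second call, this information is used to build the DAG. I use --config to switch between the two modes. Here is the Snakefile.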

def get_match_files(wildcards):
    """Used by jsons_fake to figure out which match files each json file depends on"""
    section = wildcards.section

    ### Do stuff to figure out what matching files this json depends on
    # YOUR CODE GOES HERE
    idx = SECTIONS.index(int(section)) # I have no idea if this is what you need
    matchfiles = ["matching/%04i-%04i.h5" % (SECTIONS[idx], SECTIONS[idx + 1])]

    return matchfiles

def get_json_output_files(fn):
    """Used by jsons. Read which json files will be updated from fn"""
    try:
        json_files = []
        with open(fn, 'r') as fh:
            for line in fh:
                split_line = line.split(maxsplit=1)
                if len(split_line) < 2:
                    continue  # skip empty lines and lines without a value
                if split_line[0] == "output:":
                    # Assumes there is only 1 output file per line. If more, modify.
                    # rstrip() drops the trailing newline (see the comments below).
                    json_files.append(split_line[1].rstrip())
    except FileNotFoundError:
        print(f"Warning, could not find {fn}. Updating all json files.")
        json_files = expand("transformation/{section}_transformation.json", section=SECTIONS)

    return json_files


if "configuration_run" in config:
    rule jsons_fake:
        "Fake rule used for figuring out which json files will be created."
        input:
            get_match_files
        output:
            jsons = "transformation/{section}_transformation.json"
        run:
            raise NotImplementedError("This rule is not meant to be executed")

    rule jsons_all:
        input: expand("transformation/{s}_transformation.json", s=SECTIONS)

else:
    rule jsons:
        "Create transformation files out of landmark correspondences."
        input:
            matchfiles = ["matching/%04i-%04i.h5" % (SECTIONS[i], SECTIONS[i+1]) for i in range(len(SECTIONS)-1)]
        output:
            jsons = get_json_output_files('json_dryrun') # This is called at rule creation
        params:
            jsons=expand("transformation/{s}_transformation.json", s=SECTIONS)
        run:
            shell("python create_transformation_jsons.py --matchfiles {input.matchfiles} --outfiles {params.jsons}")

So that you do not have to call Snakemake twice by hand, you can wrap it in a shell script, mysnakemake:

#!/usr/bin/env bash

snakemake jsons_all --dryrun --config configuration_run=yes | grep -A 2 'jsons_fake:' > json_dryrun
snakemake "$@"

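Then call the script just as you would normally call snakemake, e.g. mysnakemake all -j 2. Does this work for you? I have not tested every part of the code, so treat it with caution.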
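To make the hand-off between the two calls more concrete, here is a small standalone sketch of the parsing step applied to a string instead of a file. The sample text is only an assumed example of what the grep filter might leave in json_dryrun; the real dry-run output depends on the Snakemake version and should be checked before relying on this. The name parse_outputs is hypothetical.

```python
def parse_outputs(dryrun_text):
    """Collect the paths listed on 'output:' lines of a filtered dry-run log."""
    outputs = []
    for line in dryrun_text.splitlines():
        parts = line.split(maxsplit=1)
        # Keep only lines of the form "output: <path>"; skip everything else.
        if len(parts) == 2 and parts[0] == "output:":
            outputs.append(parts[1].rstrip())
    return outputs


# Assumed example of the filtered dry-run text (not taken from a real run).
sample = """\
rule jsons_fake:
    input: matching/0001-0002.h5
    output: transformation/1_transformation.json
--
rule jsons_fake:
    input: matching/0002-0003.h5
    output: transformation/2_transformation.json
"""

print(parse_outputs(sample))
# prints ['transformation/1_transformation.json', 'transformation/2_transformation.json']
```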

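[Comments]:

This works great! Thank you very much! Although Snakemake has to be called twice, it is definitely better than invoking two different rules with a long execution time between the two calls. I only had to add .rstrip() when appending split_line[1] to the list of json files. Apart from that, your code ran as soon as I had implemented get_match_files().

[Solution 2]:

I don't think Snakemake currently offers a solution to your problem. I think you would have to pull the input/output logic out of create_transformation_jsons.py and write a separate rule for each relationship in the Snakefile. It may help you to know that anonymous rules can be generated, for example inside a for loop; see "How to deal with a variable of output files in a rule".

Snakemake recently started clearing logs when executing a rule, and I opened an issue on that. A solution to that issue might help you as well. But all of that lies in an uncertain future, so don't count on it.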

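Update

Here is another approach. Your rule has no wildcards, so I assume you run it only once. I also assume that at execution time you can list the sections that are being updated. I have called that list SECTIONS_PRUNED. You can then write a rule that marks only those files as output files.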

rule jsons:
    "Create transformation files out of landmark correspondences."
    input:
        matchfiles = ["matching/%04i-%04i.h5" % (SECTIONS[i], SECTIONS[i+1]) for i in range(len(SECTIONS)-1)]
    output:
        jsons = ["transformation/{section}_transformation.json".format(section=s) for s in SECTIONS_PRUNED]
    params:
        jsons = [f"transformation/{s}_transformation.json" for s in SECTIONS]
    run:
        shell("python create_transformation_jsons.py --matchfiles {input.matchfiles} --outfiles {params.jsons}")

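I originally thought it would be a good idea to use shadow: "minimal" to make sure that any files SECTIONS_PRUNED fails to declare are not spuriously updated. However, with shadow the situation could be worse: the undeclared files would be updated and left behind in the shadow directory (and deleted without anyone noticing). With shadow, you would also need to copy the json files into the shadow directory so that your script can work out what to generate.

So the better solution is probably not to use shadow. If SECTIONS_PRUNED fails to declare all updated files, a second execution of snakemake will highlight (and fix) this and make sure all downstream analyses are completed correctly.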
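How SECTIONS_PRUNED gets computed is left open above. Purely as an illustration, the sketch below builds such a list from file timestamps. It assumes, hypothetically, that the json for SECTIONS[i] depends only on the match file for the pair (SECTIONS[i], SECTIONS[i+1]); the real dependency between match files and jsons may be more involved, and the function name prune_sections and its pattern arguments are made up for this example.

```python
import os


def prune_sections(sections, match_pattern, json_pattern):
    """Return the sections whose JSON is missing or older than its match file.

    Hypothetical mapping: the JSON for sections[i] is assumed to depend
    only on the match file for the pair (sections[i], sections[i + 1]).
    """
    pruned = []
    for i in range(len(sections) - 1):
        match_file = match_pattern % (sections[i], sections[i + 1])
        json_file = json_pattern.format(section=sections[i])
        if not os.path.exists(match_file):
            continue  # nothing to build from yet
        if (not os.path.exists(json_file)
                or os.path.getmtime(json_file) < os.path.getmtime(match_file)):
            pruned.append(sections[i])
    return pruned
```

In a Snakefile this might be called once at the top, for example SECTIONS_PRUNED = prune_sections(SECTIONS, "matching/%04i-%04i.h5", "transformation/{section}_transformation.json"). Note that a timestamp check only approximates the content-based change detection done inside the script.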

Update 2

Another, simpler approach is to split the workflow into two parts and not let snakemake know that the jsons rule produces output files.

rule jsons:
    "Create transformation files out of landmark correspondences."
    input:
        matchfiles = ["matching/%04i-%04i.h5" % (SECTIONS[i], SECTIONS[i+1]) for i in range(len(SECTIONS)-1)]
    params:
        jsons = [f"transformation/{s}_transformation.json" for s in SECTIONS]
    shell:
        "python create_transformation_jsons.py --matchfiles {input.matchfiles} --outfiles {params.jsons}"

Run snakemake in two steps, replacing all with the relevant rule name.

$ snakemake jsons
$ snakemake all

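[Comments]:

Thanks for your answer. I think the problem in my Snakemake workflow is that I don't know which json files will be changed by the jsons rule, so I don't see how to define anonymous rules in a for loop, as you explain, before the workflow starts and before I know which output files need an anonymous rule. Do you have a concrete code example? Resolving the linked Snakemake issue could indeed solve my problem, but it should take into account that not only the deletion of output files is a problem, but also the triggering of the downstream rules (see the update in my question).
I have updated my answer. However, you will need a way to determine which files you are updating. Can't you take the code for that from create_transformation_jsons.py?
You are right, I run the rule only once, but I still don't know how to define a list SECTIONS_PRUNED at runtime. As far as I know, all Python definitions are evaluated before the workflow starts, but I would need to run a preceding rule to know which jsons will be updated. I already tried something similar (see my second update). Your second update is the best I have so far, thanks! I had tried splitting the workflow before, but using output instead of params, and that only works with the Snakemake source code changes. Still, a solution that runs snakemake only once would be preferable.
@SarahH, given which match files will be created/updated, do you have a way to determine which json files will be updated? Then I can help you.
Yes, the list of json files to update depends on the list of match files that were updated/created. Deriving the jsons from the match files is not trivial, but it should be possible.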