在snakemake中使用目录或目录中的所有文件作为输入答案

【问题标题】：use directories or all files in directories as input in snakemake在snakemake中使用目录或目录中的所有文件作为输入
【发布时间】：2021-05-19 01:34:54
【问题描述】：

我是蛇制造的新手。我想使用目录或目录中的所有文件作为snakemake的输入。例如，两个具有不同编号的目录。 bam 文件，

--M1
    M1-1.bam
    M1-2.bam
--M2
    M2-3.bam
    M2-5.bam

我只想合并M1-1.bam，M1-2.bam到M1.bam； M2-3.bam、M2-5.bam 到 M2.bam；我尝试使用wildcards和expand后跟this和this，代码如下，

config.yaml

SAMPLES:
  M1:
    - 1
    - 2
  M2:
    - 3
    - 5
rawdata: path/to/rawdata
outpath: path/to/output
reference: path/to/reference

snakemake 文件

configfile:"config.yaml"
SAMPLES=config["SAMPLES"]
REFERENCE=config["reference"]
RAWDATA=config["rawdata"]
OUTPATH=config["outpath"]

ALL_INPUT = []
for key, values in SAMPLES.items():
    ALL_INPUT.append(f"Map/bwa/merge/{key}.bam")
    ALL_INPUT.append(f"Map/bwa/sort/{key}.sort.bam")
    ALL_INPUT.append(f"Map/bwa/dup/{key}.sort.rmdup.bam")
    ALL_INPUT.append(f"Map/bwa/dup/{key}.sort.rmdup.matrix")
    ALL_INPUT.append(f"SNV/Mutect2/result/{key}.vcf.gz")
    ALL_INPUT.append(f"Map/bwa/result/{key}")
    for value in values:
        ALL_INPUT.append(f"Map/bwa/result/{key}/{key}-{value}.bam")
        for num in {1,2}:
            ALL_INPUT.append(f"QC/fastp/{key}/{key}-{value}.R{num}.fastq.gz")

rule all:
    input:
        expand("{outpath}/{all_input}",all_input=ALL_INPUT,outpath=OUTPATH)
        
rule fastp:
    input:
        r1= RAWDATA + "/{key}-{value}.R1.fastq.gz",
        r2= RAWDATA + "/{key}-{value}.R2.fastq.gz"
    output:
        a1="{outpath}/QC/fastp/{key}/{key}-{value}.R1.fastq.gz",
        a2="{outpath}/QC/fastp/{key}/{key}-{value}.R2.fastq.gz"
    params:
        prefix="{outpath}/QC/fastp/{key}/{key}-{value}"
    shell:
        """
        fastp -i {input.r1} -I {input.r2} -o {output.a1} -O {output.a2} -j {params.prefix}.json -h {params.prefix}.html
        """

rule bwa:
    input:
        a1="{outpath}/QC/fastp/{key}/{key}-{value}.R1.fastq.gz",
        a2="{outpath}/QC/fastp/{key}/{key}-{value}.R2.fastq.gz"
    output:
        o1="{outpath}/Map/bwa/result/{key}/{key}-{value}.bam"
    params:
        mem="4000",
        rg="@RG\\tID:{key}\\tPL:ILLUMINA\\tSM:{key}"
    shell:
        """
        bwa mem -t {threads} -M -R '{params.rg}' {REFERENCE} {input.a1} {input.a2} | samtools view -b -o {output.o1}
        """

## get sample index from raw fastq
key_ids,value_ids = glob_wildcards(RAWDATA + "/{key}-{value}.R1.fastq.gz")
# remove duplicate sample name, and this is useful when there is only one sample input
key_ids = list(set(key_ids))

rule merge:
    input:
        expand("{outpath}/Map/bwa/result/{key}/{key}-{value}.bam",outpath=OUTPATH, key=key_ids, value=value_ids)
    output:
        "{outpath}/Map/bwa/merge/{key}.bam"
    shell:
        """
        samtools merge {output} {input}
        """

合并命令中的 {input} 将是

M1-1.bam M1-2.bam M1-3.bam M1-5.bam M2-1.bam M2-2.bam M2-3.bam M2-5.bam

实际上，对于 M1 样本，{input} 应该是M1-1.bam M1-2.bam；对于 M2，M2-3.bam M2-5.bam。我也读过this，但我不知道是否有很多目录，每个目录都有不同的文件。

然后我尝试使用目录作为输入，merge rule

rule mergebam:
    input:
        "{outpath}/Map/bwa/result/{key}"
    output:
        "{outpath}/Map/bwa/merge/{key}.bam"
    log:
        "{outpath}/Map/bwa/log/{key}.merge.bam.log"
    shell:
        """
        samtools merge {output} `ls {input}/*bam` > {log} 2>&1
        """

但这给我MissingInputException error

Missing input files for rule merge:
/{outpath}/Map/bwa/result/M1

任何想法都会受到赞赏。

【问题讨论】：

标签： snakemake

【解决方案1】：

我还没有完全解析你的问题，但无论如何我都会试一试......在规则merge 你有：

expand("{outpath}/Map/bwa/result/{key}/{key}-{value}.bam",outpath=OUTPATH, key=key_ids, value=value_ids)

这意味着您收集了outpath、key 和value 的所有组合。

大概你想要value在每个outpath和key的所有组合。所以使用：

expand("{{outpath}}/Map/bwa/result/{{key}}/{{key}}-{value}.bam",  value=value_ids)

【讨论】：

谢谢，@dariober 是的，我想要每个键和路径中的所有值组合。 value_ids 打印 [3,5,1,2] 以及 Missing input files 将出现在 rule fastp for M1-5.R1.fastq.gz and M1-5.R2.fastq.gz。虽然SAMPLES 打印{'M1':[1,2],'M2':[3,5]} 而我真的很想得到[1,2] 的'M1' 和[3,5] 的'M2' 的组合。抱歉，这听起来是个愚蠢的问题。
你也可以使用关键字参数allow_missing=True和expand来限制双重{{}}：expand("{outpath}/Map/bwa/result/{key}/{key}-{value}.bam, value=value_ids, allow_missing=True)我更喜欢这个，因为你可以将路径名放在变量或配置文件中并添加更多通配符很容易。

【解决方案2】：

如果您将 config.yaml 更改为以下内容，是否可以使用 expand 使实现更容易？

SAMPLES:
  M1:
    - M1-1
    - M2-2
  M2:
    - M2-3
    - M2-5

【讨论】：

酷。谢谢@Rui Li。