【Question title】: Snakemake: Data-dependent conditional execution of rules, IndexError
【Posted】: 2021-08-26 20:50:35
【Question】:

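When I run the Snakemake pipeline below, I get an error: IndexError: list index out of range. I believe this happens because fastqc_pretrim is executed for every entry in SAMPLES. However, not all samples pass the basecalling QC, so only some of the files need to be processed at this step. I am trying to use checkpoints to achieve that. In the log, we can see that Snakemake tries to run fastqc_pretrim for the sample "FAQ20773_pass_barcode01_68fda206_1". However, further up in the log, FAQ20773_fail_barcode03_68fda206_0 is actually the only sample with a .fastq.gz file in its pass folder. I am not sure why the correct sample is not the one being run.

Log: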

snakemake --use-conda --jobs 1 -pr
['FAQ20773_fail_barcode01_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_2', 'FAQ20773_fail_barcode03_68fda206_0', 'FAQ20773_fail_barcode02_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_0', 'FAQ20773_pass_barcode01_68fda206_1']
The flag 'directory' used in rule guppy_basecall_persample is only valid for outputs, not inputs.        
Building DAG of jobs...                                                                                                                                                                                  
Updating job fastqc_pretrim.                                                                                                                                                                           
basecall/FAQ20773_fail_barcode01_68fda206_0                                                                                                                                                               
[]                                                                                                                                                                                                        
Updating job all.                                                                                                                                                                                         
Updating job fastqc_pretrim.                                                                                                                                                                              
basecall/FAQ20773_pass_barcode01_68fda206_2                                                                                                                                                               
[]                                                                                                                                                                                                        
Updating job all.                                                                                                                                                                                         
Updating job fastqc_pretrim.                                                                                                                                                                              
basecall/FAQ20773_fail_barcode03_68fda206_0                                                                                                                                                               
['basecall/FAQ20773_fail_barcode03_68fda206_0/pass/fastq_runid_68fda20603fe08e9e2a4eef8718997203b603497_0_0.fastq.gz']                                                                                    
Updating job all.                                                                                                                                                                                         
Updating job fastqc_pretrim.                                                                                                                                                                              
basecall/FAQ20773_fail_barcode02_68fda206_0                                                                                                                                                               
[]                                                                                                                                                                                                        
Updating job all.                                                                                                                                                                                         
Updating job fastqc_pretrim.                                                                                                                                                                              
basecall/FAQ20773_pass_barcode01_68fda206_0                                                                                                                                                               
[]                                                                                                                                                                                                        
Updating job all.                                                                                                                                                                                         
Updating job fastqc_pretrim.                                                                                                                                                                              
basecall/FAQ20773_pass_barcode01_68fda206_1                                                                                                                                                               
[]                                                                                                                                                                                                        
Updating job all.                                                                                                                                                                                         
Using shell: /usr/bin/bash                   

[Thu Aug 26 13:13:51 2021]                                                                                                                                                                                
rule fastqc_pretrim:                                                                                                                                                                                          
output: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.html, qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip                                                                        
log: logs/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.log                                                                                                                                           
jobid: 19                                                                                                                                                                                                 
reason: Missing output files: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip                                                                                                             
wildcards: sample=FAQ20773_pass_barcode01_68fda206_1                                                                                                                                                      
resources: tmpdir=/tmp                                                                                                                                                                                                                                                                                                                                                                                          
/home/hvasquezgross/miniconda3/envs/snakemake/bin/python3.9 /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py                                               
Activating conda environment: /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4                                                                     
Traceback (most recent call last):
  File "/mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py", line 41, in <module>
    shell(
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/shell.py", line 130, in __new__
    cmd = format(cmd, *args, stepout=2, **kwargs)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/utils.py", line 427, in format
    return fmt.format(_pattern, *args, **variables)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 161, in format
    return self.vformat(format_string, args, kwargs)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 165, in vformat
    result, _ = self._vformat(format_string, args, kwargs, used_args, 2)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 205, in _vformat
    obj, arg_used = self.get_field(field_name, args, kwargs)
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/string.py", line 278, in get_field
    obj = obj[i]
  File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/site-packages/snakemake/io.py", line 1536, in __getitem__
    return super().__getitem__(key)
IndexError: list index out of range
[Thu Aug 26 13:13:52 2021]                                                                                                                                                                                
Error in rule fastqc_pretrim:                                                                                                                                                                                 
jobid: 19                                                                                                                                                                                                 
output: qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.html, qc/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1_fastqc.zip                                                                        
log: logs/fastqc_pretrim/FAQ20773_pass_barcode01_68fda206_1.log (check log file(s) for error message)                                                                                                     
conda-env: /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4                                                                                                                                                                                                                                                                                              
RuleException:                                                                                                                                                                                            
CalledProcessError in line 60 of /mypool/projects/steve_frese/snakemake_guppy_basecall/Snakefile:                                                                                                         
Command 'source /home/hvasquezgross/miniconda3/bin/activate '/mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/conda/224336800b4f74953334e368c2f338c4'; /home/hvasquezgross/miniconda3/envs/snakemake/bin/python3.9 /mypool/projects/steve_frese/snakemake_guppy_basecall/.snakemake/scripts/tmpzxqxx28h.wrapper.py' returned non-zero exit status 1.                                                  
File "/mypool/projects/steve_frese/snakemake_guppy_basecall/Snakefile", line 60, in __rule_fastqc_pretrim                                                                                                 
File "/home/hvasquezgross/miniconda3/envs/snakemake/lib/python3.9/concurrent/futures/thread.py", line 52, in run                                                                                        
Shutting down, this might take some time.                                                                                                                                                                 
Exiting because a job execution failed. Look above for error message  
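The traceback above ends in `obj = obj[i]` inside Python's string formatter, which suggests that the command template indexes into the job's input list while that list is empty (the `[]` printed for this sample). The following is a minimal sketch of that failure mode in plain Python. The template text and the `render` helper are assumptions made for illustration, not the wrapper's actual code:

```python
# Hypothetical command template that indexes into the input list,
# similar in spirit to what the traceback suggests (assumption).
template = "fastqc {input[0]}"


def render(files):
    """Format the command line for a list of input files."""
    return template.format(input=files)


# A sample with a pass/*.fastq.gz file formats without problems.
print(render(["basecall/s1/pass/reads.fastq.gz"]))

# A sample whose input function returned [] fails at formatting time,
# before any external tool is started.
try:
    render([])
except IndexError as err:
    print("IndexError:", err)
```

If this reading is right, the job fails because it was scheduled with an empty input list at all, not because of anything fastqc itself does.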

Snakefile:

import glob
configfile: "config.yaml"
inputdirectory=config["directory"]
SAMPLES, = glob_wildcards(inputdirectory+"/{sample}.fast5", followlinks=True)
print(SAMPLES)
wildcard_constraints:
    sample="\w+\d+_\w+_\w+\d+_.+_\d"
##### target rules #####
rule all:
    input:
        expand('basecall/{sample}/sequencing_summary.txt', sample=SAMPLES),
        "qc/multiqc.html"

rule make_indvidual_samplefiles:
    input:
        inputdirectory+"/{sample}.fast5",
    output:
        "lists/{sample}.txt",
    shell:
        "basename {input}  > {output}"


checkpoint guppy_basecall_persample:
    input:
        directory=directory(inputdirectory),
        samplelist="lists/{sample}.txt",
    output:
        summary="basecall/{sample}/sequencing_summary.txt",
        directory=directory("basecall/{sample}/"),
    params:
        config["basealgo"]
    shell:
        "guppy_basecaller -i {input.directory} --input_file_list {input.samplelist} -s {output.directory} -c {params} --compress_fastq -x \"auto\" --gpu_runners_per_device 3 --num_callers 2 --chunks_per_runner 200"


def aggregate_input(wildcards):
    checkpoint_output = checkpoints.guppy_basecall_persample.get(**wildcards).output[1]
    print(checkpoint_output)
    exparr = expand("basecall/{sample}/pass/{runid}.fastq.gz", sample=wildcards.sample,
                    runid=glob_wildcards(os.path.join(checkpoint_output, "pass/", "{runid}.fastq.gz")).runid)
    print(exparr)
    return exparr

rule fastqc_pretrim:
    input:
        aggregate_input
    output:
        html="qc/fastqc_pretrim/{sample}.html",
        zip="qc/fastqc_pretrim/{sample}_fastqc.zip" # the suffix _fastqc.zip is necessary for multiqc to find the file. If not using multiqc, you are free to choose an arbitrary filename
    params: ""
    log:
        "logs/fastqc_pretrim/{sample}.log"
    threads: 1
    wrapper:
        "0.77.0/bio/fastqc"

rule multiqc:
    input:
        #expand("basecall/{sample}.fastq.gz", sample=SAMPLES)
        expand("qc/fastqc_pretrim/{sample}_fastqc.zip", sample=SAMPLES)
    output:
        "qc/multiqc.html"
    params:
        ""  # Optional: extra parameters for multiqc.
    log:
        "logs/multiqc.log"
    wrapper:
        "0.77.0/bio/multiqc"

【Comments】:

    Tags: snakemake


    【Answer 1】:

    I think you are making things more complicated than necessary by using checkpoint and wrapper. This is what I would do, more or less:

    rule guppy_basecall_persample:
        input:
            ...
        output:
            summary="basecall/{sample}/sequencing_summary.txt",                                                                                                                                                       
            directory=directory("basecall/{sample}/"),
        shell:
            r"""
            guppy ...
            """
    
    rule fastqc_pretrim:
        input:
            directory="basecall/{sample}/",  # plain path: the directory() flag is only valid for outputs
        output:
            html="qc/fastqc_pretrim/{sample}.html",
            zip="qc/fastqc_pretrim/{sample}_fastqc.zip"
        shell:
            r"""
            fastqc {input.directory}/pass/*.fastq.gz
            """
    

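    【Comments】:

    • I think the problem with this approach is that not all samples will pass the filtering stage. Not every sample ends up with a pass folder, so I need to conditionally use only the samples that passed. I was hoping to use checkpoints to re-evaluate the SAMPLE names and process only the ones that passed.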
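One way to express "only the samples that passed" is to compute the list of samples to aggregate from what is on disk once the checkpoint has produced its output, instead of expanding over all of SAMPLES. The following is a rough sketch under assumptions: the helper name `samples_with_pass_reads` and the input function `multiqc_input` in the comments are hypothetical, and the `basecall/{sample}/pass/*.fastq.gz` layout is taken from the question's Snakefile.

```python
import glob
import os


def samples_with_pass_reads(samples, basecall_dir="basecall"):
    """Return the samples whose pass/ folder holds at least one .fastq.gz file."""
    kept = []
    for sample in samples:
        pattern = os.path.join(basecall_dir, sample, "pass", "*.fastq.gz")
        if glob.glob(pattern):
            kept.append(sample)
    return kept


# Possible use inside the Snakefile (untested sketch, hypothetical names):
#
# def multiqc_input(wildcards):
#     # Ask for every sample's checkpoint so the function is re-evaluated
#     # only after basecalling has finished for all of them.
#     for s in SAMPLES:
#         checkpoints.guppy_basecall_persample.get(sample=s)
#     return expand("qc/fastqc_pretrim/{sample}_fastqc.zip",
#                   sample=samples_with_pass_reads(SAMPLES))
```

With an input function like this on the multiqc rule, fastqc_pretrim would only be requested for samples that actually have pass reads, so no job is created with an empty input list.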