【Posted】: 2020-12-13 10:47:50
【Question】:
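I am writing a script that uses gawk to extract lines from a GTF file, and it adds an extra space between consecutive calls to `printf`. For those unfamiliar with GTF files: they are a common genomics format made up of 9 tab-separated fields, where the 9th field stores a list of key-value attribute pairs separated by a semicolon followed by a space ("; "). The goal is to extract the lines that have a specific "gene_name"; the names to look for are passed in as a single column of text in input file 1.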
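To make the layout concrete, here is a minimal sketch of how one record breaks down. It assumes the fields really are tab-separated and that the attribute column ends with a trailing `;`, as in the example further down. The variable names `line` and `attrs` are only for illustration, and plain `awk` is used so it does not depend on gawk.

```shell
# One GTF-style record: 8 tab-separated fields, then the attribute column.
line=$(printf '17\thavana\texon\t31677807\t31678189\t.\t+\t.\tgene_id "ENSMUSG00000024041"; gene_version "10"; gene_name "Cryaa";')

# Split on tabs only, then break field 9 into key/value pairs.
attrs=$(printf '%s\n' "$line" | awk -F'\t' '{
    sub(/;[ ]*$/, "", $9)            # drop the trailing semicolon
    n = split($9, pair, /; */)       # one array element per attribute
    for (k = 1; k <= n; k++) {
        split(pair[k], kv, " ")      # kv[1] = key, kv[2] = quoted value
        gsub(/"/, "", kv[2])         # strip the double quotes
        print kv[1] "=" kv[2]
    }
}')
printf '%s\n' "$attrs"
```

For this record it prints one `key=value` line per attribute, ending with `gene_name=Cryaa`.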
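Everything in the processing script works as expected, except that an extra whitespace character is somehow introduced between the final `printf` iteration of the inner for loop and the `printf` statement that inserts the newline character.
Example input file 1: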
(base) [user@host MouseEnsembl100]$ head gene_names.txt
Cryaa
Cryab
Crygc
Example input GTF file (file 2):
(base) [user@host MouseEnsembl100]$ head example.gtf
17 ensembl_havana gene 31677807 31681733 . + . gene_id "ENSMUSG00000024041"; gene_version "10"; gene_name "Cryaa";
17 havana transcript 31677807 31681733 . + . gene_id "ENSMUSG00000024041"; gene_version "10"; transcript_id "ENSMUST00000228716"; gene_name "Cryaa";
17 havana exon 31677807 31678189 . + . gene_id "ENSMUSG00000024041"; gene_version "10"; transcript_id "ENSMUST00000228716"; gene_name "Cryaa";
17 havana CDS 31678001 31678189 . + 0 gene_id "ENSMUSG00000024041"; gene_version "10"; transcript_id "ENSMUST00000228716"; gene_name "Cryaa";
17 havana start_codon 31678001 31678003 . + 0 gene_id "ENSMUSG00000024041"; gene_version "10"; transcript_id "ENSMUST00000228716"; gene_name "Cryaa";
17 havana exon 31679559 31679681 . + . gene_id "ENSMUSG00000024041"; gene_version "10"; transcript_id "ENSMUST00000228716"; gene_name "Cryaa";
Processing script:
#!/bin/bash
#SBATCH --job-name=make_rseqc_bed
#SBATCH --mem=32000
#SBATCH --ntasks=4
TARGETS=/work/abf/MouseEnsembl100/gene_names.txt
TGTLABL=gene_name
GTFPATH=/work/abf/MouseEnsembl100/example.gtf
if [ ! -z $TGTLABL ] && [ ! -z $TARGETS ]
then
    gawk -v lbl=${TGTLABL}\
         -v FS="\t| |;"\
         -v OFS=''\
         -v ORS=''\
         '(NR == FNR) {tgt[$1]; next}
          (NR != FNR) {gsub("; ",";")}
          (NR != FNR)\
          {
            for(i=0; i<=NF; i++){
              if($i == lbl){
                gsub("\042","",$(i+1))
                if($(i+1) in tgt){
                  $(i+1)="\042"$(i+1)"\042"
                  for(j=1; j<=NF;j++){
                    if(j < 9) {
                      printf($j"\t")
                    }
                    else if( (j % 2) == 1){
                      printf($j" ")
                    }
                    else if( (j % 2) == 0 && (j+1) < NF){
                      printf($j"; ")
                    }
                    else if((j+1) == NF){
                      printf($j";LAST_FIELD")
                    }
                  }
                  printf("%s\n","NEXT LINE")
                }
              }
            }
          }' $TARGETS $GTFPATH >> extracted_targets.gtf
fi
Example output:
17 ensembl_havana gene 31677807 31681733 . + . gene_id "ENSMUSG00000024041"; gene_version "10"; gene_name "Cryaa"; NEXT LINE
17 havana transcript 31677807 31681733 . + . gene_id "ENSMUSG00000024041"; gene_version "10"; transcript_id "ENSMUST00000228716"; gene_name "Cryaa"; NEXT LINE
17 havana exon 31677807 31678189 . + . gene_id "ENSMUSG00000024041"; gene_version "10"; transcript_id "ENSMUST00000228716"; gene_name "Cryaa"; NEXT LINE
17 havana CDS 31678001 31678189 . + 0 gene_id "ENSMUSG00000024041"; gene_version "10"; transcript_id "ENSMUST00000228716"; gene_name "Cryaa"; NEXT LINE
17 havana start_codon 31678001 31678003 . + 0 gene_id "ENSMUSG00000024041"; gene_version "10"; transcript_id "ENSMUST00000228716"; gene_name "Cryaa"; NEXT LINE
17 havana exon 31679559 31679681 . + . gene_id "ENSMUSG00000024041"; gene_version "10"; transcript_id "ENSMUST00000228716"; gene_name "Cryaa"; NEXT LINE
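One possible explanation for the stray space, assuming every record ends with a trailing `;` as in the example: with `FS="\t| |;"`, that final semicolon creates an empty last field. Its index is odd, so the `(j % 2) == 1` branch prints it followed by a space just before the newline `printf`. This has not been run against the real files, and it does not account for every detail of the output shown, so treat it as a sketch. The first half below checks the field count on the attribute column alone. The second half is an alternative that splits on tabs only and prints matching records unchanged; `names.txt`, `sample.gtf` and the record with gene name "Other" are made-up stand-ins for the real inputs.

```shell
# Attribute column as it looks after the script's gsub("; ", ";") step.
attr='gene_id "ENSMUSG00000024041";gene_version "10";gene_name "Cryaa";'

# With the script's FS, the trailing ";" produces an empty final field.
nf=$(printf '%s\n' "$attr" | awk -v FS='\t| |;' '{ print NF }')
last=$(printf '%s\n' "$attr" | awk -v FS='\t| |;' '{ print "[" $NF "]" }')
echo "NF=$nf last=$last"

# Alternative: split on tabs only and print matching records unchanged.
# names.txt and sample.gtf are hypothetical stand-ins for the real inputs.
workdir=$(mktemp -d)
printf 'Cryaa\n' > "$workdir/names.txt"
printf '17\thavana\texon\t31677807\t31678189\t.\t+\t.\tgene_id "ENSMUSG00000024041"; gene_name "Cryaa";\n' > "$workdir/sample.gtf"
printf '7\thavana\texon\t1\t2\t.\t+\t.\tgene_id "X"; gene_name "Other";\n' >> "$workdir/sample.gtf"

hits=$(awk -F'\t' '
    NR == FNR { tgt[$1]; next }              # first file: target gene names
    match($9, /gene_name "[^"]+"/) {         # second file: locate the gene_name attribute
        name = substr($9, RSTART + 11, RLENGTH - 12)
        if (name in tgt) print               # $0 is untouched, so no re-joining
    }' "$workdir/names.txt" "$workdir/sample.gtf")
printf '%s\n' "$hits"
rm -r "$workdir"
```

The first command reports `NF=7` with an empty last field (`[]`). Because the alternative never assigns to a field, awk does not rebuild the record, so the original tabs and "; " separators come through as they were. If the field-by-field loop is kept, `printf "%s\t", $j` is safer than `printf($j"\t")`, since the latter treats the data as a format string.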
【Discussion】:
Tags: awk whitespace bioinformatics removing-whitespace