【发布时间】:2021-06-21 11:49:09
【问题描述】:
我有 1,500 个 fasta 文件,其中包含许多蛋白质片段。我的目标是将这些片段分成单个文件,并以直观的方式命名这些文件。
下面是我称之为 plate9.H7.faa 的 fasta 文件示例:
>39_fragment_4_295 (310978..311196) 1 None hypothetical protein
MQTATKQETYDRTMKVTLAVKANGGSVTVQIQAGDNWITTDTFWKDGGYQLSIPPATIRYVPAAGAAFEVYA*
>39_fragment_4_296 (311193..312437) 1 VOG01158 REFSEQ hypothetical protein
MSLLVNPIPRRQPIRRGLGLLGDSFSGNCHTIAATAFGTEAYGYAGWIAARTGLFPSYVDNQGKLGDHTGQFLARLPACIASSTADLWLLLSRTNDSTTAGMSLADTKANVMKIVTAFLNTPGKYLIIGTGTPRFGSRALTGQALADAIAYKDWVLSYVSQFVPVVNIWDGFTEAMTVEGLHPNLLGAEFISSRVVPIITANFEFPGIPLPTDAGDIYSAIRPFGCLNANPLLAGTGGTLPAGVNAAAGSVLADGYKAVGSGLTGITTRWFKEPAAYGEAQCIELRGNMAAAGGYIYMQPTANVVQTNLAAGDVIEMVSAVEIMGSSRGILAWEAELTITKTVSGAASTFYYRSMDKYQEPFTMPASFSGALETQRGTIDLTETVITSRMGLYLAAGVPQDSTVKAAQFGIRKV*
>56_fragment_9_667 (768674..769846) -1 K14059 int; integrase
MGRDGRGVRAVSDTSIEITFMYRGVRCRERITLKPSPTNLKKAEQHKAAIEHAISIGAFDYSVTFPGSPRAAKFAPEANRETVAGFLTRWLDGKKRHVSSSTFVGYRKLVELRLVPALGERMVVDLKRKDVRDWLSTLEVSNKTLSNIQSCLRSALNDAAEEELIEVNPLAGWTYSRKEAPAKDDDVDPFSPEEQQAVLAALNGQARNMMQFALWTGLRTSELVALDWGDIDWLREEVMVSRAMTQAAKGQAEVPKTAAGRRSVKLLRPAMEALKAQKAHTFLADAEVFQNPRTLQRWAGDEPIRKTMWVPAIKKAGVNYRRPYQTRHTYASMMLSAGEHPMWVAKQMGHSDWTMIARVYGRWMPYWDDIAGTKAVSQWAENAHESSDSK*
>56_fragment_9_668 (770054..770281) -1 PF02599.16 Global regulator protein family
MLCLSRRVGESIVIGDNIKITVISGRDGQIRLGIDAPAELAVDRSEVRTAKLATPCGIGLKLRTVAESGARDDEG*
>56_fragment_9_669 (770485..770697) 1 None hypothetical protein
MECTTTADEVYGPRNAKLGKRAVDGNIWSGTTMIFRIIDDRVYSMHEQYLGRLKYGMAMTDRGELIFIVR*
>56_fragment_9_670 (770705..771487) -1 VOG00563 sp|Q05292|VG77_BPML5 Gene 77 protein
MSESTIDPKKLERAIRKIKHCLALSQSSNENEAATAMRQAQALMREYHLTETDVKVSDVGEVESSMSRAARRPLWDQQLSAVVATVFNVKALRYTHWCETKKNRVERAKFVGVSPAQHIALYAYETLLAKLSQARNAYVAGVRAGKFRSSYSAPTAGDHFAIAWVFAVESKLQQLVPRGEENTTPEYKGAGPGLVAVEAQHQALIDSYLADKQVGKARKVRGSELDLNAQIAGMLAGTKVDLHAGLANGAEHAQVLPASA*
到目前为止,我已经能够使用此命令将文件拆分为多个文件:
for x in *.faa; do csplit -z $x '/>/' '{*}'; done
然后根据它们在头部的片段重命名:
for file in xx*; do mv "$file" `head -1 "$file" | cut -d$'\t' -f 1`_$x.fasta; done
然后重命名每个文件,使其不包含每个文件中的“>”,并为其分配原始文件名:
for i in *.fasta; do mv $i `echo $i | cut -c 2-`; done
我的问题是这适用于单个文件(因为在我正在执行此操作的目录中有临时文件,它们暂时称为 xx00、xx01、xx02、xx03 等。
我觉得我的解决方案是遍历每个 fasta 文件并在开始下一个 fasta 文件之前连续执行所有这些 for 循环,我觉得这必须是我从未做过的嵌套 for 循环我。任何关于我能做什么的指导将不胜感激。
【问题讨论】:
标签: linux bash for-loop nested-loops fasta