Extracting a specific range of FASTA sequences from a file: answers

【Question title】: Extract a specific range of FASTA sequences from a file
【Posted】: 2020-04-05 11:08:37
【Question description】:

I am trying to extract the sequences in a specific range. The command I am using can only extract the first n sequences of a FASTA file:

awk "/^>/ {n++} n>2000 {exit} {print}" Name.faa > Name_2k_cds.faa

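What should I do if I want to extract the sequences in a specific range, for example 2000 to 3000? Is there a simple edit to my existing command that would do it?

Thanks!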

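【Comments】:

Welcome to SO. Please add a sample of the input and the expected output to your question. That said, thank you for showing your own attempt in the question.
Could you please check my answer and let me know whether it helps?

Tags: linux bash awk command-line fasta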

【Solution 1】:

You can try this:

sed -n '2000,3000p' Name.faa > Name_2k_to_3k_cds.faa

Explanation:

sed -n       # suppress automatic printing of pattern space
'2000,3000p' # print only lines 2000 to 3000

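【Comments】:

IMHO, if I understand the question correctly, the OP does not want to print by line number. The OP wants to count the lines that start with > and print the records whose count runs from 2000 to 3000.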
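To illustrate the distinction raised in this comment, here is a small sketch on a made-up four-record file (the temporary file and its contents are assumptions for illustration only, and the range is scaled down to 2 to 3). A line range from sed cuts wherever the line numbers fall, while counting > headers selects whole records.

```shell
# Build a tiny hypothetical FASTA file: 4 records, record 2 spans two sequence lines.
demo=$(mktemp)
printf '%s\n' '>seq1' 'AAAA' '>seq2' 'CCCC' 'CCGG' '>seq3' 'GGGG' '>seq4' 'TTTT' > "$demo"

# Line-based: prints lines 2-3 of the file, regardless of record boundaries.
by_line=$(sed -n '2,3p' "$demo")

# Record-based: prints every line from the 2nd header up to (not including) the 4th.
by_record=$(awk '/^>/{n++} n>3{exit} n>=2' "$demo")

printf 'by line:\n%s\n\nby record:\n%s\n' "$by_line" "$by_record"
rm -f "$demo"
```

The line-based result starts in the middle of record 1, whereas the record-based result contains records 2 and 3 in full.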

【Solution 2】:

Could you please try the following.

awk '/^>/{n++} n>=2000 && n<=3000;n==3000{exit}' Name.faa > Name_2k_cds.faa

Explanation: a line-by-line explanation of the above code.

awk '                             ## Start of the awk program.
/^>/{n++}                         ## If a line starts with ">", it is a header: increment the record counter n.
n>=2000 && n<=3000                ## If n is between 2000 and 3000 inclusive, print the current line (default action).
n==3000{                          ## Once n reaches 3000, stop; there is no need to read the rest of the input file.
  exit                            ## Exit the program.
}
' Name.faa > Name_2k_cds.faa      ## Input file name, with the output redirected to a new file.
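As a quick way to check how these three rules interact, the same program can be run with a scaled-down range of 2 to 3 on a made-up four-record file (the file contents are hypothetical, chosen only for this sketch). The rules run in order for every input line, so the exit rule fires on the same line that raises n to the upper bound.

```shell
# Hypothetical test input: 4 records with one sequence line each.
demo=$(mktemp)
printf '%s\n' '>seq1' 'AAAA' '>seq2' 'CCCC' '>seq3' 'GGGG' '>seq4' 'TTTT' > "$demo"

# Same three rules as above, with 2000/3000 replaced by 2/3.
out=$(awk '/^>/{n++} n>=2 && n<=3; n==3{exit}' "$demo")

printf '%s\n' "$out"
rm -f "$demo"
```

On this input the output ends with the >seq3 header line, because awk exits before reading the sequence line that follows it.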

【Comments】:

【Solution 3】:

A small addition to the solution proposed by @RavinderSingh13:

awk '/^>/{n++} n>=2000 && n<=3000;n==3001{exit}' Name.faa > Name_2k_cds.faa

This ensures that sequence 3000 is also written to the new file; the original solution prints the header of sequence 3000 but not the sequence itself.
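If the range needs to change often, the same idea can be wrapped in a small shell function that passes the bounds to awk with -v. This is only a sketch: the function name extract_records and the variable names first and last are made up here and do not come from any of the answers.

```shell
# extract_records FIRST LAST FILE
# Print FASTA records FIRST..LAST (1-based, inclusive), counting ">" header lines.
extract_records() {
    awk -v first="$1" -v last="$2" '
        /^>/ { n++ }            # a header line starts the next record
        n > last { exit }       # past the range: stop reading the file
        n >= first              # inside the range: print header and sequence lines
    ' "$3"
}
```

With this in place, `extract_records 2000 3000 Name.faa > Name_2k_cds.faa` should behave like the command above. Testing `n > last` instead of `n == 3001` avoids having to compute the upper bound plus one by hand.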

【Comments】: