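【Question title】: How to fill in missing sequence lines in a TSV file
【Posted】: 2017-06-01 14:46:15
【Question description】:

I am still a beginner, so apologies if the answer is obvious, and apologies for the messy code. I have files with about ten thousand lines each. I slide over each file with a windowing technique, so I need every window to be present. Some of my input files are missing certain lines, so I tried to write Python code that adds those lines, with the values I want, to make the file complete. The code looks like this: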

#!/usr/bin/env python

outfile = open ("missing_test.txt", "w")

with open("add_missing.txt", "r") as file:
    last_line = 0   #This is where it starts for bin 1
    lines = []
    header_line = next(file)
    outfile.write(header_line)
    CHROM = 'BABA_1'
    for line in file:     #go through every line to check its existence and rewrite to new file
        nums = line.split("\t")
        num1 = nums[0]        #no integer because this is a string: name individual
        num2 = int(nums[1])   #integer for window
        num3 = int(nums[2])   #integer for coverage (here always 10000 to meet the threshold)
        num4 = int(nums[3])   #integer for SNP count   
        if num1 == CHROM:     #
            while num2 != last_line + 10000:
                #A line is missing, so a new line is added with 0 SNPs:
                NUM2 = last_line + 10000   # New window, the one that was missing
                NUM4 = 0   #0 SNPs found
                #lines.append((num1, NUM2, num3, NUM4))
                OUTLINE = "%s\t%s\t%s\t%s" % (num1, NUM2, num3, NUM4) #write new line to outfile       
                outfile.write(OUTLINE + "\n")
                last_line += 10000
            lines.append((num1,num2,num3,num4))
            last_line += 10000    #also add 10000 here otherwise the while loop makes no sense
            outline = "%s\t%s\t%s\t%s" % (num1, num2, num3, num4)
            outfile.write(outline + "\n")   #write all existing lines to outfile

        else:
            CHROM = num1
            last_line = 0

outfile.close()        

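So this works fine as long as the first window of the first "CHROM" equals 0, but that is not always the case. In the latter case, the loop becomes infinite. For example, the input and the DESIRED output look like this:

Input: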

indiv   window  coverage    SNP
BABA_1  20000   10000   7
BABA_1  30000   10000   1
BABA_1  50000   10000   2
BABA_1  60000   10000   3
BABA_1  80000   10000   1
BABA_10 20000   10000   1
BABA_10 30000   10000   16
BABA_10 80000   10000   9

Desired output:

indiv   window  coverage    SNP
BABA_1  10000   10000   0
BABA_1  20000   10000   7
BABA_1  30000   10000   1
BABA_1  40000   10000   0
BABA_1  50000   10000   2
BABA_1  60000   10000   3
BABA_1  70000   10000   0
BABA_1  80000   10000   1
BABA_10 10000   10000   0
BABA_10 20000   10000   1
BABA_10 30000   10000   16
BABA_10 40000   10000   0
BABA_10 50000   10000   0
BABA_10 60000   10000   0
BABA_10 70000   10000   0
BABA_10 80000   10000   9

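I have been trying hard to find a way to make this while loop work without running forever, but I really cannot see my mistake. Can anyone tell me how to fix this?

Any help is much appreciated, thanks in advance!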
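
A note on why the loop in the code above can fail to terminate: `while num2 != last_line + 10000` only stops when the expected window lands exactly on `num2`. If `num2` is already smaller than the expected value (for example a window of 0, or a repeated window), the expected value keeps moving away from it. The sketch below is not taken from the thread; the helper name `missing_windows` is hypothetical, and it assumes windows are spaced 10000 apart. It lists the gaps with `range()`, which simply yields nothing in those cases:

```python
def missing_windows(window, last_window, step=10000):
    """Return the windows strictly between last_window and window, spaced by step."""
    # range() yields nothing when its start is already at or past `window`,
    # so this cannot run forever the way `while window != expected` can.
    return list(range(last_window + step, window, step))


print(missing_windows(50000, 30000))  # [40000]
print(missing_windows(20000, 0))      # [10000]
print(missing_windows(0, 0))          # []
```

The same effect can be had by keeping the while loop and comparing with `>` instead of `!=`.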

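【Comments】:

  • Basically, you want to exit the while loop if "CHROM" is not equal to 0, right?
  • No, CHROM is really just a string, and as soon as the string changes I want to start over.
  • You need to watch out for case sensitivity. num1 and NUM1 are not the same.
  • Let me get this straight. Your windows always run from 10000 to 80000 in steps of 10000, right? And the number of such sets equals the number of distinct BABA_* values?
  • Hi, I realize that, and it is actually intentional: the uppercase variables only appear when a line is missing and I add it myself. NUM2 is the new window, NUM4 is always 0 in that case, and the rest stays the same.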

Tags: python loops while-loop infinite


【Solution 1】:

Try the following:

#!/usr/bin/env python

def write_line(outfile, indiv, window, coverage, snp):
    # Write one tab-separated record to the output file.
    outfile.write("%s\t%s\t%s\t%s\n" % (indiv, window, coverage, snp))

# Opening both files in one `with` ensures the output is flushed and closed.
with open("add_missing.txt", "r") as infile, open("missing_test.txt", "w") as outfile:
    lines = infile.readlines()
    # Copy the header line through unchanged.
    write_line(outfile, *lines.pop(0).rstrip().split("\t"))
    first_line = lines[0].split("\t")
    last_indiv = first_line[0]
    last_window = int(first_line[1])

    for line in lines:
        indiv, window, coverage, snp = line.split("\t")
        window = int(window)
        coverage = int(coverage)
        snp = int(snp)

        if indiv == last_indiv:
            # If the current window is higher than expected,
            # insert a line with the missing window.
            # Repeat until we get to the expected window.
            while window > last_window + 10000:
                write_line(outfile, indiv, last_window + 10000, coverage, 0)
                last_window += 10000
            last_window = window
        else:
            # New individual: restart tracking from its first window.
            last_indiv = indiv
            last_window = window
        write_line(outfile, indiv, window, coverage, snp)

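It does not assume that any particular window number comes first for a given indiv, because you did not define that behavior and your comments about it are rather confusing.


Contents of missing_test.txt after running this script: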

indiv   window  coverage    SNP
BABA_1  20000   10000   7
BABA_1  30000   10000   1
BABA_1  40000   10000   0
BABA_1  50000   10000   2
BABA_1  60000   10000   3
BABA_1  70000   10000   0
BABA_1  80000   10000   1
BABA_10 20000   10000   1
BABA_10 30000   10000   16
BABA_10 40000   10000   0
BABA_10 50000   10000   0
BABA_10 60000   10000   0
BABA_10 70000   10000   0
BABA_10 80000   10000   9
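
If every indiv should start at window 10000, as in the desired output of the question, one possible variation is to reset the expected window whenever the indiv changes. The following is only a sketch under assumptions that the thread does not confirm: rows are sorted by indiv and then by window, windows are positive multiples of 10000, and the function name `fill_missing` is made up for illustration. It works on tuples rather than files so that the logic is easy to check:

```python
STEP = 10000  # window size used throughout the question


def fill_missing(rows, step=STEP):
    """Return rows with a zero-SNP row added for every missing window.

    rows: iterable of (indiv, window, coverage, snp) tuples, assumed sorted
    by indiv and then by window. Each indiv is assumed to start at `step`.
    """
    filled = []
    last_indiv = None
    last_window = 0
    for indiv, window, coverage, snp in rows:
        if indiv != last_indiv:
            # New individual: expect its first window to be `step` again.
            last_indiv = indiv
            last_window = 0
        # Add every window strictly between the previous one and this one.
        for missing in range(last_window + step, window, step):
            filled.append((indiv, missing, coverage, 0))
        filled.append((indiv, window, coverage, snp))
        last_window = window
    return filled


rows = [
    ("BABA_1", 20000, 10000, 7),
    ("BABA_1", 30000, 10000, 1),
    ("BABA_1", 50000, 10000, 2),
    ("BABA_10", 20000, 10000, 1),
]
for row in fill_missing(rows):
    print("%s\t%s\t%s\t%s" % row)
```

Note that this sketch only fills gaps before and between existing windows. It does not pad an indiv past its last observed window.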

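【Comments】:

  • This looks much tidier, thank you! Does it work for you on the given input subset? I am not getting any output written to the output file (I added a close statement at the bottom).
  • Shame on me for not trying it. I have now tested it and fixed a few leftover bugs.
  • If you need all entries to start at window 0 or 10000, I will leave that to you as an exercise :-D
  • Thank you very much, I will get it working with the correct start bin/window. Thanks again!

【Solution 2】:

You can use the following approach: first build a list of empty placeholder rows, then assign any existing entries into it, and finally write the rows to the output: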

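import csv
import itertools

# newline='' lets the csv module control line endings (Python 3).
with open('add_missing.txt', newline='') as f_input, open('missing_test.txt', 'w', newline='') as f_output:
    csv_input = csv.reader(f_input, delimiter='\t', skipinitialspace=True)
    csv_output = csv.writer(f_output, delimiter='\t')
    csv_output.writerow(next(csv_input))  # copy the header row

    # groupby only merges consecutive rows, so the input must be grouped by indiv.
    for k, g in itertools.groupby(csv_input, lambda x: x[0]):
        # One placeholder row per window from 10000 to 80000, each with 0 SNPs.
        empty = [[k, x * 10000, 10000, 0] for x in range(1, 9)]
        for row in g:
            # Overwrite the placeholder with the real row; // keeps the index an integer.
            empty[int(row[1]) // 10000 - 1] = row

        csv_output.writerows(empty)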

This gives you:

indiv   window  coverage    SNP
BABA_1  10000   10000   0
BABA_1  20000   10000   7
BABA_1  30000   10000   1
BABA_1  40000   10000   0
BABA_1  50000   10000   2
BABA_1  60000   10000   3
BABA_1  70000   10000   0
BABA_1  80000   10000   1
BABA_10 10000   10000   0
BABA_10 20000   10000   1
BABA_10 30000   10000   16
BABA_10 40000   10000   0
BABA_10 50000   10000   0
BABA_10 60000   10000   0
BABA_10 70000   10000   0
BABA_10 80000   10000   9
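
The placeholder approach above hard-codes eight windows per indiv. If the number of windows is not known in advance, one option is to derive it from the data. The sketch below is an illustration rather than part of the original answer: the function name `fill_groups` is hypothetical, and it assumes a non-empty input whose rows are lists of strings (as `csv.reader` produces), already grouped by indiv, with windows that are positive multiples of 10000.

```python
import itertools

STEP = 10000  # window size assumed from the question


def fill_groups(rows, step=STEP):
    """Pad every indiv with zero-SNP rows up to the highest window in the data."""
    rows = list(rows)
    # Number of slots per indiv, taken from the largest window seen anywhere.
    n_windows = max(int(r[1]) for r in rows) // step
    filled = []
    # groupby only merges consecutive rows, so rows must be grouped by indiv.
    for indiv, group in itertools.groupby(rows, key=lambda r: r[0]):
        # Placeholder rows; coverage is set to the window size, as in the answer.
        slots = [[indiv, str(i * step), str(step), "0"]
                 for i in range(1, n_windows + 1)]
        for row in group:
            slots[int(row[1]) // step - 1] = list(row)
        filled.extend(slots)
    return filled


sample = [
    ["BABA_1", "20000", "10000", "7"],
    ["BABA_1", "30000", "10000", "1"],
    ["BABA_10", "10000", "10000", "4"],
]
for row in fill_groups(sample):
    print("\t".join(row))
```

Because the whole input is read into a list first, this trades some memory for not having to know the window count up front.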

【Comments】:

  • You, sir, are a genius! I had never heard of itertools.groupby until now.