【发布时间】:2014-11-20 17:09:11
【问题描述】:
我已经编程了几个月,所以我不是专家。我有两个巨大的文本文件(omni,~20 GB,~2.5M 行;dbSNP,~10 GB,~60M 行)。它们有前几行,不一定是制表符分隔,以“#”(标题)开头,其余行组织在制表符分隔的列(实际数据)中。
每行的前两列包含染色体编号和染色体上的位置,而第三列包含识别码。在“omni”文件中,我没有 ID,因此我需要在 dbSNP 文件(数据库)中找到位置,并创建第一个包含 ID 的文件的副本。
由于内存限制,我决定逐行读取这两个文件,然后从读取的最后一行重新开始。我对我的代码的效率不满意,因为我觉得它比它可能的要慢。我很确定这是我的错,因为缺乏经验。有没有办法使用 Python 让它更快?问题可能是文件的打开和关闭吗?
我通常像这样在 GNOME 终端(Python 2.7.6,Ubuntu 14.04)中启动脚本:
python -u Replace_ID.py > Replace.log 2> Replace.err
非常感谢您。
全方位 (Omni example):
...
#CHROM POS ID REF ALT ...
1 534247。 C T ...
...
dbSNP (dbSNP example):
...
#CHROM POS ID REF ALT ...
1 10019 rs376643643 TA T ...
...
输出应该与 Omni 文件完全相同,但位置后带有 rs ID。
代码:
SNPline = 0 #line in dbSNP file
SNPline2 = 0 #temporary copy
omniline = 0 #line in omni file
line_offset = [] #beginnings of every line in dbSNP file (stackoverflow.com/a/620492)
offset = 0
with open("dbSNP_build_141.vcf") as dbSNP: #database
for line in dbSNP:
line_offset.append(offset)
offset += len(line)
dbSNP.seek(0)
with open("Omni_replaced.vcf", "w") as outfile:
outfile.write("")
with open("Omni25_genotypes_2141_samples.b37.v2.vcf") as omni:
for line in omni:
omniline += 1
print str(omniline) #log
if line[0] == "#": #if line is header
with open("Omni_replaced.vcf", "a") as outfile:
outfile.write(line) #write as it is
else:
split_omni = line.split('\t') #tab-delimited columns
with open("dbSNP_build_141.vcf") as dbSNP:
SNPline2 = SNPline #restart from last line found
dbSNP.seek(line_offset[SNPline])
for line in dbSNP:
SNPline2 = SNPline2 + 1
split_dbSNP = line.split('\t')
if line[0] == "#":
print str(omniline) + "#" + str(SNPline2) #keep track of what's happening.
rs_found = 0 #it does not contain the rs ID
else:
if split_omni[0] + split_omni[1] == split_dbSNP[0] + split_dbSNP[1]: #if chromosome and position match
print str(omniline) + "." + str(SNPline2) #log
SNPline = SNPline2 - 1
with open("Omni_replaced.vcf", "a") as outfile:
split_omni[2] = split_dbSNP[2] #replace the ID
outfile.write("\t".join(split_omni))
rs_found = 1 #ID found
break
else:
rs_found = 0 #ID not found
if rs_found == 0: #if ID was not found in dbSNP, then:
with open("Omni_replaced.vcf", "a") as outfile:
outfile.write("\t".join(split_omni)) #keep the line unedited
else: #if ID was found:
pass #no need to do anything, line already written
print "End."
【问题讨论】:
-
您能否提供更多信息以进行测试?例如,为每个文件提供 5 行和显示替换过程完成的 5 行输出?它可能会让您更好地理解您提供的代码,并且还可以测试一些修改并验证一致性。
-
@ArthurVaïsse 谢谢你的建议。我编辑了原始消息。我只能添加两个链接,所以这里是输出文件示例:lucabetti.altervista.org/file/Output_example.vcf
标签: python performance text replace large-files