Building a dictionary using two files

【Question title】: Building a dictionary using two files
【Posted】: 2013-07-02 20:44:36
【Question】:

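I am very new to Python and I have been trying to write a script that uses two files. File 1 contains a list of ID numbers, one per line.

The other file has some lines with a single ID number, followed by lines made up of an ID number and then other ID numbers that end in + or -: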

951450
8951670
8951800
8951863
8951889
9040311
9255087 147+ 206041- 8852164- 4458078- 1424812- 3631438- 8603144+ 4908786- 4780663+ 4643406+ 3061176- 7523696- 5876052- 163881- 6234800- 395660-
9255088 149+ 7735585+ 6359867+ 620034- 4522360- 2810885- 3705265+ 5966368- 7021344+ 9165926- 2477382+ 4015358- 2497281+ 9166415+ 6837601-
9255089 217+ 6544241+ 5181434+ 4625589+ 7433598+ 7295233+ 3938917+ 4109401+ 2135539+ 4960823+ 1838531+ 1959852+ 5698864+ 1925066+ 8212560+ 3056544+ 82N 1751642+ 4772695+ 2396528+ 2673866+ 2963754+ 5087444+ 977167+ 2892617- 7412278- 6920479- 2539680- 4315259- 8899799- 733101- 5281901- 7055760+ 8508290+ 8559218+ 7985985+ 6391093+ 2483783+ 8939632+ 3373919- 924346+ 1618865- 8670617+ 515619+ 5371996+ 2152211+ 6337329+ 284813+ 8512064+ 3469059+ 3405322+ 1415471- 1536881- 8034033+ 4592921+ 4226887- 6578783-

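I want to build a dictionary from these two files. My script must search file 2 for the ID numbers from file 1 and append the matching lines as values, with the numbers from file 1 as keys. Each key may therefore have several values. I only want to search the lines in file 2 that contain more than one number (if len(x) > 1).

The output would look something like: 1000047: 9292540 1000047+ 9126889+ 3490727- 8991434+ 4296324+ 9193432- 3766395+ 9193431+ 8949379- (I need to print every ID number from File1 as a key, and as its values the whole lines that contain that ID number)

Here is my (very wrong) script: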

#!/usr/bin/python

f = open('file1')
z = open('file2')
d = dict() # d is an empty dictionary

for l in f:
    p = l.rstrip()

d[p] = list()       # sets the keys in the dictionary as p (IDs with newline characters stripped)
y = z.readlines() # retrieves a string from the path file 
s = "".join(y)    # makes a string from y 
x = str.split(s)  #splits the path file at white spaces

if len(x) > 1:   # only the lines that include contigs IDs that were used to make another contig

    for lines in y:
        k = lines.rstrip()    
    w = tuple(x)    # convert list x into a tuple called w
    for i in w:         
        if i[:-1] in d:   
            d[p].append(k) 
print d

【Comments】:

Please give an example of the desired output.
Provide the desired output by editing the original question, not in the comments.

Tags: python dictionary python-3.x

【Solution 1】:

Try:

#!/usr/bin/python

f = open('file1')
z = open('file2')
d = dict() # d is an empty dictionary

for l in f:
    p = l.rstrip()
    d[p] = list()       # Change #1

f.close()
# Now we have a dictionary with the keys from file1 and empty lists as values
for line in z:
    items = line.split() # items will be a list from 1 line
    if len(items) > 1: # more than the initial item in the list
        k = items[0]   # first item on the line
        for i in items[1:]: # rest of the items
            if i[:-1] in d: # is it in the dict (trailing +/- stripped)?
                d[i[:-1]].append(k)  # add the k value under the stripped ID

z.close()
print(d)

Note that this code is untested, but it should not be far off.
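The question asks for the whole line as the value, whereas the code above stores only the first item of each matching line. Below is a minimal sketch of a whole-line variant. It assumes every ID after the first item on a line carries exactly one trailing character (+, - or similar) and that the first item is the line's own ID and should not be matched. The function name build_index and the sample data are hypothetical, not part of the original answer.

```python
def build_index(id_lines, path_lines):
    """Map each ID from file 1 to the whole lines of file 2 that mention it."""
    # Keys are the IDs from file 1 with whitespace stripped; blank lines are skipped.
    index = {line.strip(): [] for line in id_lines if line.strip()}
    for line in path_lines:
        tokens = line.split()
        if len(tokens) <= 1:              # a lone ID: nothing to search
            continue
        # Drop the last character (assumed to be the +/- marker) of every
        # token after the first, then keep only the IDs that are dictionary keys.
        mentioned = {token[:-1] for token in tokens[1:]}
        for key in index.keys() & mentioned:
            index[key].append(line.rstrip())
    return index


# Hypothetical sample data standing in for the two files.
ids = ["147\n", "620034\n", "999\n"]
paths = ["8951670\n", "9255087 147+ 206041-\n", "9255088 149+ 620034- 147-\n"]
index = build_index(ids, paths)
for key, lines in index.items():
    print(key, lines)
```

Because the function only iterates over its arguments, open file objects can be passed in directly, for example build_index(f, z) with the two files from the question. An ID with no matches keeps an empty list, so every ID from file 1 still appears as a key.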

【Comments】:

Thanks! But what is "item"?

【Solution 2】:

Is this what you are looking for? (I have not tested it...)

#!/usr/bin/python

f = open('file1')
z = open('file2')
d = dict() # d is an empty dictionary

for l in f.readlines():
    for l2 in z.readlines():
        if l.rstrip() in l2.rstrip():
            d[l] = l2
    z.seek(0, 0)

f.close()
z.close()
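The z.seek(0, 0) call matters because readlines() leaves the file position at the end of the file, so a second readlines() on the same handle returns nothing. A small sketch of that behaviour follows; io.StringIO is used here as a stand-in for an open text file, and the two lines of data are made up.

```python
import io

# io.StringIO stands in for an open text file so the sketch needs no files on disk.
z = io.StringIO("9255087 147+\n9255088 149+\n")

first_pass = z.readlines()     # reads both lines and leaves the position at the end
second_pass = z.readlines()    # nothing left to read, so this is an empty list
z.seek(0, 0)                   # move back to the start of the "file"
third_pass = z.readlines()     # both lines are available again

print(len(first_pass), len(second_pass), len(third_pass))  # 2 0 2
```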

If you do not want to deal with the file pointer, here is a simpler version of the same code:

f = open("file1")
z = open("file2")
d = dict() # d is an empty dictionary

file1_lines = f.readlines()
file2_lines = z.readlines()
for l in file1_lines:
    for l2 in file2_lines:
        if l.rstrip() in l2.rstrip():
            d[l] = l2

print(d)
f.close()
z.close()
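Two things about the code above are worth checking against real data. The in test on strings is a substring test, so a short ID can match inside a longer one, and d[l] = l2 keeps only the last matching line for each key. The sketch below shows one possible way around both, assuming IDs are separated by whitespace and may end in + or -. The helper name mentions and the sample lines are hypothetical, and unlike Solution 1 this helper also compares against the first item on the line.

```python
def mentions(id_number, line):
    """Return True if id_number is a whole token of line, ignoring trailing + or -."""
    return any(token.rstrip('+-') == id_number for token in line.split())


line = "9255087 8951470+ 206041-"
print("147" in line)             # True: "147" is a substring of "8951470"
print(mentions("147", line))     # False: no whole token equals "147"
print(mentions("206041", line))  # True

# Collect every matching line per key instead of overwriting the previous one.
d = {}
for id_line in ["147\n", "206041\n"]:
    key = id_line.rstrip()       # strip the newline so the key is just the ID
    for l2 in [line + "\n", "9255088 147- 206041+\n"]:
        if mentions(key, l2):
            d.setdefault(key, []).append(l2.rstrip())
print(d)
```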

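【Comments】:

Use with open('file1') as f, open('file2') as z for better file I/O.
This gives me the output: {'1000012\n': '9279863\t663068- 3473145+ 2405965- 5379610- 9170289- 2670268+ 8176642+ 1000012- 616493+ 62 so it only searches for the first ID number in file1 and not for the rest
Yes, after the file has been read in the inner loop, the file pointer has to be reset to the beginning of the file. I have updated the code above with seek to reset the file pointer.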
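The with form suggested in the comments could look roughly like the sketch below. It writes two tiny files into a temporary directory so that it runs on its own; the file contents are made up, and in the real script the paths would simply be 'file1' and 'file2'.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path1 = os.path.join(tmp, 'file1')
    path2 = os.path.join(tmp, 'file2')
    # Create small stand-in files (hypothetical data).
    with open(path1, 'w') as f:
        f.write("147\n206041\n")
    with open(path2, 'w') as z:
        z.write("8951670\n9255087 147+ 206041-\n")

    # Both files are closed automatically when this block ends, even on an error.
    with open(path1) as f, open(path2) as z:
        file1_lines = f.read().split()        # IDs without newlines
        file2_lines = z.read().splitlines()   # lines without newlines

    closed_after = f.closed and z.closed

print(file1_lines, file2_lines, closed_after)
```

Reading each file into a list once also avoids the file pointer problem, since the lists can be looped over as many times as needed.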