Join Files with common columns in Python (answers)

【Question title】: Join Files with common columns in Python
【Posted】: 2014-04-18 16:01:37
【Question】:

I'm having trouble joining two large files on 5 common columns and returning the rows where those 5-tuples are identical. Here is what I mean:

File 1:

132.227 49202 107.21 80
132.227 49202 107.21 80
132.227 49200 107.220 80
132.227 49200 107.220 80
132.227 49222 207.171 80
132.227 49339 184.730 80
132.227 49291 930.184 80
............
............
............

The file contains many more lines than just these...

File 2:

46.109498000 132.227 49200 107.220 80 17 48 
46.927339000 132.227 49291 930.184 80 17 48 
47.422919000 253.123 1985 224.300 1985 17 48
48.412761000 132.253 1985 224.078 1985 17 48
48.638454000 132.127 1985 232.123 1985 17 48
48.909658000 132.227 49291 930.184 80 17 65
48.911360000 132.227 49200 107.220 80 17 231
............
............
............

Output file:

46.109498000 132.227 49200 107.220 80 17 48 
46.927339000 132.227 49291 930.184 80 17 48 
48.909658000 132.227 49291 930.184 80 17 65
48.911360000 132.227 49200 107.220 80 17 231
............
............
............

Here is the code I wrote:

with open('log1', 'r') as fl1:
    f1 = [i.split(' ') for i in fl1.read().split('\n')]

with open('log2', 'r') as fl2:
    f2 = [i.split(' ') for i in fl2.read().split('\n')]

def merging(x,y):
    list=[]
    for i in x:
        for j in range(len(i)-1):
            while i[j]==[a[b] for a in y]:
                list.append(i)
                j=j+1
    return list

f3=merging(f1,f2)

for i in f3:
    print i

【Comments】:

@mskimm, yes, the second file is sorted by its first column (the start time).
Sorry, I can't add another comment here, so I'll write it here: 'python --version' reports Python 2.7.6.

Tags: python sorting join merge

【Solution 1】:

I think file2 is being filtered by file1. Right?

I assume file1 is not sorted. (If it were sorted, there would be another, more efficient solution.)

with open('file1') as file1, open('file2') as file2:
    my_filter = [line.strip().split() for line in file1]
    f3 = [line.strip() for line in filter(lambda x: x.strip().split()[1:5] in my_filter, file2)]

# to see f3
for line in f3:
    print line

First, build the filter with my_filter = [line.strip().split() for line in file1], which contains:

[['132.227', '49202', '107.21', '80'], ['132.227', '49202', '107.21', '80'], ['132.227', '49200', '107.220', '80'], ['132.227', '49200', '107.220', '80'], ['132.227', '49222', '207.171', '80'], ['132.227', '49339', '184.730', '80'], ['132.227', '49291', '930.184', '80']]

Then use filter to keep only the lines of file2 whose columns 2 to 5 appear in my_filter. This code works with Python 2.7.
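
Because my_filter is a list, every membership test scans it from the start, which adds up with roughly 1000 rows in file1 and 30000 in file2. The following is a minimal sketch of the same filtering idea using a set of tuples instead. The helper name filter_by_keys and its key_slice parameter are hypothetical, and the sketch assumes whitespace-separated columns with the key in columns 2 to 5 of file2, as in the sample data.

```python
def filter_by_keys(key_lines, data_lines, key_slice=slice(1, 5)):
    """Return the stripped lines of data_lines whose key columns occur in key_lines.

    Hypothetical helper (not from the original answer); assumes
    whitespace-separated columns.
    """
    # Lists are not hashable, so each key row is stored as a tuple in a set.
    keys = set(tuple(line.split()) for line in key_lines if line.strip())
    # Blank lines produce an empty tuple, which is never in keys.
    return [line.strip() for line in data_lines
            if tuple(line.split()[key_slice]) in keys]


# Small in-memory sample taken from the question's data.
file1_lines = ["132.227 49200 107.220 80\n", "132.227 49291 930.184 80\n"]
file2_lines = [
    "46.109498000 132.227 49200 107.220 80 17 48 \n",
    "47.422919000 253.123 1985 224.300 1985 17 48\n",
    "48.909658000 132.227 49291 930.184 80 17 65\n",
]

for line in filter_by_keys(file1_lines, file2_lines):
    print(line)
```

Open file objects can be passed in place of the sample lists, since the helper only iterates over lines. If the result is still empty on real data, the key columns or the separator probably differ from this assumption.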

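【Comments】:

Yes, I have to filter the second file first, then return it in the first format (columns).
Sorry, I just saw the other question: file 1 is not sorted, and I'm not sure I want to sort it, because file 2 is sorted by its first column (which does not exist in file 1).
I have Python 2.7 on my Mac, but when I try to print the output (for i in f3: print i) I still get nothing... it's empty :(
Did I mention that both files contain thousands of lines (1000 lines in file1 and about 30000 in file2, which is why I want to filter)?
The data format may vary. Please provide your data as files to get a better answer.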

【Solution 2】:

I wrote these few lines, and they seem to work:

with open('file1', 'r') as fl1:
    f1 = [i.split(' ') for i in fl1.read().split('\n')]

with open('file2', 'r') as fl2:
    f2 = [i.split(' ') for i in fl2.read().split('\n')]

for i in f2:
    for j in f1:
        if i[1]==j[0] and i[2]==j[1] and i[3]==j[2] and i[4]==j[3]:
            print i

I tried replacing

if i[1]==j[0] and i[2]==j[1] and i[3]==j[2] and i[4]==j[3]:

with:

for k in range(4):
    if i[k+1]==j[k]:
        print i

but it gave me this error:

Traceback (most recent call last):
  File "MERGE.py", line 10, in <module>
    if i[k+1]==j[k]:
IndexError: list index out of range
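
The loop version is not equivalent to the chained condition in two ways. It prints i whenever any single column matches, not when all four do. It also loses short-circuiting: for a row that is shorter than expected, such as the [''] row that split('\n') leaves after a trailing newline, the and chain stops at the first failed comparison, while the loop goes on to index j[1] and raises IndexError. A minimal sketch, assuming that is the cause here, compares the four columns as one slice. The function name rows_match is hypothetical.

```python
def rows_match(i, j):
    """True when columns 2 to 5 of row i equal the four columns of row j."""
    # Slicing never raises IndexError; a short row just gives a shorter list.
    return len(j) >= 4 and i[1:5] == j[:4]


# Same parsing as in the answer; the trailing newline yields a [''] row in f1.
f1 = [line.split(' ') for line in "132.227 49200 107.220 80\n".split('\n')]
f2 = [line.split(' ') for line in
      ("46.109498000 132.227 49200 107.220 80 17 48\n"
       "47.422919000 253.123 1985 224.300 1985 17 48").split('\n')]

# any() stops at the first matching row of f1, so each f2 row is kept once.
matches = [i for i in f2 if any(rows_match(i, j) for j in f1)]
for i in matches:
    print(' '.join(i))
```

Using any also means a file2 row is printed once even when file1 contains duplicate rows, which differs from the nested loops above. A loop-style equivalent of the and chain is all(i[k+1] == j[k] for k in range(4)), which likewise stops at the first mismatch.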

【Comments】: