Answers: Python script to merge rows based on the first column

【Question title】: Python script to merge rows based on 1st column
【Posted】: 2018-07-24 10:22:51
【Question description】:

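I have seen a lot of questions and answers about this, but none of them solved my problem, so any help would be greatly appreciated.

I have a very large CSV file that contains some duplicate column entries, and I would like a script that matches and merges rows based on the first column. (I don't want to use pandas. I'm using Python 2.7. The file has no CSV header.)

Here is the input: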

2144, 2016, 505, 20005, 2007, PP, GPP, DAC, UNSW 
8432, 2015, 505, 20005, 2041, LL, GLO, X2, UNSW
0055, 0.00, 0.00, 2014, 2017
2144, 0.00, 0.00, 2016, 959
8432, 22.9, 0.00, 2015, 2018 
0055, 2014, 505, 20004, 2037, LL, GLO, X2, QAL

Desired output:

2144, 0.00, 0.00, 2016, 959, 2016, 505, 20005, 2007, PP, GPP, DAC, UNSW  
0055, 0.00, 0.00, 2014, 2017, 2014, 505, 20004, 2037, LL, GLO, X2, QAL   
8432, 22.9, 0.00, 2015, 2018, 2015, 505, 20005, 2041, LL, GLO, X2, UNSW

I tried:

import csv

reader = csv.reader(open('input.csv'))
result = {}

for row in reader:
    idx = row[0]
    values = row[1:]
    if idx in result:
        result[idx] = [result[idx][i] or v for i, v in enumerate(values)]
    else:
        result[idx] = values

This is for finding duplicates:

with open('1.csv','r') as in_file, open('2.csv','w') as out_file:
    seen = set() # set for fast O(1) amortized lookup
    for line in in_file:
        if line in seen: continue  # skip lines that were already written

        seen.add(line)
        out_file.write(line)

But neither of these helped me. I'm lost.

Any help would be great.

Thanks

【Question discussion】:

Thanks, Piinthesky. I have edited the question above. I'm lost and don't know where to start.

Tags: python csv sorting merge

【Solution 1】:

Try using a dictionary, with the value of the first column as the key. Here is how I would do it:

import csv

with open('myfile.csv') as csvfile:
    reader = list(csv.reader(csvfile, skipinitialspace=True))  # remove the spaces after the commas
    result = {}  # or collections.OrderedDict() if the output order is important
    for row in reader:
        if row[0] in result:
            result[row[0]].extend(row[1:])  # do not include the key again
        else:
            result[row[0]] = row

    # result.values() returns your wanted output, for example :
    for row in result.values():
        print(', '.join(row))
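
One possible variant of the code above, offered as a sketch rather than as the answerer's exact method. It assumes the IndexError mentioned in the comments below comes from blank lines (csv.reader returns an empty list for a blank line, so row[0] fails), and it iterates over the reader directly instead of wrapping it in list(), so only the merged result is held in memory. The function name merge_rows and the sample rows are made up for illustration.

```python
import csv
from collections import OrderedDict


def merge_rows(lines):
    """Merge CSV rows that share the same first column, keeping first-seen key order."""
    merged = OrderedDict()
    # Iterate over the reader directly; no list() copy of the whole file is made.
    for row in csv.reader(lines, skipinitialspace=True):
        if not row:  # blank line: csv.reader yields [], so skip it
            continue
        key = row[0]
        if key in merged:
            merged[key].extend(row[1:])  # do not include the key again
        else:
            merged[key] = list(row)
    return list(merged.values())


# Hypothetical sample input; any iterable of lines (such as an open file) works.
sample = [
    "2144, 2016, 505",
    "",
    "0055, 0.00, 0.00",
    "2144, 0.00, 959",
]
for row in merge_rows(sample):
    print(', '.join(row))
# 2144, 2016, 505, 0.00, 959
# 0055, 0.00, 0.00
```

Values are appended in file order, so for the sample data in the question the long record for key 2144 would come before the short one, which is the reverse of the desired output shown there.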

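【Discussion】:

Thanks. I was hoping this would work. I'm getting the following error: "if row[0] in result: IndexError: list index out of range". Not sure why. Any ideas? Thanks again.
I think reader = list(csv.reader(csvfile, skipinitialspace=True)) should work.
Thank you. Unfortunately, it now runs for a while and then returns a memory error.
What is the error, and how big is the file? My guess is that the file is too large to fit in memory. If that is the case, you will need to follow similar steps but in chunks, writing the new output to a file.
The error is "reader = list(csv.reader(csvfile, skipinitialspace=True)) MemoryError". Yes, the file is 1,411,035 KB. Something like this... chunk, chunksize = [], 100 def process_chunk(chuck): print len(chuck)?? Thanks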
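
A rough sketch of the "work in chunks and write to a file" idea from the comments. It assumes that rows with the same key can appear anywhere in the file, so fixed-size chunks of consecutive lines would not bring matching rows together. Instead, it first splits the rows into temporary bucket files by a hash of the first column, then merges each bucket in memory. The names merge_large_csv and num_buckets and the bucket file names are hypothetical, and this has not been tried on a file of the size mentioned above.

```python
import csv
import os
import tempfile
from collections import OrderedDict


def merge_large_csv(in_path, out_path, num_buckets=64):
    """Merge rows sharing a first column without loading the whole input at once."""
    tmp_dir = tempfile.mkdtemp()
    bucket_paths = [os.path.join(tmp_dir, 'bucket_%d.csv' % i)
                    for i in range(num_buckets)]
    bucket_files = [open(path, 'w') for path in bucket_paths]
    try:
        writers = [csv.writer(f, lineterminator='\n') for f in bucket_files]
        # Pass 1: send every row to a bucket chosen by its key, so all rows
        # with the same key end up in the same (much smaller) file.
        with open(in_path) as src:
            for row in csv.reader(src, skipinitialspace=True):
                if not row:  # skip blank lines
                    continue
                # hash() only needs to be consistent within this single run.
                writers[hash(row[0]) % num_buckets].writerow(row)
    finally:
        for f in bucket_files:
            f.close()

    # Pass 2: merge one bucket at a time and append the result to the output.
    with open(out_path, 'w') as dst:
        out = csv.writer(dst, lineterminator='\n')
        for path in bucket_paths:
            merged = OrderedDict()
            with open(path) as bucket:
                for row in csv.reader(bucket):
                    if not row:
                        continue
                    if row[0] in merged:
                        merged[row[0]].extend(row[1:])  # do not repeat the key
                    else:
                        merged[row[0]] = row
            for row in merged.values():
                out.writerow(row)
            os.remove(path)
    os.rmdir(tmp_dir)
```

It could be called as merge_large_csv('myfile.csv', 'merged.csv'). Peak memory is roughly the size of the largest bucket, so a larger num_buckets lowers it at the cost of more open files. The output rows come out grouped by bucket rather than in input order, and csv.writer separates fields with a bare comma (no space after it).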