Comparing CSV files in Python - Trouble with loops (answers)

【Question title】: Comparing CSV files in Python - Trouble with loops
【Posted】: 2014-12-16 22:02:03
【Question】:

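I'm trying to write a Python script that reads a series of CSV files, takes the first column, and compares each row against a master CSV file. If there are any matches, it prints the match and the line number to the console. It also writes the file name and the matches found to another CSV file.

So far everything works except this damn loop. Once the script reaches the loop, it flies right through it and moves on to the next file in the target directory of CSV files. I know the outer loop is being processed, because it prints the number of rows in each CSV file to the console. However, it never prints whether a match was found, so something is going wrong in my nested loops.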

for eachFile in files:

    #each file being compared
    target = scanDir+eachFile

    #print a message to the console letting the user know the file we're processing
    print
    print 'Scanning begun on: ' + target

    #open the master file we'll be using during this loop
    f1 = file(masterFile, 'r')
    csv1 = csv.reader(f1)

    with open(target, 'rb') as targetFile:

        #for fun, let's output the rows we'll be processing in the target file
        numberOfRows = sum(1 for row in targetFile)
        print 'This file contains ' + str(numberOfRows) + ' rows to review.'

        reader = csv.reader(targetFile)

        for targetRow in reader: #not processing this loop :(
            foundMatch = False
            for masterRow in csv1:
                if targetRow[0] == masterRow[0]:
                    lineNumber = targetFile.line_num
                    print 'MATCH FOUND! ' + targetRow[0] + 'found on row ' + lineNumber
                    print
                    _includes.CVSWriter.writeRow(target, targetRow[0])
                    foundMatch = True
                if not foundMatch:
                    print 'No matches found in ' + target
                    print
    f1.close()

print 'Scanning Completed'
print

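I have six files for the loop to scan, all with different lengths and values. I even have one that is completely blank, and it still doesn't say 'No matches found'. I'm at a complete loss. I'm sure this is an easy fix, but at this point I could use an outside pair of eyes. Thanks in advance!

【Comments】: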

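Printing the values of targetRow[0] and masterRow[0] inside the nested loop would be a good start. That way you can see what the actual values are and check why the comparison always returns False.
Also, you are looping over every value in csv1 while reader is still on its first iteration. On the second iteration of reader, csv1 probably won't behave as you expect, because you can't iterate over a csv reader twice.
Adding to what figs said, have you tried resetting your iterator, a la stackoverflow.com/questions/2868354?
Look into pandas. It makes this kind of thing trivial. Without seeing sample data I can't give you an exact answer, but it would look something like this: master = pd.read_csv('master'); current = pd.read_csv('current'); current[master.first_column == current.first_column].first_column
I read the other post that Matt Ball posted, and I think I get it, so it is iterating now, but the result is the same. I put 'f1.seek(0)' right after 'for masterRow in csv1'. I *think* that's where I need to reset it?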
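On where the rewind belongs: a minimal sketch, assuming Python 3 and in-memory stand-ins for the files (`find_matches`, `master_file` and the sample values are made up for illustration). The idea is to rewind the master file before each inner pass rather than inside it, and to build a fresh csv.reader on the rewound handle.

```python
import csv
import io

def find_matches(target_rows, master_file):
    """Return target values that also appear in the master file's first column."""
    matches = []
    for target_row in target_rows:
        master_file.seek(0)                          # rewind BEFORE each inner pass
        for master_row in csv.reader(master_file):   # fresh reader on the rewound file
            if master_row and master_row[0] == target_row[0]:
                matches.append(target_row[0])
                break
    return matches

# Hypothetical data standing in for the master CSV and the parsed target rows.
master_file = io.StringIO("a\nb\nc\n")
target_rows = [["b"], ["c"], ["z"]]
print(find_matches(target_rows, master_file))
```

Without the seek(0), only the first target row would ever be compared against the master file, because the master reader is already exhausted after the first inner pass.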

Tags: python regex csv

【Solution 1】:

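Remove the row count from your code. It exhausts the input file, so the reader loop that you say is not executing is in fact executed, but it stops immediately because your reader has nothing left to read.

Addendum

I suggest that you read your reference file once and save only the first item of each row in a set data structure (in this example I have to use data different from yours, because you didn't care to ask a complete question)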
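If the row count is still wanted, one option is to parse the rows once into a list and take its length, so nothing is consumed twice. This is only a sketch, assuming Python 3 and a file small enough to hold in memory; `load_rows` and the sample data are invented for illustration.

```python
import csv
import io

def load_rows(handle):
    """Parse every row once and keep them in a list, so the data can be reused."""
    return list(csv.reader(handle))

# Hypothetical stand-in for one target file.
target_file = io.StringIO("395.0,x\n420.0,y\n")
rows = load_rows(target_file)
print('This file contains ' + str(len(rows)) + ' rows to review.')

for target_row in rows:   # a list can be looped over as many times as needed
    print(target_row[0])
```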

from csv import reader
ref = 'cc2012xyz2_5_5dp.csv'
# collect the first field of every row of the reference file
ref_set = {el[0] for el in reader(open(ref))}

The last line is a set comprehension.

Now you are ready to iterate over the target files (in my example, only one file...)

for tgf in ('cc2012xyz2_5_6dp.csv',):
    rtg = reader(open(tgf))
    matches = 0
    for tg_row in rtg:
        # set membership test against the reference values
        if tg_row[0] in ref_set:
            print '# MATCH FOUND! ', tg_row[0], 'found on row', rtg.line_num
            matches += 1
    if matches == 1:
        print '# In file', tgf, 'there is 1 (one) match.'
    elif matches:
        print '# In file', tgf, 'there are', matches, 'matches.'
    else:
        print '# In file', tgf, 'there are no matches.'

When I run the code above on my data files, I get the following output

# MATCH FOUND!  395.0 found on row 2
# MATCH FOUND!  420.0 found on row 7
# MATCH FOUND!  445.0 found on row 12
# MATCH FOUND!  460.0 found on row 15
# MATCH FOUND!  475.0 found on row 18
# MATCH FOUND!  510.0 found on row 25
# In file cc2012xyz2_5_6dp.csv there are 6 matches.
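The code above uses Python 2 print statements. For readers on Python 3, here is a rough sketch of the same set-lookup idea. The function names `first_column` and `report_matches` and the in-memory sample data are assumptions for illustration, not part of the original answer; with real files one would pass handles opened with open(path, newline='') as the csv documentation recommends.

```python
import csv
import io

def first_column(handle):
    """Collect the first field of every row of the reference file into a set."""
    return {row[0] for row in csv.reader(handle)}

def report_matches(ref_set, handle):
    """Return (line number, value) pairs for rows whose first field is in ref_set."""
    rtg = csv.reader(handle)
    return [(rtg.line_num, row[0]) for row in rtg if row[0] in ref_set]

# Hypothetical sample data standing in for the reference and target files.
reference = io.StringIO("395.0,a\n420.0,b\n")
target = io.StringIO("390.0,p\n395.0,q\n400.0,r\n")

hits = report_matches(first_column(reference), target)
for line_num, value in hits:
    print('# MATCH FOUND!', value, 'found on row', line_num)
print('# Number of matches:', len(hits))
```

The set makes each lookup a constant-time membership test, so the reference file is read only once no matter how many target files are scanned.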

I can't help you with your line

        _includes.CVSWriter.writeRow(target, targetRow[0])

because I don't know what those things are all about (I did a Google search, but the only hit came from your question...)
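If the goal of that line is just to record the file name and the matched value in another CSV file, the standard library's csv.writer may be enough. The following is a guess at the intent, not a reconstruction of `_includes.CVSWriter`; the helper name `write_match` and the sample values are hypothetical, and the sketch assumes Python 3.

```python
import csv
import io

def write_match(out_handle, file_name, value):
    """Append one (file name, matched value) row to an already open output file."""
    csv.writer(out_handle).writerow([file_name, value])

# An in-memory buffer stands in for the results file here; with a real file,
# something like open('results.csv', 'a', newline='') would be used instead.
out = io.StringIO()
write_match(out, 'target.csv', '395.0')
print(repr(out.getvalue()))
```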

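Also, if you still get an IndexError, my guess (given that you haven't asked a proper question) is that some of your data is not in a CSV format that reader can parse correctly.

Unless you seriously edit your question, I can't offer further help. Bye.

【Comments】:

I removed the counter and printed out the targetRow and masterRow variables. Both print correctly; however, when I run it again using targetRow[0] and masterRow[0], I get an error saying the list index is out of range. This is where I get lost...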
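One concrete way this can happen: csv.reader returns an empty list for a blank line, so indexing row[0] on it raises IndexError. Whether that is the cause here is an assumption, since the data was not posted. A small Python 3 sketch of a guard, with a made-up helper name `first_fields` and invented sample data:

```python
import csv
import io

def first_fields(handle):
    """Yield (line number, first field) for each row, skipping rows with no fields."""
    rdr = csv.reader(handle)
    for row in rdr:
        if not row:      # a blank line parses as [], so row[0] would raise IndexError
            continue
        yield rdr.line_num, row[0]

# Hypothetical data with a blank second line.
sample = io.StringIO("395.0,a\n\n420.0,b\n")
print(list(first_fields(sample)))
```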