Diff and intersection reporting between two text files: answers

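【Question title】: Diff and intersection reporting between two text files
【Posted】: 2013-04-29 22:42:51
【Question description】:

Disclaimer: I am new to programming and scripting in general, so please excuse the lack of technical language.

I have two text files, each containing a list of names: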

First File | Second File
bob        | bob
mark       | mark
larry      | bruce
tom        | tom

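I want to run a script (preferably Python) that writes the lines common to both files to one text file and the lines that differ to another, for example:

matches.txt:

bob
mark
tom

differences.txt: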

bruce

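How would I accomplish this with Python? Or with the Unix command line, if that is easy enough?

【Comments on the question】:

Use sets and standard file I/O ... and throw in string.split for good measure :) ... Or what have you tried, and where are you stuck?
Is the Unix diff command not good enough?
I suspect ordering doesn't matter, so probably not...
Note that these are listed vertically (if that matters).
Yes, ordering doesn't matter. I'm more comfortable with Unix, but thought this would be a good exercise for learning how to do it in Python. I'm an absolute beginner.

Tags: python list shell compare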

【Solution 1】:

sort | uniq is fine, but comm might be even better. See "man comm" for more information.

From the man page:

EXAMPLES
       comm -12 file1 file2
              Print only lines present in both file1 and file2.

       comm -3 file1 file2
              Print lines in file1 not in file2, and vice versa.

You could also use Python's set type, but comm is simpler.
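For comparison, the set-based route might look like the minimal sketch below. The helper names `read_names`, `common_lines`, and `differing_lines` are made up for illustration and are not part of any library. Unlike comm, this approach ignores repeated lines and does not need sorted input.

```python
def read_names(path):
    """Return the set of non-empty, whitespace-stripped lines in a file."""
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}


def common_lines(a, b):
    """Roughly `comm -12`: names present in both sets, sorted."""
    return sorted(a & b)


def differing_lines(a, b):
    """Roughly `comm -3` without the column indentation: names in exactly one set."""
    return sorted(a ^ b)


# In-memory stand-ins for the two files from the question
first = {"bob", "mark", "larry", "tom"}
second = {"bob", "mark", "bruce", "tom"}
print(common_lines(first, second))     # ['bob', 'mark', 'tom']
print(differing_lines(first, second))  # ['bruce', 'larry']
```

With real files, `common_lines(read_names("file1"), read_names("file2"))` would play the role of `comm -12 file1 file2`.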

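【Discussion】:

Note that comm requires sorted files as input.
Here is an old tool I wrote that performs a variety of set operations in Python. It doesn't require sorted files, and it is designed to be called from the shell: stromberg.dnsalias.org/~strombrg/set-arithmetic

【Solution 2】:

Unix shell solution: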

# duplicate lines
sort text1.txt text2.txt | uniq -d

# unique lines
sort text1.txt text2.txt | uniq -u

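【Discussion】:

Note to the OP: to write the output to a file, just redirect it with > file.txt at the end of the command, like this: sort text1.txt text2.txt | uniq -d > dups.txt
Via [clfu], for duplicates (commandlinefu.com/commands/view/5707/…): (sort -u file1; sort -u file2) | sort | uniq -d (this appears to do the same thing, but is shorter)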
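The `sort -u` variant matters when a name is repeated inside a single file. A rough Python analogue of the two pipelines, as a sketch only: the function name `dup_and_unique`, its `dedupe_each` flag, and the sample data are hypothetical.

```python
from collections import Counter


def dup_and_unique(lines1, lines2, dedupe_each=True):
    """Emulate `sort f1 f2 | uniq -d` and `uniq -u` on two lists of lines.

    With dedupe_each=True each input is deduplicated first, as `sort -u` does.
    Returns (duplicates, uniques), both sorted.
    """
    if dedupe_each:
        lines1, lines2 = set(lines1), set(lines2)
    counts = Counter(lines1)
    counts.update(lines2)
    duplicates = sorted(name for name, n in counts.items() if n > 1)
    uniques = sorted(name for name, n in counts.items() if n == 1)
    return duplicates, uniques


# "larry" appears twice in the first list and never in the second
first = ["bob", "larry", "larry"]
second = ["bob", "bruce"]

print(dup_and_unique(first, second, dedupe_each=False))
# (['bob', 'larry'], ['bruce'])   larry is reported as shared although it is not
print(dup_and_unique(first, second))
# (['bob'], ['bruce', 'larry'])
```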

【Solution 3】:

# Read each file and split on whitespace to get a set of names
with open("some1.txt") as f1, open("some2.txt") as f2:
    words1 = set(f1.read().split())
    words2 = set(f2.read().split())

duplicates = words1.intersection(words2)        # names in both files
uniques = words1.symmetric_difference(words2)   # names in exactly one file

print("Duplicates(%d):%s" % (len(duplicates), duplicates))
print("\nUniques(%d):%s" % (len(uniques), uniques))

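Something like this, at least.

【Discussion】:

Hey, I have a question: if the files are very large, their entire contents end up stored in the sets. Is there an efficient way to handle large files?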
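One possible direction for large inputs, offered as a sketch rather than a tested recipe: if both files are already sorted (for example by the external sort command), they can be walked in step, much as comm does, holding only one line from each in memory. The function name `merge_compare` is hypothetical. The sketch assumes each input is sorted in the order Python uses to compare strings (something like `LC_ALL=C sort` should give that for plain ASCII names) and contains no repeated lines.

```python
def merge_compare(sorted1, sorted2):
    """Yield (tag, line) pairs from two sorted iterables of lines.

    tag is 'both', 'only1', or 'only2'. Only the current line of each
    input is held in memory.
    """
    it1, it2 = iter(sorted1), iter(sorted2)
    a, b = next(it1, None), next(it2, None)
    while a is not None and b is not None:
        if a == b:
            yield "both", a
            a, b = next(it1, None), next(it2, None)
        elif a < b:
            yield "only1", a
            a = next(it1, None)
        else:
            yield "only2", b
            b = next(it2, None)
    # One input is exhausted; drain whatever is left of the other
    while a is not None:
        yield "only1", a
        a = next(it1, None)
    while b is not None:
        yield "only2", b
        b = next(it2, None)


first = ["bob", "larry", "mark", "tom"]    # sorted stand-in for file 1
second = ["bob", "bruce", "mark", "tom"]   # sorted stand-in for file 2
for tag, name in merge_compare(first, second):
    print(tag, name)
```

With real files, each input could be a generator such as `(line.rstrip("\n") for line in f)`, so neither file is ever loaded whole.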

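【Solution 4】:

Python dictionary lookups are O(1) on average, or very close to it. In other words, they are very fast (but they use a lot of memory if the file you are indexing is large). So first read the first file into a list, from which the dictionary will be built: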

with open('left.txt') as f:
    left = [x.strip() for x in f.readlines()]

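The list comprehension and strip() are needed because readlines hands you each line with its trailing newline intact. This creates a list of all the items in the file, assuming one per line (use .split if they are all on one line).

Now build a dictionary: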

ldi = dict.fromkeys(left)

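This builds a dictionary with the items from the list as keys, which also takes care of duplicates. Now iterate over the second file and check whether each name is a key in the dictionary: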

with open('matches.txt', 'w') as matches, open('uniq.txt', 'w') as uniq, \
        open('right.txt') as right:
    for line in right:
        if line.strip() in ldi:
            # write to matches
            matches.write(line)
        else:
            # write to uniq
            uniq.write(line)

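【Discussion】:

Come to think of it, this won't find the names that are unique to left.txt. That is easy enough to fix by mirroring the dict solution, but you could also look at Python's "set" type, which lets you easily determine intersections and differences.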
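A sketch of that mirrored approach, keeping the dictionary idea but recording which left-hand names turned up on the right, so that names unique to left.txt can be reported as well. The function name `compare_names` is hypothetical, and it takes lists of lines instead of file names so that it can be tried without any files.

```python
def compare_names(left_lines, right_lines):
    """Return (matches, left_only, right_only) as lists of stripped names."""
    # Map each left-hand name to a flag: has it been seen on the right?
    seen = dict.fromkeys((l.strip() for l in left_lines), False)
    matches, right_only = [], []
    for line in right_lines:
        name = line.strip()
        if name in seen:
            if not seen[name]:      # report each match only once
                matches.append(name)
            seen[name] = True
        else:
            right_only.append(name)
    # Names never flagged exist only in the left-hand file
    left_only = [name for name, hit in seen.items() if not hit]
    return matches, left_only, right_only


left = ["bob\n", "mark\n", "larry\n", "tom\n"]
right = ["bob\n", "mark\n", "bruce\n", "tom\n"]
print(compare_names(left, right))
# (['bob', 'mark', 'tom'], ['larry'], ['bruce'])
```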

【Solution 5】:

>>> with open('first.txt') as f1, open('second.txt') as f2:
        w1 = set(f1)
        w2 = set(f2)


>>> with open('matches.txt','w') as fout1, open('differences.txt','w') as fout2:
        fout1.writelines(w1 & w2)
        fout2.writelines(w2 - w1)


>>> with open('matches.txt') as f:
        print(f.read())


bob
mark
tom
>>> with open('differences.txt') as f:
        print(f.read())


bruce

【Discussion】:

【Solution 6】:

I made one that prints a horizontal line as a separator:

file_1_list = []
file_2_list = []

# Read the first file: keep the raw text and a list of stripped lines
with open(input('Enter the first file name: ')) as file:
    file_1 = file.read()
    for line in file_1.splitlines():
        file_1_list.append(line.strip())

# Read the second file the same way
with open(input('Enter the second file name: ')) as file:
    file_2 = file.read()
    for line in file_2.splitlines():
        file_2_list.append(line.strip())

# Compare the raw contents; if they differ, print both files
# separated by a horizontal line
if file_1 == file_2:
    print("Yes")
else:
    print("No")
    print(file_1)
    print("--------------")
    print(file_2)

【Discussion】: