Comparing words from different files: answers

【Question title】: Comparing words from different files
【Posted】: 2018-10-24 19:30:35
【Question description】:

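I'm new to Python and have run into a problem. I wrote code to determine the total word count and the unique word count of several files (in this case the .txt files are chapters of a book; sample text from file1: "It has been disputed at what period of life the causes of variability, whatever they may be, generally act; whether during the early or late period of development of the embryo, or at the instant of conception."; sample text from file2: "Finally, then, varieties have the same general characters as species, for they cannot be distinguished from species, except, firstly, by the discovery of intermediate linking forms").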

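I can't find any examples online of how to compare words between files. I need to determine the number of words shared between the files and the number of words unique to each file (relative to the other file). My final output should contain 7 numbers: the total word counts for file1 and file2, the unique word counts for file1 and file2, the number of words shared between file1 and file2, the number of words in file1 but not in file2, and finally the number of words in file2 but not in file1. I know I have to use set() to do this, but I don't understand how.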

import glob
from collections import Counter

path = "c-darwin-chapter-?.txt"

wordcount = {}

for filename in glob.glob(path):
  with open("c-darwin-chapter-1.txt", 'r') as f1, open("c-darwin-chapter-2.txt", 'r') as f2:
      f1_word_list = Counter(f1.read().replace(',','').replace('.','').replace("'",'').replace('!','').replace('&','').replace(';','').replace('(','').replace(')','').replace(':','').replace('?','').lower().split())

      print("Total word count per file: ", sum(f1_word_list.values()))
      print("Total unique word count: ", len(f1_word_list))

      f2_word_list = Counter(f2.read().replace(',','').replace('.','').replace("'",'').replace('!','').replace('&','').replace(';','').replace('(','').replace(')','').replace(':','').replace('?','').lower().split())

      print("Total word count per file: ", sum(f2_word_list.values()))
      print("Total unique word count: ", len(f2_word_list))

#if/main commented out but final code must use if/main and loop
#if __name__ == '__main__':
#   main()

Desired output:

Total word count
   Chapter1 = 11615
   Chapter2 = 4837

Unique word count
   Chapter1 = 1991
   Chapter2 = 1025

Words in Chapter1 and Chapter2: 623
Words in Chapter1 not in Chapter2: 1368
Words in Chapter2 not in Chapter1: 402

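【Comments】:

You should include two text samples in your question (as text, not images), along with your desired output.
The text files are chapters of a book (very long), and I've been told in the past not to post very long questions, so I left them out. I'll update the question to include the desired output.
I'd include only small text samples, so that those answering have data to work with.
OK, I'll update the question to reflect that; thanks!
Read up on class set([iterable]). Related: finding-the-intersection-of-the-paired-typed-lists-collection-of-strings-in-py

Tags: python python-3.x text compare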

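【Solution 1】:

Read both files and convert the text you read into lists/sets. With sets, you can use the set operators to compute the intersection and difference between them: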

s.intersection(t)    s & t    new set with elements common to s and t
s.difference(t)      s - t    new set with elements in s but not in t

An explanatory table of the set operations can be found here: Doku 2.x / valid for 3.7 as well
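
As a small illustration of the two operations listed above, the following sketch checks that the method form and the operator form give the same result. The sets `s` and `t` and their contents are made up for this example. One difference worth knowing: the methods accept any iterable, while the operators require both operands to be sets.

```python
# Hypothetical sample data, only for illustration
s = {"variety", "species", "embryo"}
t = {"species", "forms", "embryo"}

# For two sets, the method form and the operator form are equivalent
assert s.intersection(t) == s & t
assert s.difference(t) == s - t

common = s & t       # elements that are in both s and t
only_in_s = s - t    # elements that are in s but not in t

# Methods accept any iterable (here a list); operators need sets on both sides
common_from_list = s.intersection(["species", "forms"])

print(sorted(common))
print(sorted(only_in_s))
print(sorted(common_from_list))
```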

Demo:

file1 = "This is some text in some file that you can preprocess as you " +\
        "like. This is some text in some file that you can preprocess as you like."

file2 = "this is other text about animals and flowers and flowers and " +\
        "animals but not animal-flowers that has to be processed as well"

# split into list - no .lower().replace(...) - you solved that already
list_f1 = file1.split() 
list_f2 = file2.split()

# create sets from list (case sensitive)
set_f1 = set( list_f1 )
set_f2 = set( list_f2 )

print(f"Words: {len(list_f1)} vs {len(list_f2)} Unique {len(set_f1)} vs {len(set_f2)}.")
# difference
print(f"Only in 1: {set_f1-set_f2} [{len(set_f1-set_f2)}]")
# intersection
print(f"In both {set_f1&set_f2} [{len(set_f1&set_f2)}]")
# difference the other way round
print(f"Only in 2:{set_f2-set_f1} [{len(set_f2-set_f1)}]")

Output:

Words: 28 vs 22 Unique 12 vs 18.
Only in 1: {'like.', 'in', 'you', 'can', 'file', 'This', 'preprocess', 'some'} [8]
In both {'is', 'that', 'text', 'as'} [4]
Only in 2:{'animals', 'not', 'but', 'animal-flowers', 'to', 'processed',
           'has', 'be', 'and', 'well', 'this', 'about', 'other', 'flowers'} [14]

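You already handle reading the files and "unifying" the text to lowercase, so I left that out here. The output uses Python 3.6's string interpolation syntax: see PEP 498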
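
To connect the demo back to the seven numbers asked for in the question, here is a minimal sketch that combines the question's cleanup (lowercasing and stripping the same punctuation characters) with the set operations. The helper names `clean_words` and `compare_texts`, the `PUNCTUATION` constant and the dictionary keys are hypothetical, chosen only for this example, and the demo strings stand in for the real chapter files.

```python
from collections import Counter

# The punctuation characters the question removes; adjust for your texts
PUNCTUATION = ",.'!&;():?"


def clean_words(text):
    """Drop the punctuation above, lowercase the text, split on whitespace."""
    return text.translate(str.maketrans("", "", PUNCTUATION)).lower().split()


def compare_texts(text1, text2):
    """Return the seven counts from the question as a dict (hypothetical keys)."""
    counts1 = Counter(clean_words(text1))
    counts2 = Counter(clean_words(text2))
    set1, set2 = set(counts1), set(counts2)
    return {
        "total_1": sum(counts1.values()),   # all words in text1
        "total_2": sum(counts2.values()),   # all words in text2
        "unique_1": len(set1),              # distinct words in text1
        "unique_2": len(set2),              # distinct words in text2
        "shared": len(set1 & set2),         # intersection
        "only_in_1": len(set1 - set2),      # difference
        "only_in_2": len(set2 - set1),      # difference the other way round
    }


# The demo strings from above, used here instead of real files
sample1 = ("This is some text in some file that you can preprocess as you "
           "like. This is some text in some file that you can preprocess as you like.")
sample2 = ("this is other text about animals and flowers and flowers and "
           "animals but not animal-flowers that has to be processed as well")

stats = compare_texts(sample1, sample2)
for name, value in stats.items():
    print(f"{name}: {value}")
```

For the chapters in the question, the two strings would come from reading each file, for example with `open("c-darwin-chapter-1.txt").read()`. Because the text is lowercased before the sets are built, "This" and "this" count as the same word here, unlike in the case-sensitive demo.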

【Discussion】: