【Question title】: Compare, remove, and count words in a text file
【Posted】: 2019-11-07 14:34:16
【Description】:

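I want to compare two text files, f1.txt and f2.txt, remove from f2.txt the words that are common to both files, and then sort the new f2.txt by frequency in descending order.

My approach:

  1. Build word lists from f1.txt and f2.txt.
  2. Remove unwanted characters from the text input.
  3. Compare the two lists and remove the common words from the list generated from f2.txt.
  4. Sort the words in the list generated from f2.txt by frequency.

import re
import sys
from collections import Counter

with open(sys.argv[1]) as f1,open(sys.argv[2]) as f2:
    passage = f2.read()
    common = f1.read()
words = re.findall(r'\w+', passage)
common_words = re.findall(r'\w+', common)
passage_text = [words.lower() for words in words]
final = set(passage_text) - set(common_words)
word_count = Counter(final)
for word, count in word_count.items():
    print(word, ":", count)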

I want the output to look like this:

Foo:          12
Bar:          11
Baz:           3
Longword:      1

But instead every word is reported with a count of 1.

【Comments】:

    Tags: python-3.x word-count


    【Solution 1】:

    Here are two ways to count the words in a text file.

    from re import split
    
    
    def process_line(words, word_dict):
        for word in words:
            if word in word_dict:
                word_dict[word] += 1
            else:
                word_dict[word] = 1
    
    
    def process_dict(word_dict):
        temp_list = []
        for key, value in word_dict.items():
            temp_list.append((value, key))
    
        temp_list.sort()
        return temp_list
    
    
    def format_print(input_list, reverse, word_num):
        if reverse:
            input_list.sort(reverse=True)
    
        print("\n", ("[Unique Words: " + str(word_num) + "]").center(35, "="))
        print("-"*35 + "\n", "%-16s %s %16s" % ("Word", "|", "Count"), "\n", "-"*35)
        for count, word in input_list:
            print("%-16s %s %16d" % (word, "|", count))
    
    
    def word_count(_file, max_to_min=False):
        word_dict = {}
        # The "U" mode flag is deprecated and rejected by recent Python 3
        # releases; plain "r" already uses universal newlines.
        with open(_file, "r") as txt:
            for line in txt:
                # Skip blank or whitespace-only lines
                if line.strip():
                    process_line(filter(None, split("[^a-zA-Z']+", line.lower())), word_dict)

        final_list = process_dict(word_dict)
        format_print(final_list, max_to_min, len(word_dict))
    
    
    word_count("C:\\your_path_here\\Test.txt", True)
    
    
    #########################################################
    
    
    from collections import Counter
    import re
    
    def openfile(filename):
        # Read-only mode is enough; "r+" would needlessly require write access
        with open(filename, "r") as fh:
            return fh.read()
    
    def removegarbage(text):
        # Replace one or more non-word (non-alphanumeric) chars with a space
        text = re.sub(r'\W+', ' ', text)
        return text.lower()
    
    def getwordbins(words):
        cnt = Counter()
        for word in words:
            cnt[word] += 1
        return cnt
    
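    def main(filename, topwords):
        txt = openfile(filename)
        txt = removegarbage(txt)
        # split() with no argument drops the empty strings that split(' ')
        # would produce from leading or trailing spaces
        words = txt.split()
        bins = getwordbins(words)
        for key, value in bins.most_common(topwords):
            print(key, value)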
    
    main('C:\\your_path_here\\Test.txt', 500)
    

    Here is a way to compare two text files and keep the elements they have in common (whole lines, in this case).

    with open('C:\\your_path_here\\text1.txt', 'r') as file1:
        with open('C:\\your_path_here\\text2.txt', 'r') as file2:
            same = set(file1).intersection(file2)
    
    same.discard('\n')
    
    with open('C:\\your_path_here\\some_output_file.txt', 'w') as file_out:
        for line in same:
            file_out.write(line)
    
    # For differences, use the code below. symmetric_difference keeps the
    # lines that appear in exactly one of the two files:
    with open('C:\\your_path_here\\text1.txt', 'r') as file1:
        with open('C:\\your_path_here\\text2.txt', 'r') as file2:
            different = set(file1).symmetric_difference(file2)
    
    different.discard('\n')
    
    with open('C:\\your_path_here\\some_output_file.txt', 'w') as file_out:
        for line in different:
            file_out.write(line)
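
The set operations above compare whole lines, while the question asks about individual words. A minimal word-level sketch of the same idea, assuming both files are UTF-8 text, might look like the following. The function names `count_uncommon_words` and `count_from_files` are made up for illustration and do not come from the original answer.

```python
import re
from collections import Counter


def count_uncommon_words(passage_text, common_text):
    """Count words in passage_text that do not occur in common_text."""
    # Lowercase both inputs so "Foo" and "foo" are treated as one word
    common = set(re.findall(r"\w+", common_text.lower()))
    words = re.findall(r"\w+", passage_text.lower())
    # Filter the full word list (not a set) so repeated words keep their counts
    return Counter(w for w in words if w not in common)


def count_from_files(common_path, passage_path):
    """Read both files (assumed UTF-8) and return the filtered word counts."""
    with open(common_path, encoding="utf-8") as f1, \
         open(passage_path, encoding="utf-8") as f2:
        return count_uncommon_words(f2.read(), f1.read())


if __name__ == "__main__":
    sample = count_uncommon_words("Foo and Bar and Baz or Bar foo", "And OR")
    for word, count in sample.most_common():
        print(word, ":", count)
```

With the question's files this could be called as `count_from_files("f1.txt", "f2.txt")`, where f1.txt holds the common words.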
    

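    【Comments】:

      【Solution 2】:

      Your `final` value contains only unique words (one of each), which is why Counter reports a count of 1 for every word. You need to filter passage_text with that set of words and pass the filtered list to Counter: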

      import re
      from collections import Counter
      
      passage = '''
          Foo and Bar and Baz or Longword
          Bar or Baz
          Foo foo foo
      '''
      
      common = '''and or'''
      
      words = re.findall(r'\w+', passage)
      common_words = re.findall(r'\w+', common)
      passage_text = [w.lower() for w in words]
      # Lowercase the common words too, so the comparison is case-insensitive
      final_set = set(passage_text) - {w.lower() for w in common_words}
      # Count every occurrence in the list, not the de-duplicated set
      word_count = Counter(w for w in passage_text if w in final_set)
      for word, count in sorted(word_count.items(), key=lambda k: -k[1]): # or word_count.most_common()
          print(word, ":", count)
      

      Prints:

      foo : 4
      bar : 2
      baz : 2
      longword : 1
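
An alternative that gives the same counts is to count every word first and then delete the common ones from the Counter. This is only a sketch of that variation, not the answer's own method, and `count_then_remove` is a hypothetical helper name.

```python
import re
from collections import Counter


def count_then_remove(passage, common):
    """Count all words in passage, then delete those that appear in common."""
    counts = Counter(re.findall(r"\w+", passage.lower()))
    for word in set(re.findall(r"\w+", common.lower())):
        # pop with a default avoids a KeyError for words absent from the passage
        counts.pop(word, None)
    return counts
```

Counting before filtering means the common-word list is only walked once, which may be convenient when that list is short and the passage is long.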
      

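      【Comments】:

      • Thanks for the pointer. However, with a large text file as input, the results do not come out in descending order.
      • @KrysNuvadga I updated the answer. You just need to sort the results or use the Counter.most_common() method.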
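
The Counter.most_common() approach mentioned in the comment can be combined with the column-aligned output the question asked for. The sketch below is one way to do that. `format_counts` is a hypothetical helper, and the capitalization and four-space padding are assumptions chosen to resemble the question's sample output.

```python
from collections import Counter


def format_counts(word_count, top_n=None):
    """Return lines like 'Foo:   12', most frequent word first."""
    ranked = word_count.most_common(top_n)  # sorted by count, descending
    if not ranked:
        return []
    # Pad the "word:" label to the longest word; right-align the counts
    label_width = max(len(word) for word, _ in ranked) + 1
    count_width = max(len(str(count)) for _, count in ranked) + 4
    lines = []
    for word, count in ranked:
        label = word.capitalize() + ":"
        lines.append(f"{label:<{label_width}} {count:>{count_width}}")
    return lines


if __name__ == "__main__":
    demo = Counter({"foo": 12, "bar": 11, "baz": 3, "longword": 1})
    print("\n".join(format_counts(demo)))
```

Passing `top_n` limits the output to the most frequent words, which helps with large input files.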