Counting the number of times the first word in a line appears when read from a file, with exceptions: answers

【Question title】: Counting the number of times the first word in a line appears when read from file with exceptions
【Posted】: 2018-05-24 06:11:52
【Question】:

Using a dummy file (streamt.txt) with the following contents:

andrew I hate mondays.
fred Python is cool.
fred Ko Ko Bop Ko Ko Bop Ko Ko Bop for ever
andrew @fred no it isn't, what do you think @john???
judy @fred enough with the k-pop
judy RT @fred Python is cool.
andrew RT @judy @fred enough with the k pop
george RT @fred Python is cool.
andrew DM @john Oops
john DM @andrew Who are you go away! Do you know him, @judy?

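The first word of each line is a user and the rest of the line is a message, similar to Twitter. I need to print the top n (entered by the user) original posting users, that is, the users with the most messages, each next to the number of messages they sent.

This excludes any message that begins with "RT". In the case of a tie, the tied users are listed in lexicographic order, and the output is formatted in aligned columns.

As it stands, my code only finds the most frequently used words across all messages. It does not exclude RT and DM messages, and it does not take n into account: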

file=open('streamt.txt')

counts=dict()
for line in file:
    words=line.split()
    for word in words:
        counts[word]=counts.get(word, 0)+1

lst=list()
for key,value in counts.items():
    new=(value, key)
    lst.append(new)

lst=sorted (lst, reverse=True)

for value, key in lst[:10]:
    print(value,key)

This is my output:

6 Ko
5 @fred
4 andrew
3 you
3 is
3 cool.
3 RT
3 Python
3 Bop
2 with

The actual output should be:

Enter n: 10
3 andrew
2 fred
1 john judy

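Any ideas on how I should go about this?

【Comments】:

Is your text file consistent with your expected output? Shouldn't andrew be 2 according to your instructions?
Yes, I noticed that earlier. That is the solution output I was given, but I guess it should actually be 2
The solution output you were given? It looks like you didn't even check whether the output matches the instructions.
It turns out only RT needs to be excluded

Tags: python string file dictionary count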

【Solution 1】:

Use Counter:

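from collections import Counter

filename = 'streamt.txt'  # the sample file from the question
lst = []                  # first word (user name) of every kept line

with open(filename, "r") as f:
    for line in f:
        # Substring test: skips any line that contains 'DM' or 'RT' anywhere
        if 'DM' not in line and 'RT' not in line:
            words = line.split()
            lst.append(words[0])

# On Python 3.7+ Counter keeps first-seen order, so users print in file order
for k, v in Counter(lst).items():
    print(v, k)

# 2 andrew
# 2 fred
# 1 judy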
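
Note that `'RT' not in line` is a substring test, so it would also skip a message that merely contains the letters "RT" somewhere in its text. Below is a minimal sketch of a token-based check instead, assuming a retweet is always marked by `RT` as the second whitespace-separated word. The helper name `is_retweet` and the second test message are made up for illustration:

```python
def is_retweet(line):
    """Return True if the second whitespace-separated token is exactly 'RT'."""
    words = line.split()
    return len(words) > 1 and words[1] == 'RT'

print(is_retweet('judy RT @fred Python is cool.'))  # True
print(is_retweet('fred My PORT is closed'))         # False, although 'RT' in line is True
```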

【Comments】:

【Solution 2】:

Count like this:

#!/usr/bin/env python3.6
from collections import Counter, defaultdict
from pathlib import Path

def main():
    n = input('Enter n: ')
    try:
        n = int(n)
    except ValueError:
        print('Invalid input.')
        return
    ss = Path('streamt.txt').read_text().strip().split('\n')
    c = Counter([
        i.strip().split(' ', 1)[0] for i in ss
        if i.strip().split(' ', 2)[1] not in ('RT',)
    ])
    d = defaultdict(list)
    for k, v in c.most_common():
        d[v].append(k)
    print('\n'.join([f'{k} {" ".join(v)}' for k, v in list(d.items())[:n]]))

if __name__ == '__main__':
    main()

Output:

Enter n: 10
3 andrew
2 fred
1 judy john
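
The tied users in the output above come out in file order (`judy john`), whereas the question asks for lexicographic order (`john judy`). Here is a sketch of one way to get that, assuming the counts are already in a Counter. The counts below are copied from the output above rather than read from the file:

```python
from collections import Counter, defaultdict

# Counts copied from the output above (RT messages already excluded)
c = Counter({'andrew': 3, 'fred': 2, 'judy': 1, 'john': 1})

# Group user names by their message count
groups = defaultdict(list)
for name, count in c.items():
    groups[count].append(name)

# Highest count first; names inside each tie sorted lexicographically
rows = [(count, sorted(names)) for count, names in sorted(groups.items(), reverse=True)]
for count, names in rows:
    print(count, ' '.join(names))
```

Sorting `groups.items()` compares the integer counts first, and since every count is unique as a key, the name lists are never compared against each other.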

【Comments】:

【Solution 3】:

Use the collections module.

Demo:

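import collections

filename = 'streamt.txt'  # the sample file from the question
d = collections.defaultdict(int)
with open(filename, "r") as infile:
    for line in infile:
        # Skip any line that contains 'RT' or 'DM' anywhere
        if 'RT' not in line and 'DM' not in line:
            d[line.strip().split()[0]] += 1

# Sort (user, count) pairs by count, highest first
d = sorted(d.items(), key=lambda x: x[1], reverse=True)
for k, v in d:
    print(v, k)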

Output:

2 andrew
2 fred
1 judy

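【Comments】:

It gives the right answer, but it isn't formatted correctly
for k,v in d.items(): print(v, k) ?
Any idea how to print the ties next to each other and in descending order?
Updated the snippet.
Where would you put the n variable?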
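
The thread leaves that last question open. One possible placement, sketched under the assumption that n limits the number of users printed (not the number of distinct counts), is a slice on the sorted list. The function name `top_users` is made up, and the sample data is inlined so the snippet runs without the file:

```python
import collections

SAMPLE = """\
andrew I hate mondays.
fred Python is cool.
fred Ko Ko Bop Ko Ko Bop Ko Ko Bop for ever
andrew @fred no it isn't, what do you think @john???
judy @fred enough with the k-pop
judy RT @fred Python is cool.
andrew RT @judy @fred enough with the k pop
george RT @fred Python is cool.
andrew DM @john Oops
john DM @andrew Who are you go away! Do you know him, @judy?
"""

def top_users(lines, n):
    """Return the first n (user, count) pairs, highest count first."""
    counts = collections.defaultdict(int)
    for line in lines:
        # Same filter as the answer above: skip lines containing RT or DM
        if line.strip() and 'RT' not in line and 'DM' not in line:
            counts[line.split()[0]] += 1
    ranked = sorted(counts.items(), key=lambda x: x[1], reverse=True)
    return ranked[:n]  # n is applied here, after sorting

lines = SAMPLE.splitlines()
for user, count in top_users(lines, 2):
    print(count, user)
```

In a script, n would come from `int(input('Enter n: '))` before the call.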

【Solution 4】:

Here is a complete solution whose only import is defaultdict. Note that it accounts for the fact that several users may have the same number of messages:

from collections import defaultdict

n = int(input("Enter n: "))

# Build dictionary with key = name / value = number of messages
d = defaultdict(int)
with open('file.txt') as file:
    for line in file:
        words = line.split()
        if words[1] not in ["RT"]:
            d[words[0]] += 1

# Build dictionary with key = number of messages / value = list of names
d_new = defaultdict(list)
for k,v in d.items():
    d_new[v].append(k)

# Keep only the top n items in dictionary sorted by number of messages
listOfNbAndNames = sorted(d_new.items(), reverse = True)[:n]
for nb,names in listOfNbAndNames:
    print(nb, " ".join(names))
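
The question also asks for aligned columns, which the code above does not handle. Here is a sketch, assuming "aligned" means right-aligning the counts to the width of the widest one. The helper `format_rows` is hypothetical, and the rows are made-up values chosen so that the counts have different widths:

```python
def format_rows(rows):
    """Right-align each count to the width of the widest count."""
    width = max((len(str(nb)) for nb, _ in rows), default=0)
    return [f'{nb:>{width}} {" ".join(names)}' for nb, names in rows]

# Made-up (count, names) pairs in the same shape as listOfNbAndNames
rows = [(12, ['andrew']), (3, ['fred']), (1, ['john', 'judy'])]
for row in format_rows(rows):
    print(row)
```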

【Comments】:

【Solution 5】:

This can be done efficiently by using str.split to recover the author's username and collections.Counter to keep the counts.

from collections import Counter

with open('streamt.txt', 'r') as file:
    count = Counter(line.split()[0] for line in file)

print(count) # Counter({'andrew': 4, 'fred': 2, 'judy': 2, 'george': 1, 'john': 1})

If you want the users sorted by number of messages, you can use Counter.most_common. You can optionally pass the number of items to return as an argument.

print(count.most_common())
# prints:  [('andrew', 4), ('fred', 2), ('judy', 2), ('george', 1), ('john', 1)]

print(count.most_common(2))
# prints:  [('andrew', 4), ('fred', 2)]
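
The counts above include retweets. Here is a sketch of the same Counter approach with RT messages skipped, assuming `RT` always appears as the second word of a retweet. The sample is inlined through `io.StringIO` in place of the file, so the snippet runs on its own:

```python
import io
from collections import Counter

# Stand-in for open('streamt.txt'); iterating it yields lines like a file does
sample = io.StringIO("""\
andrew I hate mondays.
fred Python is cool.
fred Ko Ko Bop Ko Ko Bop Ko Ko Bop for ever
andrew @fred no it isn't, what do you think @john???
judy @fred enough with the k-pop
judy RT @fred Python is cool.
andrew RT @judy @fred enough with the k pop
george RT @fred Python is cool.
andrew DM @john Oops
john DM @andrew Who are you go away! Do you know him, @judy?
""")

count = Counter(
    words[0]
    for words in map(str.split, sample)
    # Skip blank lines and lines whose second word is RT
    if words and words[1:2] != ['RT']
)

print(count.most_common(2))
# prints:  [('andrew', 3), ('fred', 2)]
```

The slice `words[1:2]` is used instead of `words[1]` so that a line with only a user name does not raise an IndexError.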

【Comments】: