Counting the frequency of each word in a given text [closed]: answers

【Question title】: Counting the frequency of each word in a given text [closed]
【Posted】: 2013-06-07 12:12:44
【Question】:

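I'm looking for a Python program that counts the frequency of each word in a text and outputs each word with its count and the numbers of the lines on which it appears.
We define a word as a contiguous sequence of non-whitespace characters. (Hint: split())

Note: different capitalizations of the same character sequence should be treated as the same word, e.g. Python and python, I and i.

The input consists of several lines, with a blank line terminating the text. Only alphabetic characters and whitespace appear in the input.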
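The "blank line terminates the text" rule can be sketched as follows. This is only a minimal illustration, not part of the original question: the helper name read_until_blank and the sample list are hypothetical, and in a real program sys.stdin could be passed in place of the list.

```python
def read_until_blank(lines):
    """Collect lines from any iterable until the first blank line."""
    collected = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # an empty (or whitespace-only) line ends the text
            break
        collected.append(line)
    return collected

# hypothetical sample; sys.stdin could be used instead of this list
sample = [
    "Python is a cool language but OCaml\n",
    "is even cooler since it is purely functional\n",
    "\n",
    "this line comes after the blank line and is ignored\n",
]
text_lines = read_until_blank(sample)
print(len(text_lines))
```

Anything after the first blank line is ignored, so the two sample lines are all that reach the word counter.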

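The output format is as follows:
Each line begins with a number giving the frequency of the word, then a space, then the word itself, followed by a list of the numbers of the lines that contain the word.

Sample input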

Python is a cool language but OCaml
is even cooler since it is purely functional

Sample output

3 is 1 2
1 a 1
1 but 1
1 cool 1
1 cooler 2
1 even 2
1 functional 2
1 it 2
1 language 1
1 ocaml 1
1 purely 2
1 python 1
1 since 2

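P.S. I'm not a student; I'm teaching myself Python.

【Question comments】:

What is your question?
A program that counts the frequency of each word in a text and outputs each word with its count and the numbers of the lines on which it appears.
Have you tried it yourself? If so, please post your code and explain the problem you're running into. People tend to be friendlier and more willing to answer if you show that you've made an effort. Hint: look at the with statement and the collections module.
What are we supposed to do with hint: `split()`? How did you come across this problem?
I downvoted this because it isn't a real question: you are just asking for complete, ready-to-run code. You haven't tried to do it yourself. This is a "help" forum, not a code factory.

Tags: python string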

【Solution 1】:

Use collections.defaultdict, collections.Counter and string formatting:

from collections import Counter, defaultdict

data = """Python is a cool language but OCaml
is even cooler since it is purely functional"""

result = defaultdict(lambda: [0, []])
for i, l in enumerate(data.splitlines()):
    for k, v in Counter(l.split()).items():
        result[k][0] += v
        result[k][1].append(i+1)

for k, v in result.items():
    print('{1} {0} {2}'.format(k, *v))

Output:

1 since [2]
3 is [1, 2]
1 a [1]
1 it [2]
1 but [1]
1 purely [2]
1 cooler [2]
1 functional [2]
1 Python [1]
1 cool [1]
1 language [1]
1 even [2]
1 OCaml [1]

If the order matters, you can sort the result like this:

items = sorted(result.items(), key=lambda t: (-t[1][0], t[0].lower()))
for k, v in items:
    print('{1} {0} {2}'.format(k, *v))

Output:

3 is [1, 2]
1 a [1]
1 but [1]
1 cool [1]
1 cooler [2]
1 even [2]
1 functional [2]
1 it [2]
1 language [1]
1 OCaml [1]
1 purely [2]
1 Python [1]
1 since [2]
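Note that this answer keeps the original capitalization and prints the line numbers as Python lists, whereas the question asks for lowercased words and space-separated line numbers. A possible adaptation, offered only as a sketch (the function name report is hypothetical and not from the answer), keeps the same defaultdict/Counter approach:

```python
from collections import Counter, defaultdict

def report(data):
    """Return lines like '3 is 1 2', sorted by descending count, then by word."""
    result = defaultdict(lambda: [0, []])
    for i, line in enumerate(data.splitlines(), start=1):
        # lowercase before splitting so that Python and python are merged
        for word, n in Counter(line.lower().split()).items():
            result[word][0] += n
            result[word][1].append(i)
    items = sorted(result.items(), key=lambda t: (-t[1][0], t[0]))
    return [
        "{} {} {}".format(count, word, " ".join(map(str, lines)))
        for word, (count, lines) in items
    ]

data = """Python is a cool language but OCaml
is even cooler since it is purely functional"""

rows = report(data)
print("\n".join(rows))
```

Because Counter collapses repeated words within a line, each line number is appended at most once per word.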

【Comments】:

How do I write this result to a CSV file?
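One way to approach the CSV question, sketched with the standard csv module. The column names, the function name write_csv and the small result dictionary are assumptions made for illustration; the dictionary only mimics the word -> [count, [line numbers]] shape built by the answer above.

```python
import csv
import io

def write_csv(result, stream):
    """Write one row per word: the word, its count, and its line numbers."""
    writer = csv.writer(stream)
    writer.writerow(["word", "count", "lines"])  # hypothetical header names
    for word, (count, lines) in sorted(result.items()):
        writer.writerow([word, count, " ".join(map(str, lines))])

# hypothetical data in the same shape as `result` in the answer
result = {"is": [3, [1, 2]], "Python": [1, [1]]}

buffer = io.StringIO()  # stands in for a real file in this sketch
write_csv(result, buffer)
print(buffer.getvalue())
```

To write to disk instead, a file opened with open('words.csv', 'w', newline='') can be passed as the stream; the csv documentation recommends newline='' for files handed to csv.writer.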

【Solution 2】:

Frequency lists are usually best handled with a Counter.

from collections import Counter

word_count = Counter()
with open('input', 'r') as f:
    for line in f:
        for word in line.split(" "):
            # strip the trailing newline and fold case before counting
            word_count[word.strip().lower()] += 1

for word, count in word_count.items():
    print("word: {}, count: {}".format(word, count))

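【Comments】:

line.split() splits on whitespace by default, so the " " part isn't needed. But perhaps you should have held off on giving a complete answer, since the OP never showed any effort. Nice answer though :-)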
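The difference the comment points at can be seen on a line with a doubled space and a trailing newline (the sample line is made up for illustration):

```python
line = "is  even cooler\n"

# an explicit " " separator splits on every single space, leaving an empty
# string between the two spaces and keeping the newline on the last word
with_arg = line.split(" ")

# with no argument, runs of whitespace (including the newline) act as one separator
without_arg = line.split()

print(with_arg)
print(without_arg)
```

With split(" ") the empty string would itself be counted as a word unless it is filtered out.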

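【Solution 3】:

OK, so you've already identified split as the way to turn a string into a list of words. However, you want to list the lines on which each word occurs, so you should split the string into lines first and then into words. You can then build a dictionary whose keys are the words (lowercased first) and whose values are a structure holding the number of occurrences and the lines on which the word occurs.

You may also want to add some code that checks whether something is a valid word (e.g. whether it contains digits) and that cleans up the words (removing punctuation). I'll leave that to you.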
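The validity check and cleanup that the answer leaves as an exercise might look something like this. It is only a sketch under one possible rule (strip punctuation at both ends, then accept purely alphabetic words); the helper name clean_word is hypothetical.

```python
import string

def clean_word(raw):
    """Return a lowercased word stripped of surrounding punctuation, or None if invalid."""
    word = raw.strip(string.punctuation).lower()
    # isalpha() is False for the empty string and for anything containing
    # digits or leftover punctuation
    return word if word.isalpha() else None

for raw in ["Python,", "(OCaml)", "abc123", "..."]:
    print(repr(raw), "->", clean_word(raw))
```

Under this rule words with internal punctuation, such as "don't", are rejected as well; whether that is desirable depends on the text being processed.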

def wsort(item):
    # sort descending by count, then ascending alphabetically
    word, freq = item
    return -freq['count'], word

def wfreq(text):
    words = {}

    # split by line, then by word
    lines = [line.split() for line in text.split('\n')]

    for i, line_words in enumerate(lines, start=1):
        for word in line_words:
            # lowercase first so that Python and python are the same word
            word = word.lower()
            # if the word is not in the dictionary, create the entry
            if word not in words:
                words[word] = {'count': 0, 'lines': set()}

            # update the count and add the line number to the set
            words[word]['count'] += 1
            words[word]['lines'].add(i)

    # convert from a dictionary to a sorted list using wsort to give the order
    return sorted(words.items(), key=wsort)

inp = "Python is a cool language but OCaml\nis even cooler since it is purely functional"

for word, freq in wfreq(inp):
    # generate the desired list format; sort the set so line numbers come out in order
    lines = " ".join(str(l) for l in sorted(freq['lines']))
    print("%i %s %s" % (freq['count'], word, lines))

This should give exactly the same output as your example:

3 is 1 2
1 a 1
1 but 1
1 cool 1
1 cooler 2
1 even 2
1 functional 2
1 it 2
1 language 1
1 ocaml 1
1 purely 2
1 python 1
1 since 2

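【Comments】:

+1 for the great comments.

【Solution 4】:

First find all the words that occur in the text. Use split().

If the text is in a file, first read it into a single string, called text, removing all the \n characters along the way.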

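filin = open('file', 'r')
di = filin.readlines()

text = ''
for i in di:
    # replace each newline with a space so words on adjacent lines stay separate
    text += i.replace('\n', ' ')

# every word occurrence in the text, in order
words_list = text.split()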

Now check how many times each word occurs in the text. We'll deal with the line numbers later.

dicts = {}
for i in words_list:
    # count whole words only; scanning the raw text for substrings
    # would also match "is" inside a word such as "this"
    dicts[i] = dicts.get(i, 0) + 1

Now we have a dictionary with the words as keys and the number of times each word occurs in the text as values.

Now for the line numbers:

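dicts2 = {}
for i in words_list:
    # start each word with an empty tuple of line numbers
    dicts2[i] = ()
for i in dicts2:
    # rewind the file and scan it once per distinct word
    filin.seek(0)
    count = 1
    for j in filin:
        # compare against whole words on the line, not substrings
        if i in j.split():
            dicts2[i] += (count,)
        count += 1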

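Now dicts2 has the words as keys and, as values, the numbers of the lines on which each word occurs, stored in a tuple.

If the data is already in a string, just split it on \n to get the lines:

di = string_containing_text.split('\n')

Everything else stays the same.

I'm sure you can format the output.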
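For completeness, the formatting step could be sketched like this, assuming dicts maps each word to its count and dicts2 maps each word to a tuple of line numbers, as described above. The function name format_report and the small sample dictionaries are hypothetical.

```python
def format_report(dicts, dicts2):
    """Build output lines such as '3 is 1 2' from the two dictionaries."""
    rows = []
    # most frequent words first, ties broken alphabetically
    for word in sorted(dicts, key=lambda w: (-dicts[w], w)):
        line_numbers = " ".join(str(n) for n in dicts2[word])
        rows.append("{} {} {}".format(dicts[word], word, line_numbers))
    return rows

# hypothetical sample data in the shape described in this answer
dicts = {"is": 3, "a": 1, "cool": 1}
dicts2 = {"is": (1, 2), "a": (1,), "cool": (1,)}

report_rows = format_report(dicts, dicts2)
print("\n".join(report_rows))
```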

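【Comments】:

I apologize for the bad formatting. I'm not at a computer. I hope you can follow it. If not, please leave a comment.
I fixed your formatting, now you have to accept it...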