How to count the words in each sentence of a multi-sentence text in Python: answers

【Question title】: How to count words in a sentence of a text in multiple sentences in python
【Posted】: 2015-06-15 10:55:21
【Question description】:

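I have looked around for a solution to this problem but have not found one yet. I have a large text file that is divided into sentences, separated only by ".". I need to count how many words each sentence has and write that to a file. I am using a separate file for this part of the code, and so far I have this: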

    tekst = open('father_goriot.txt','r').read()
    tekst = tekst.split('.')

With this I get a variable of type "list", with each sentence at its own index. I know that if I write

    print len(tekst[0].split())

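I get the number of words in the first sentence. What I need is some kind of loop to get the number of words in every sentence. After that, I need to write the data to a file in this form: 1. the index number of the sentence in the text, 2. the number of words in that particular sentence, 3. the number of words in the same sentence in a different text (a translation of the first text, processed by the code in a separate file), 4. the number of words the two sentences have in common. Any ideas?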

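【Question comments】:

Tags: python string text words sentence

【Solution 1】:

After searching for a while for a simpler solution, I stumbled upon some code that gives me part of what I want: the number of words in each sentence. It is represented as a list of numbers, like this: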

    wordcounts = []
    with open('father_goriot.txt') as f:
       text = f.read()
       sentences = text.split('.')
       for sentence in sentences:
           words = sentence.split(' ')
           wordcounts.append(len(words))

But the numbers are incorrect, because it also counts other things. For the first sentence I get 40 instead of 38 words. How can I fix this?
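One likely source of the extra counts, offered as a guess since the file itself is not shown: split(' ') splits on every single space, so leading spaces, double spaces and line breaks produce empty strings or glued-together tokens, whereas split() with no argument splits on any run of whitespace. A minimal Python 3 sketch with a made-up sample string:

```python
# Made-up sample: a leading space (left over after the previous '.'),
# a double space, and a line break inside the sentence.
sentence = " The old  man\nwalked home"

by_space = sentence.split(' ')    # splits on every single space character
by_whitespace = sentence.split()  # splits on any run of whitespace

print(by_space)
print(by_whitespace)
print(len(by_space), len(by_whitespace))
```

If this is the cause, replacing sentence.split(' ') with sentence.split() in the loop above should bring the counts closer to the expected values.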

【Comments】:

【Solution 2】:

Just enumerate over the whole file:

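    import re

    with open('data.txt') as data:
        # Treats each line of the file as one sentence
        for line, words in enumerate(data):
            # Split on punctuation and whitespace, dropping empty strings
            tokens = [w for w in re.split(r'[!?.\s]+', words) if w]
            print('Sentence at line {0} has {1} words.'.format(line + 1, len(tokens)))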

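【Comments】:

Thanks for the quick answer, but this bit counts the occurrences of each word. That's not what I'm looking for...
@BLaZZeD I think I fixed it.
I need the number of words in each sentence. The text file consists of 1548 sentences with varying numbers of words. So I'm looking for a loop that finds how many words each of the 1548 sentences has and prints it in the form print("Sentence", sentence_index, " has ", number_of_words, " words.").
@BLaZZeD There you go.

【Solution 3】:

To get a list in which each item corresponds to one sentence: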

def count_words_per_sentence(filename):
    """
    :type filename: str
    :rtype: list[int]
    """
    with open(filename) as f:
        sentences = f.read().split('.')
    return [len(sentence.split()) for sentence in sentences]

To test how many words two sentences have in common, you should use set operations. For example:

    words_1 = sentence_1.split()
    words_2 = sentence_2.split()
    in_common = set(words_1) & set(words_2)  # set intersection
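
As a small worked example of the set intersection above (Python 3): the two sentences are invented for illustration, and lower-casing plus stripping punctuation is an extra assumption that is not part of the answer. The helper name common_words is hypothetical.

```python
import string

def common_words(sentence_1, sentence_2):
    """Return the set of distinct words that occur in both sentences."""
    def normalize(sentence):
        # Lower-case each word and strip surrounding punctuation;
        # drop tokens that end up empty.
        words = (w.strip(string.punctuation).lower() for w in sentence.split())
        return {w for w in words if w}
    return normalize(sentence_1) & normalize(sentence_2)

shared = common_words("The cat sat on the mat", "A cat slept on the sofa")
print(sorted(shared))
print(len(shared))
```

Note that a set holds each word only once, so this counts distinct shared words rather than repeated occurrences.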

For the file I/O, look at the csv module and its writer function. Build your rows as a list of lists (take a look at zip) and then feed that to the csv writer.

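    import csv
    import itertools

    # filename_one and filename_two are the paths of the two text files.
    word_counts_1 = count_words_per_sentence(filename_one)
    word_counts_2 = count_words_per_sentence(filename_two)
    # count_words_in_common_per_sentence is not defined in this answer;
    # it is expected to return one count per pair of sentences.
    in_common = count_words_in_common_per_sentence(filename_one, filename_two)
    rows = zip(itertools.count(1), word_counts_1, word_counts_2, in_common)
    header = [["index", "file_one", "file_two", "in_common"]]
    table = header + list(rows)  # zip returns an iterator in Python 3

    # https://docs.python.org/2/library/csv.html
    with open("my_output_file.csv", 'w') as f:
        writer = csv.writer(f)
        writer.writerows(table)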
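
The code above calls count_words_in_common_per_sentence without defining it. One possible Python 3 sketch, assuming sentence i of the first file corresponds to sentence i of the second, and that "in common" means distinct whitespace-separated tokens appearing in both sentences. The helper read_sentences is a hypothetical name introduced here.

```python
def read_sentences(filename):
    """Split a file's text into sentences on '.', as in the question."""
    with open(filename) as f:
        return f.read().split('.')

def count_words_in_common_per_sentence(filename_one, filename_two):
    """Hypothetical helper: shared distinct words per pair of sentences."""
    sentences_one = read_sentences(filename_one)
    sentences_two = read_sentences(filename_two)
    # zip stops at the shorter list, so unmatched trailing sentences
    # in the longer file are ignored.
    return [
        len(set(s1.split()) & set(s2.split()))
        for s1, s2 in zip(sentences_one, sentences_two)
    ]
```

Since the second text is a translation, exact matches will probably be limited to names, numbers and the like, and the sentence pairing only holds if both files split into the same number of sentences.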

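【Comments】:

What do I put in the parentheses of def count_words_per_sentence()? If I write the file name, I get an invalid syntax error...
I'm not quite sure I understand. You should pass the file name as a string to count_words_per_sentence, i.e. count_words_per_sentence("father_goriot.txt").

【Solution 4】:

You need to loop over the file and read it line by line, like this: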

    with open('file.txt', 'r') as file:
        for line in file:
            # do something with the line, e.g. count its words
            print(len(line.split()))

【Comments】: