【Question Title】: Python count words of split sentence?
【Posted】: 2020-09-11 16:48:53
【Question】:

Not sure how to remove the "\n" at the end of the output.

Basically, I have this txt file containing the following sentences:

"What does Bessie say I have done?" I asked.

"Jane, I don't like cavillers or questioners; besides, there is something truly forbidding in a child 
 taking up her elders in that manner.
 
Be seated somewhere; and until you can speak pleasantly, remain silent."

I managed to split the sentences on semicolons with this code:

import re

with open("testing.txt") as file:
    read_file = file.readlines()
for i, word in enumerate(read_file):
    low = word.lower()
    re.split(';', low)

But I'm not sure how to count the words of the split sentences, since len() doesn't give me what I want. Output of the sentences:

['"what does bessie say i have done?" i asked.\n']
['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a 
child taking up her elders in that manner.\n']
['be seated somewhere', ' and until you can speak pleasantly, remain silent."\n']

For the third sentence, for example, I want to count 3 words in the left part and 8 in the right part.
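For instance, a minimal sketch of counting whitespace-separated words in each part, using the third line from the sample file:

```python
# Hypothetical sketch: split one sample line on ';' and count the
# whitespace-separated words in each resulting part.
line = 'be seated somewhere; and until you can speak pleasantly, remain silent."\n'
parts = line.lower().strip().split(';')
# str.split() with no argument splits on any run of whitespace and
# drops empty strings, so leading spaces and the trailing '\n' are harmless.
counts = [len(part.split()) for part in parts]
print(counts)  # [3, 8]
```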

Thanks for reading!

【Comments】:

  • Can't you just split on whitespace and take the length of the resulting list?
  • Does this answer your question? Count Words in Python
  • Check out .splitlines()
  • Regex also has things like \b and \w that might help you. You should give an example of the result you're aiming for with this kind of data.
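Following the last comment's hint, a hedged sketch of counting words with re.findall and \w+, applied to the first sample sentence:

```python
import re

# \w+ matches runs of word characters, so punctuation and the
# trailing newline never inflate the count.
sentence = '"what does bessie say i have done?" i asked.\n'
words = re.findall(r"\w+", sentence)
print(len(words))  # 9
```

One caveat: \w+ splits contractions such as "don't" into two tokens ("don", "t"), so it over-counts those compared with whitespace splitting.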

Tags: python python-3.x nlp


【Solution 1】:

The number of words is the number of spaces plus one.

For example, two spaces, three words:

life is good

Code:

import re
import string

lines = []
with open('file.txt', 'r') as f:
    lines = f.readlines()

DELIMITER = ';'
word_count = []
for i, sentence in enumerate(lines):
    # Skip empty sentences
    if not sentence.strip():
        continue
    # Remove punctuation besides our delimiter ';'
    sentence = sentence.translate(str.maketrans('', '', string.punctuation.replace(DELIMITER, '')))
    # Split by our delimiter
    splitted = re.split(DELIMITER, sentence)
    # The number of words is the number of spaces plus one
    word_count.append([1 + x.strip().count(' ') for x in splitted])

# [[9], [7, 9], [7], [3, 8]]
print(word_count)

【Comments】:

    【Solution 2】:

    Use str.rstrip('\n') to remove the \n at the end of each sentence.

    To count the words in a sentence, you can use len(sentence.split()) (splitting with no argument ignores leading and trailing whitespace).

    To turn the list of sentences into a list of counts, you can use the map function.

    Putting it all together:

    import re
    
    with open("testing.txt") as file:
        for i, line in enumerate(file.readlines()):
            # Ignore empty lines
            if line.strip(' ') != '\n':
                line = line.lower()
                # Split by semicolons
                parts = re.split(';', line)
                print("SENTENCES:", parts)
                counts = list(map(lambda part: len(part.split()), parts))
                print("COUNTS:", counts)
    

    Output:

    SENTENCES: ['"what does bessie say i have done?" i asked.']
    COUNTS: [9]
    SENTENCES: ['"jane, i don\'t like cavillers or questioners', ' besides, there is something truly forbidding in a child ']
    COUNTS: [7, 9]
    SENTENCES: [' taking up her elders in that manner.']
    COUNTS: [7]
    SENTENCES: ['be seated somewhere', ' and until you can speak pleasantly, remain silent."']
    COUNTS: [3, 8]
    

    【Comments】:

      【Solution 3】:

      You need the nltk library:

      from nltk import sent_tokenize, word_tokenize
      
      mytext = """I have a dog. 
      The dog is called Bob."""
      
      for sent in sent_tokenize(mytext): 
          print(len(word_tokenize(sent)))
      

      Output:

      5
      6
      

      Step-by-step explanation:

      for sent in sent_tokenize(mytext): 
          print('Sentence >>>',sent) 
          print('List of words >>>',word_tokenize(sent)) 
          print('Count words per sentence>>>', len(word_tokenize(sent))) 
      

      Output:

      Sentence >>> I have a dog.
      List of words >>> ['I', 'have', 'a', 'dog', '.']
      Count words per sentence>>> 5
      Sentence >>> The dog is called Bob.
      List of words >>> ['The', 'dog', 'is', 'called', 'Bob', '.']
      Count words per sentence>>> 6
      

      【Comments】:

        【Solution 4】:


        import re
        sentences = []                                                   # empty list for storing the result
        with open('testtext.txt') as fileObj:
            lines = [line.strip() for line in fileObj if line.strip()]   # build a list of lines, already stripped of '\n'
        for line in lines:
            sentences += re.split(';', line)                             # split lines on ';' and collect the parts in sentences
        for sentence in sentences:
            print(sentence + ' ' + str(len(sentence.split())))           # print each sentence with its word count
        

        【Comments】:

          【Solution 5】:

          Try this:

          import re

          with open("testing.txt") as file:
              read_file = file.readlines()
          for i, word in enumerate(read_file):
              low = word.lower()
              low = low.strip()
              low = low.replace('\n', '')
              re.split(';', low)
          

          【Comments】:

          • Why strip twice and then remove the \n? Also, the result of re.split doesn't go anywhere.
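A corrected sketch of the snippet above that addresses the comment: strip once, keep the split result, and count the words per part. The helper name sentence_word_counts is made up for illustration, and the sample list stands in for the question's testing.txt:

```python
import re

def sentence_word_counts(lines):
    """Split each non-blank line on ';' and count the words in each part."""
    result = []
    for line in lines:
        low = line.lower().strip()  # one strip handles spaces and the '\n'
        if not low:
            continue  # skip blank lines
        parts = re.split(";", low)
        result.append([len(p.split()) for p in parts])
    return result

sample = [
    '"What does Bessie say I have done?" I asked.\n',
    "\n",
    'Be seated somewhere; and until you can speak pleasantly, remain silent."\n',
]
print(sentence_word_counts(sample))  # [[9], [3, 8]]
```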