How to capitalize the beginning of every sentence in a text in Python? [duplicate]

【Question title】: How to capitalize every beginning of a sentence in a text in python? [duplicate]
【Posted】: 2025-12-18 00:45:02
【Question】:

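I want to write a function that takes a text string as input and capitalizes every letter that follows a punctuation mark. The problem is that strings don't work like lists, so I don't really know how to do it. I tried the following, but it doesn't seem to work: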

def capitalize(strin):
    listrin=list(strin)
    listrin[0]=listrin[0].upper()
    ponctuation=['.','!','?']
    strout=''
    for x in range (len(listrin)):
        if listrin[x] in ponctuation:
            if x!=len(listrin):
                if listrin[x+1]!=" ":
                    listrin[x+1]=listrin[x+1].upper()
                elif listrin[x+2]!=" ":
                    listrin[x+1]=listrin[x+1].upper()
    for y in range(len(listrin)):
        strout=strout+listrin[y]
    return strout

Currently I am trying to solve it with the following string: 'hello! how are you? please remember capitalization. EVERY time.'

【Question discussion】:

Tags: python capitalization

【Solution 1】:

I would use a regular expression for this.

>>> import re
>>> line = 'hi. hello!   how are you?  fine!  me too, haha. haha.'
>>> re.sub(r"(?:^|(?:[.!?]\s+))(.)",lambda m: m.group(0).upper(), line)
'Hi. Hello!   How are you?  Fine!  Me too, haha. Haha.'
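
As a rough sketch, the one-liner above could be wrapped in a reusable function with the pattern spelled out in comments. The names SENTENCE_START and capitalize_sentences are made up for illustration, and the pattern is written without the redundant inner group but is meant to match the same text:

```python
import re

# A sentence start is either the beginning of the string, or one of
# '.', '!' or '?' followed by at least one whitespace character.
# The capturing group (.) is the first character of the sentence.
SENTENCE_START = re.compile(r"(?:^|[.!?]\s+)(.)")

def capitalize_sentences(text):
    """Upper-case the first character of every sentence in text."""
    # m.group(0) is the whole match: punctuation, whitespace and first letter.
    # upper() leaves punctuation and whitespace unchanged.
    return SENTENCE_START.sub(lambda m: m.group(0).upper(), text)

print(capitalize_sentences('hello! how are you? please remember capitalization. EVERY time.'))
```

Unlike str.capitalize(), this only touches the first character of each sentence, so 'EVERY' keeps its upper-case letters.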

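【Discussion】:

Find the first non-space character of a sentence and convert it to upper case. Repeat for every sentence. A sentence starts at the beginning of a paragraph, or after one of ".!?" followed by some whitespace.
I meant for Sammy to explain it :)
Yes, I'm teaching him to explain. :)
Thank you very much for your answer, it helps me a lot. I don't fully understand what is going on yet, but I will try to look into it more deeply.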

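【Solution 2】:

The most basic approach is to split the text into sentences at the punctuation marks, which gives you a list. Then loop over the items of the list, strip() each one and capitalize() it. Something like the following may solve your problem: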

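import re
input_sen = 'hello! how are you? please remember capitalization. EVERY time.'
# Example pattern (an assumption; substitute your own punctuation set):
# split on whitespace that follows '.', '!' or '?', keeping the punctuation.
sentences = re.split(r'(?<=[.!?])\s+', input_sen)
for s in sentences:
    # Note: capitalize() also lower-cases the rest, so 'EVERY time.' becomes 'Every time.'
    print(s.strip().capitalize(), end=' ')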

It is better to use the nltk library, though:

from nltk.tokenize import sent_tokenize
input_sen = 'hello! how are you? please remember capitalization. EVERY time.'
sentences = sent_tokenize(input_sen)
sentences = [sent.capitalize() for sent in sentences]
print(sentences)

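Using NLTK or another NLP library is better than hand-writing rules and regular expressions, because it handles many cases we would not think of. It addresses the sentence boundary disambiguation problem.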

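Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Natural language processing tools often require their input to be divided into sentences for a number of reasons. However, sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, a decimal point, an ellipsis, or an email address rather than the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. Question marks and exclamation marks may also appear in embedded quotations, emoticons, computer code, and slang. Languages like Japanese and Chinese have unambiguous sentence-ending markers.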
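
To make the ambiguity concrete, here is a small sketch of how a naive "punctuation followed by whitespace" rule misfires on abbreviations. The function name naive_capitalize and the sample sentence are invented for illustration:

```python
import re

def naive_capitalize(text):
    # Naive rule (an assumption for illustration): every '.', '!' or '?'
    # followed by whitespace is treated as the end of a sentence.
    return re.sub(r"(?:^|[.!?]\s+)(.)", lambda m: m.group(0).upper(), text)

sample = 'the meeting is at 5 p.m. today. see e.g. the agenda.'
print(naive_capitalize(sample))
# prints: The meeting is at 5 p.m. Today. See e.g. The agenda.
```

The periods that close "p.m." and "e.g." are mistaken for sentence ends, so "Today" and "The" are capitalized wrongly. This is the kind of case a trained sentence tokenizer is meant to handle better.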

Hope this helps.

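【Discussion】:

Thank you very much! What is tokenization?
I updated the code and used a different approach to split the input into sentences. Tokenizers are used to divide a string into a list of substrings. For example, sent_tokenize is used to find the list of sentences. @SammySteffensen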
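
As a rough illustration of what "dividing a string into a list of substrings" means, the sketch below tokenizes at two levels with plain regular expressions. These patterns are simplified stand-ins chosen for this example, not how nltk's tokenizers work internally:

```python
import re

text = 'hello! how are you? please remember capitalization.'

# Sentence-level tokens: split on whitespace that follows '.', '!' or '?'.
sentences = re.split(r'(?<=[.!?])\s+', text)
print(sentences)
# prints: ['hello!', 'how are you?', 'please remember capitalization.']

# Word-level tokens: runs of word characters, or single punctuation marks.
words = re.findall(r"\w+|[^\w\s]", sentences[1])
print(words)
# prints: ['how', 'are', 'you', '?']
```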