【Question title】: Split the text into paragraphs
【Posted】: 2018-11-16 23:57:23
【Question】:

I know I can use something like this

theText='She loves music. Her favorit instrument is the piano.\n\n However, \n\n she does not play it.'
paragraphs = [p for p in theText.split('\n\n') if p]
for i,p in enumerate(paragraphs):
    print(i,p)

to split the text into paragraphs.

However, I would like to add an extra condition: a new paragraph must not start with a lowercase letter. The current code prints

0 She loves music. Her favorit instrument is the piano.
1  However, 
2  she does not play it.

whereas I would like

0 She loves music. Her favorit instrument is the piano.
1  However, she does not play it.

I think I should use a regular expression, but I cannot figure out the right pattern.

【Question comments】:

    Tags: regex string python-3.x nlp


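    【Solution 1】:

    You can use the following regular expression. It uses a lookahead, (?=...), to make sure that the \n\n is followed by an uppercase letter (after one optional whitespace character). In addition, when enumerating the results you have to strip the \n\n that remains inside a paragraph (done here with re.sub):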

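    import re
    # Split on a blank line only when an uppercase letter follows (after one optional whitespace character).
    paragraphs = re.split(r'\n\n\s?(?=[A-Z])', theText)
    for i, p in enumerate(paragraphs):
        # Remove the blank-line breaks that were left inside a paragraph.
        print(i, re.sub(r'\n\n\s?', '', p))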
    
    0 She loves music. Her favorit instrument is the piano.
    1 However, she does not play it.
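The same idea can be wrapped in a small helper. This is only a minimal sketch, not part of the original answer: the name `split_paragraphs` is hypothetical, and it assumes that paragraphs are separated by two or more newlines and that "uppercase" means ASCII A-Z, as in the pattern above. Unlike the snippet above, it replaces a leftover break with a single space instead of deleting it.

```python
import re

def split_paragraphs(text):
    """Split on blank lines, but only where the next paragraph starts with A-Z."""
    # Split at a run of 2+ newlines only when an uppercase letter follows
    # (after optional spaces or tabs).
    parts = re.split(r'\n{2,}[ \t]*(?=[A-Z])', text)
    # Any blank-line break still inside a part precedes non-uppercase text:
    # collapse it, with its surrounding whitespace, into a single space.
    return [re.sub(r'\s*\n{2,}\s*', ' ', p).strip() for p in parts if p.strip()]

theText = 'She loves music. Her favorit instrument is the piano.\n\n However, \n\n she does not play it.'
for i, p in enumerate(split_paragraphs(theText)):
    print(i, p)
```

Returning a list of cleaned paragraphs keeps the splitting and the clean-up in one place, so the caller does not need a second re.sub inside the loop.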
    

    【Comments】:

    • re.split(r'\n+', re.sub(r'\n+(?= [a-z])', '', theText))
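The comment's approach reverses the order: first remove the breaks that precede a lowercase letter, then split on whatever newlines remain. A rough sketch of that idea follows. The function name `split_after_merging` is hypothetical, and the patterns are an adaptation rather than the commenter's exact code: they use `\n+` so that the split never matches an empty string, and they treat any run of newlines (not only blank lines) as a break.

```python
import re

def split_after_merging(text):
    # Step 1: replace a newline run (plus surrounding whitespace) with one space
    # when a lowercase ASCII letter follows, gluing the text to the previous paragraph.
    merged = re.sub(r'\s*\n+\s*(?=[a-z])', ' ', text)
    # Step 2: split on the newline runs that are left and drop empty pieces.
    return [p.strip() for p in re.split(r'\n+', merged) if p.strip()]

theText = 'She loves music. Her favorit instrument is the piano.\n\n However, \n\n she does not play it.'
for i, p in enumerate(split_after_merging(theText)):
    print(i, p)
```

Merging first means the split pattern needs no lookahead, at the cost of two passes over the text.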