【Question title】: Regex splitting on multiple grouped delimiters
【Posted】: 2021-12-27 18:16:48
【Question】:

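How can I split on grouped delimiter combinations such as '1.' and '2)'?

For example, given a string like '1. I like food! 2. She likes 2 baloons.', how can you split it into its sentences?

Another example: given the input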

'1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'

the output should be

['3D Technical', 'Process animations', 'Explainer videos', 'Product launch videos']

I tried:

import re

a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
re.split(r'[1.2.3.,1)2)3)/]+|etc', a)

The output is:

['',
 'D Technical',
 'Process animations',
 ' Explainer videos',
 ' Product launch videos']

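【Comments】:

  • Is / the only character to split on within a sentence?
  • Is there always a comma?
  • Your first example is more general than what you actually need. Do you really need to handle 1.?
  • What does etc mean in your regex?
  • etc is only there in case any sentence ends with etc.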

Tags: python python-re


【Solution 1】:

Here is one way to get the expected result:

import re

a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
r = [s for s in map(str.strip,re.split(r',? *[0-9]+(?:\)|\.) ?', a)) if s]

print(*r,sep='\n')
3D Technical/Process animations
Explainer videos
Product launch videos
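  • The delimiter pattern r',? *[0-9]+(?:\)|\.) ?' breaks down as follows:
    • ,? an optional comma trailing the previous item
    • ' *' zero or more spaces before the number
    • [0-9]+ a sequence of at least one digit
    • (?:\)|\.) followed by a closing parenthesis or a period. The leading ?: makes this a non-capturing group, so re.split does not include it in the output
    • ' ?' an optional space after the parenthesis or period (you may need to remove the ? or replace it with +, depending on your actual data)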

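The output of re.split is mapped through str.strip to remove leading and trailing whitespace. The mapping runs inside a list comprehension that filters out empty strings (for example, the one before the first delimiter).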
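
To see what each stage contributes, the sketch below prints the raw re.split result, the same split with a capturing group in place of (?:...), and the cleaned list. The variable names raw, captured and cleaned are illustrative and not part of the original answer.

```python
import re

a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'

# Non-capturing group: the delimiters are discarded, but the empty string
# before the first delimiter is still present in the raw result.
raw = re.split(r',? *[0-9]+(?:\)|\.) ?', a)
print(raw)

# Capturing group: re.split inserts the captured ')' or '.' between the pieces.
captured = re.split(r',? *[0-9]+(\)|\.) ?', a)
print(captured)

# Strip whitespace, then drop empty strings.
cleaned = [s for s in map(str.strip, raw) if s]
print(cleaned)
```

With a capturing group the delimiter fragments would have to be filtered out afterwards, which is why the non-capturing form is the more convenient choice here.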

If commas or slashes without numbering are also used as delimiters, you can add them to the pattern:

def splitItems(a):
    pattern = r'/|,|(?:,? *[0-9]+(?:\)|\.) ?)'
    return [s for s in map(str.strip,re.split(pattern, a)) if s]

Output:

a = '3D Technical/Process animations, Explainer videos, Product launch videos'
print(*splitItems(a),sep='\n')

3D Technical
Process animations
Explainer videos
Product launch videos


a = '1. Hello 2. Hi'
print(*splitItems(a),sep='\n')
Hello
Hi

a = "Great, what's up?! , Awesome"
print(*splitItems(a),sep='\n')
Great
what's up?!
Awesome

a = '1. Medicines2. Devices 3.Products'
print(*splitItems(a),sep='\n')
Medicines
Devices
Products

a = 'ABC/DEF/FGH'
print(*splitItems(a),sep='\n')
ABC
DEF
FGH
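
As a quick check, the combined pattern can also be run on the numbered input from the question. The sketch below repeats the pattern from the answer under the hypothetical name split_items_combined so that it runs on its own, and prints the raw pieces to show where the empty strings come from: ',' is listed before the numbered alternative, so it matches alone and leaves an empty piece in front of the following ' 2) '.

```python
import re

# Same pattern as in the answer; the names PATTERN and
# split_items_combined are illustrative only.
PATTERN = r'/|,|(?:,? *[0-9]+(?:\)|\.) ?)'

def split_items_combined(text):
    """Split on '/', ',' or a numbered prefix, then strip and drop empties."""
    pieces = re.split(PATTERN, text)
    return [s for s in map(str.strip, pieces) if s]

a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'

# Raw pieces, including the empty strings that the filter removes.
print(re.split(PATTERN, a))

# Cleaned result.
print(split_items_combined(a))
```

For this input the cleaned result is the four-item list the question asks for. The trade-off is that every comma and slash becomes a split point, including commas inside a numbered item.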

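If your delimiters form a list of either/or patterns (meaning only one pattern ever applies to a given string), you can try them in priority order in a loop and return the first split that produces more than one part: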

def splitItems(a):
    for pattern in ( r'(?:,? *[0-9]+(?:\)|\.) ?)', r',', r'/' ):
        result = [*map(str.strip,re.split(pattern, a))]
        if len(result)>1: break
    return [s for s in result if s]

Output:

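# The earlier examples still split; since only one pattern is applied per string,
# a string mixing ',' and '/' is split on ',' only. This one works as well: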

a = '1. Arrangement of Loans for Listed Corporates and their Group Companies, 2. Investment Services wherein we assist Corporates, Family Offices, Business Owners and Professionals to invest their   Surplus Funds to invest in different products such as Stocks, Mutual Funds, Bonds, Fixed Deposit, Gold Bonds,PMS etc 3. Estate Planning'
print(*splitItems(a),sep='\n')

Arrangement of Loans for Listed Corporates and their Group Companies
Investment Services wherein we assist Corporates, Family Offices, Business Owners and Professionals to invest their   Surplus Funds to invest in different products such as Stocks, Mutual Funds, Bonds, Fixed Deposit, Gold Bonds,PMS etc
Estate Planning
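
The priority behaviour can be sketched on a few short strings. This is a minimal illustration assuming the same three patterns as above; the names PATTERNS and split_by_priority and the sample strings are hypothetical, not taken from the answer.

```python
import re

# Patterns in priority order: numbered prefixes first, then ',', then '/'.
PATTERNS = (r'(?:,? *[0-9]+(?:\)|\.) ?)', r',', r'/')

def split_by_priority(text):
    """Return the first split that yields more than one part (hypothetical helper)."""
    result = [text.strip()]
    for pattern in PATTERNS:
        result = [s.strip() for s in re.split(pattern, text)]
        if len(result) > 1:
            break  # this pattern applies; ignore the lower-priority ones
    return [s for s in result if s]

samples = (
    '1. Stocks, Bonds 2. Gold',                           # numbered: inner comma is kept
    '3D Technical/Process animations, Explainer videos',  # comma wins over slash
    'ABC/DEF/FGH',                                        # falls through to slash
)
for sample in samples:
    print(split_by_priority(sample))
```

Because the loop stops at the first pattern that splits the string, commas inside numbered items survive, at the cost of never combining two delimiter types in one string.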

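【Comments】:

  • Could you edit your answer and explain your code as simply as possible?
  • '3D Technical/Process animations, Explainer videos, Product launch videos' This does not work for this example, please check.
  • There are 3 kinds of data: '1. Hello 2. Hi', "Great, what's up?! , Awesome" and '1. Medicines2. Devices 3.Products'. There is also data like 'ABC/DEF/FGH'.
  • OK, I am not trying to confuse you, but please modify the code so that it works for the sentence in the first comment on this answer as well as the sentences in your answer. That will solve 90% of the problem in my original question. :p
  • Wow, the newly edited code works great!! One last thing: '1. Arrangement of Loans for Listed Corporates and their Group Companies, 2. Investment Services wherein we assist Corporates, Family Offices, Business Owners and Professionals to invest their Surplus Funds to invest in different products such as Stocks, Mutual Funds, Bonds, Fixed Deposit, Gold Bonds,PMS etc 3. Estate Planning' Could you modify the code so that it splits a sentence like the one above on its numeric prefixes?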