【Question title】: Extract substrings using Python regex
【Posted】: 2021-10-14 06:10:01
【Question】:

I want to use a regular expression to match any text between two strings:

   sample_string= "Message ID: SM9MatRNTnMAYaylR0QgOH///qUUveBCbw==  
    2021-07-10T20:48:23.997Z john s (X Y Bank) -
    john.s@xy.com:  
     [EVENT] 347376954900491 (john.s@xy.com) created room
    (roomName='CSTest' roomDescription='CS Test Chat Room' COPY_DISABLED=false
    READ_ONLY=false DISCOVERABLE=false MEMBER_ADD_USER_ENABLED=false
    roomType=PRIVATE conversationScope=internal owningCompany=X Y
    Bank)
    
    Message ID: nsabNaqeXfuEj9mBEhvS0n///qUUveAhbw==  
    2021-07-10T20:48:23.997Z john s (X Y Bank) -
    john.s@xy.comsays  
     [EVENT] 347376954900491 (john.s@xy.com) invited 347376954900486
    (kerren.n@xy.com) to room (CSTest|john s|16091907435583)
    
    Message ID: Nu/EYTkTQ5qdbqzZ0Rig8n///qUUvQ42dA==  
    2021-07-10T20:48:23.997Z john s (X Y Bank) -
    john.s@xy.comsays  
    
    Catchyou later
    
      
    
    Message ID: dy2yaByqhm+n88Gd3VQOhH///qUUrz8odA==  
    2021-07-10T20:48:23.997Z kerren n (X Y Bank) -
    nancy.n@xy.comsays  
    
    KeywordContent_ Cricket is a bat-and-ball game played between two teams of
    eleven players on a field at the centre of which is a 20-metre (22-yard) pitch
    with a wicket at each end, each comprising two bails balanced on three stumps.
    The batting side scores runs by striking the ball bowled at the wicket with
    the bat, while the bowling and fielding side tries to prevent this and dismiss
    each player (so they are "out").
    
      
    
    * * *
    
    Generated by Content Export Service | Stream Type: SymphonyPost |
    Stream ID: ZZo5pRRPFC18uzlonFjya3///qUUveBHdA== | Room Type: Private |
    Conversation Scope: internal | Owning Company: X Y Bank | File
    Generated Date: 2021-07-10T20:48:23.997Z | Content Start Date:
    2021-07-10T20:48:23.997Z | Content Stop Date: 2021-07-10T20:48:23.997Z  
    
    * * *
    
    *** (780787) Disclaimer: 
    (incorporated in paris with Ref. No. ZC18, is authorised by Prudential Regulation
    Authority (PRA) and regulated by Financial Conduct Authority and PRA. oyp and
    its affiliates (We) monitor this confidential message meant for your
    information only. We make no recommendation or offer. You should get
    independent advice. We accept no liability for loss caused hereby. See market
    commentary disclaimers (
    http://wholesalebanking.com/en/utility/Pages/d-mkt.aspx ),
    Dodd-Frank and EMIR disclosures (
    http://wholesalebanking.com/en/capabilities/financialmarkets/Pages/default.aspx
    ) "

In this example, I want to extract everything after the email ID and up to the next Message ID: keyword. So the expected output is:

extracted_list =[':  
 [EVENT] 347376954900491 (john.s@xy.com) created room
(roomName='CSTest' roomDescription='CS Test Chat Room' COPY_DISABLED=false
READ_ONLY=false DISCOVERABLE=false MEMBER_ADD_USER_ENABLED=false
roomType=PRIVATE conversationScope=internal owningCompany=X Y
Bank)','says  
 [EVENT] 347376954900491 (john.s@xy.com) invited 347376954900486
(kerren.n@xy.com) to room (CSTest|john s|16091907435583)','says Catchyou later','says 
KeywordContent_ Cricket is a bat-and-ball game played between two teams of
eleven players on a field at the centre of which is a 20-metre (22-yard) pitch
with a wicket at each end, each comprising two bails balanced on three stumps.
The batting side scores runs by striking the ball bowled at the wicket with
the bat, while the bowling and fielding side tries to prevent this and dismiss
each player (so they are "out").']

Note: nothing after the *** at the end is part of the text.

What I have tried so far:

text = re.findall(r'\S+@\S+\s+(.*)Message ID', sample_string)
print (text)
##output: []

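【Question comments】:

  • So, basically, your question is: how do I extract the part of a text (string) between the emailID and Message ID? Always try to provide a minimal example rather than a wall of text.
  • @MarkusWeninger Yes, sorry, I'm new to this platform.
  • Is there supposed to be an emailID in there?
  • @Jesper The emailID comes right after (X Y Bank) -
  • I think you mean [^\s@]+@[^\s@]+\s(.*?)\bMessage ID\b regex101.com/r/zd5w8v/1, but you have to pass re.DOTALL as the last argument to re.findall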
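
The re.DOTALL point in the last comment can be illustrated with a small sketch. The `sample` string below is a shortened, hypothetical stand-in for the question's `sample_string`, not the original data. Note that with this pattern the trailing "says" is consumed as part of the email address, so it does not appear in the captured group.

```python
import re

# Shortened, hypothetical stand-in for the question's sample_string.
sample = (
    "john.s@xy.comsays\n"
    "Catchyou later\n"
    "\n"
    "Message ID: abc\n"
)

# Pattern suggested in the comment above.
pattern = r'[^\s@]+@[^\s@]+\s(.*?)\bMessage ID\b'

# Without re.DOTALL, "." cannot cross a newline, so nothing matches.
without_dotall = re.findall(pattern, sample)
print(without_dotall)  # []

# With re.DOTALL, "." also matches "\n", so the lazy group can span lines.
with_dotall = re.findall(pattern, sample, re.DOTALL)
print(with_dotall)  # ['Catchyou later\n\n']
```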

Tags: python-3.x regex


【Solution 1】:

You can use

(?s)\S+@\S+?((?:says?|:)?\s.*?)\s+(?:Message ID|\* +\* +\*)

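See the regex demo.

Details

  • (?s) - same as re.DOTALL; an inline modifier that makes . match across line breaks
  • \S+ - one or more non-whitespace characters (can be replaced with [^\s@]+)
  • @ - an @ character
  • \S+? - one or more non-whitespace characters, as few as possible
  • ((?:says?|:)?\s.*?) - Group 1: an optional says, say, or :, then a whitespace character, then zero or more characters, as few as possible
  • \s+ - one or more whitespace characters
  • (?:Message ID|\* +\* +\*) - either Message ID or a * * *-like substring (asterisks separated by one or more spaces).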
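
As a minimal sketch of how the pattern above might be applied with re.findall: the `sample` text here is a shortened, hypothetical stand-in for the question's `sample_string`, and the names `PATTERN` and `matches` are illustrative only.

```python
import re

# The pattern from the answer, unchanged. The leading (?s) makes "." match newlines.
PATTERN = re.compile(r'(?s)\S+@\S+?((?:says?|:)?\s.*?)\s+(?:Message ID|\* +\* +\*)')

# Shortened, hypothetical stand-in for the question's sample_string.
sample = (
    "Message ID: AAA==\n"
    "2021-07-10T20:48:23.997Z john s (X Y Bank) -\n"
    "john.s@xy.com:\n"
    " [EVENT] 1 (john.s@xy.com) created room\n"
    "\n"
    "Message ID: BBB==\n"
    "2021-07-10T20:48:23.997Z john s (X Y Bank) -\n"
    "john.s@xy.comsays\n"
    "\n"
    "Catchyou later\n"
    "\n"
    "* * *\n"
    "Generated by Content Export Service\n"
)

# With a single capturing group, findall returns the text of group 1 per match.
matches = PATTERN.findall(sample)
for m in matches:
    print(repr(m))
# ':\n [EVENT] 1 (john.s@xy.com) created room'
# 'says\n\nCatchyou later'
```

Each captured item starts at the ":" or "says" that follows the email address and stops before the whitespace preceding the next Message ID or the * * * separator, so the export footer is not captured.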

【Comments】:
