【Question title】: Want Regex to stop at first occurrence of "." and ";"
【Posted】: 2014-06-13 11:56:05
【Question】:

I am trying to extract sentences from a paragraph that follow a pattern like

 Current. time is six thirty at Scotland. Past. time was five thirty at India; Current. time is five thirty at Scotland. Past. time was five thirty at Scotland. Current. time is five ten at Scotland.

When I use the regex

/current\..*scotland\./i

it matches the whole string

Current. time is six thirty at Scotland. Past. time was six thirty at India; Current. time is five thirty at Scotland. Past. time was five thirty at Scotland. Current. time is five ten at Scotland.

Instead, I want each match to stop at the first occurrence of ".", like

 Current. time is six thirty at Scotland.
 Current. time is five ten at Scotland. 

Similarly, for the text

 Past. time was five thirty at India; Current. time is six thirty at Scotland. Past. time was five thirty at Scotland. Past. time was five ten at India;    

when I use the regex

 /past\..*india\;/i

it matches the whole string

 Past. time was five thirty at India; Current. time is six thirty at Scotland. Past. time was five thirty at Scotland. Past. time was five ten at India; 

Here I want to capture all the groups, or just the first one, like the following. How do I stop at the first occurrence of ";"?

Past. time was five thirty at India; 
Past. time was five ten at India; 

How do I make the regex stop at "." or ";" in the examples above?

【Comments】:

  • Your pattern is greedy and will match as much as possible. Make it lazy by appending ?, so that it looks like /current\..*?scotland\./i

Tags: ruby regex ruby-on-rails-3 nlp


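【Solution 1】:

There are a few things you shouldn't be doing with your regex. Firstly, as Amal Murali pointed out, you shouldn't use a greedy quantifier here; use the lazy version instead: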

/current\..*?scotland\./i
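To see the difference between the two quantifiers, here is a minimal sketch that runs both patterns from the question against its first sample paragraph. The variable names are only illustrative.

```ruby
# Sample paragraph from the question.
text = "Current. time is six thirty at Scotland. Past. time was five thirty at India; " \
       "Current. time is five thirty at Scotland. Past. time was five thirty at Scotland. " \
       "Current. time is five ten at Scotland."

# String#[] with a Regexp returns the first match (or nil).
greedy = text[/current\..*scotland\./i]   # .* runs on to the LAST "Scotland."
lazy   = text[/current\..*?scotland\./i]  # .*? stops at the FIRST "Scotland."

puts greedy == text  # the greedy match swallows the whole paragraph
puts lazy
```

The greedy version only gives something back after it has consumed as much as it can, which is why it ends at the final "Scotland." in the paragraph.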

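I think reaching for the lazy option first is a good general rule with regexes, since it is usually what you want. Secondly, you don't really want to use . to match everything, because you don't want this part of the regex to match . or ;. You can put them in a negated character class to match anything except them: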

/current\.[^.]*?scotland\./i

/current\.[^;]*?india;/i

或同时覆盖:

/(current|past)\.[^.;]*?(india|scotland)[.;]/i

(Obviously this may not be exactly what you want to do; it is only included to show how the pattern can be extended.)
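One thing to watch if the combined pattern is used with String#scan: when a pattern contains capturing groups, scan returns the groups rather than the whole matches. The sketch below, on a shortened piece of the sample text, shows both behaviours; switching to non-capturing groups (?:...) is one possible way to get the full sentences back, not something the answer itself prescribes.

```ruby
text = "Current. time is six thirty at Scotland. Past. time was five thirty at India;"

# With capturing groups, scan yields one array of group values per match.
groups = text.scan(/(current|past)\.[^.;]*?(india|scotland)[.;]/i)

# With non-capturing groups, scan yields the whole matched substrings.
whole = text.scan(/(?:current|past)\.[^.;]*?(?:india|scotland)[.;]/i)

p groups
p whole
```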

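It is also a good rule of thumb that, if you are having trouble with a regex, you should make any wildcards more specific (in this case, changing from matching everything with . to matching everything except . and ; with [^.;]).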
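As an illustration of why the more specific wildcard matters, the sketch below runs a lazy-only pattern and a negated-class pattern over the second sample paragraph from the question. The lazy pattern still crosses a sentence boundary when the sentence it starts in does not end with "India;", while the negated class cannot. The variable names are illustrative.

```ruby
# Second sample paragraph from the question.
text = "Past. time was five thirty at India; Current. time is six thirty at Scotland. " \
       "Past. time was five thirty at Scotland. Past. time was five ten at India;"

# Lazy wildcard: the second match starts at "Past. ... Scotland." and keeps going
# until it finally reaches "India;" in the following sentence.
lazy_only = text.scan(/past\..*?india;/i)

# Negated class: the body of the match may not contain "." or ";",
# so every match stays inside a single sentence.
negated = text.scan(/past\.[^.;]*india;/i)

p lazy_only
p negated
```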

【Comments】:

  • Well-written answer, +1 :)
  • @zx81 Thanks. Always nice to hear. :)
【Solution 2】:
s = "Current. time is six thirty at Scotland. Past. time..."
# (?:Current|Past) is an alternation of whole words; [.;] stops at the first "." or ";"
s.scan(/(?:Current|Past)\..*?[.;]/i)

#=> ["Current. time is six thirty at Scotland.", "Past. time was five thirty at India;",...]
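A note on bracket syntax: in a Ruby regex, square brackets form a character class, so [Current|Past] matches any single one of those letters (or a literal |) rather than either word, and [.|;] also accepts a literal |. The sketch below uses a made-up string to compare a class-based pattern with one that uses a non-capturing group; the string and variable names are assumptions for illustration only.

```ruby
# Made-up input: "Rant" is built only from letters that also occur in "Current" and "Past".
s = "Rant. something else. Current. time is six thirty at Scotland."

# Character classes: any run of the letters c,u,r,e,n,t,p,a,s (or "|") may precede the dot.
by_class = s.scan(/[Current|Past]*\..*?[.|;]/i)

# Non-capturing group: only the whole words "Current" or "Past" may precede the dot.
by_group = s.scan(/(?:Current|Past)\..*?[.;]/i)

p by_class
p by_group
```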

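【Comments】:

    【Solution 3】:

    As Amal said, your pattern is greedy and you should append a ? to make it lazy. I would use the following to get only the first occurrence of the string you are asking for: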

    /^.*?current\..*?scotland\./i
    

    And this to get every group that follows the pattern, taking ';' into account as well as '.':

    /current\..*?scotland[.;]/i
    

    The last one basically means: find any occurrence of 'current' and stop on reaching the first 'scotland' that is followed by a '.' or ';'
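As a rough sketch of how the two patterns above behave in Ruby, the code below applies them to the second sample paragraph from the question. It assumes String#[] is used for a single match and String#scan for every match; the answer does not say which methods it had in mind.

```ruby
# Second sample paragraph from the question.
text = "Past. time was five thirty at India; Current. time is six thirty at Scotland. " \
       "Past. time was five thirty at Scotland. Past. time was five ten at India;"

# The leading ^.*? is part of the match, so the result begins at the start of the string.
with_prefix = text[/^.*?current\..*?scotland\./i]

# Without the prefix, String#[] still returns only the first occurrence.
first_only = text[/current\..*?scotland\./i]

# scan collects every non-overlapping occurrence.
all_matches = text.scan(/current\..*?scotland[.;]/i)

puts with_prefix
puts first_only
p all_matches
```

Because String#[] and Regexp#match already stop at the first match, the ^.*? prefix is not needed to limit the result to one occurrence, and it pulls the text before 'current' into the match.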

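    【Comments】:

    • There is no need to escape . and ; inside a character class, so /current\..*?scotland[.;]/i is enough.
    • You're right. I was just being more verbose early in the morning :) Edited.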