【Question title】: Regex to "normalize" usage of SPACE after . , : chars (and some exceptions)
【Posted】: 2021-12-25 17:40:28
【Question】:

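I need to normalize some texts (product descriptions) regarding the correct usage of the . , : symbols (no space before and one space after).

The regex I came up with is this: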

$variation['DESCRIPTION'] = preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $variation['DESCRIPTION']);

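The problem is that this matches four cases it should not touch:

  • Any decimal number, e.g. 5.5
  • Any thousands separator, e.g. 4,500
  • The "fixed" Greek phrase ό,τι
  • The ellipsis ... - the ellipsis is basically a special case of its own, and perhaps it should be handled in a separate preg_replace. What I mean is that the three dots should be treated as a single unit, so some text ... should indeed be matched and converted to some text... and not to some text. . .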
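
To make the four cases concrete, here is a minimal sketch that runs the pattern above on a few short strings. The sample strings and the helper name `normalize_naive` are made up for illustration; only the pattern and the replacement come from the code above.

```php
<?php
// Hypothetical helper wrapping the original pattern from the question.
function normalize_naive(string $text): string
{
    return preg_replace('#\s*([:,.])\s*(?!<br />)#', '$1 ', $text);
}

// Made-up sample strings: one intended case, then the four problem cases.
$samples = [
    'time ,being',     // intended: becomes "time, being"
    'length 50.5 cm',  // decimal number gets split: "50. 5"
    'Value 4,500',     // thousands separator gets split: "4, 500"
    'ό,τι',            // fixed Greek phrase gets split: "ό, τι"
    'some text ...',   // each dot is handled on its own: "some text. . . "
];

foreach ($samples as $sample) {
    echo '[', $sample, '] => [', normalize_naive($sample), ']', PHP_EOL;
}
```

The square brackets in the output are only there to make the trailing space in the last case visible.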

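For the number exceptions in particular, I know this can be done with some negative lookahead/lookbehind, but unfortunately I was not able to combine them with my current pattern.

This is a fiddle for you to check (the cases that should not be matched are on lines 2, 3 and 4).

Edit: Both solutions posted below work fine, but they end up adding a space after the last full stop of the description. This is not a big deal, because earlier in my code I already strip <br />s and spaces from the beginning and end of the description, so I moved this preg_replace before that one...

So, the final code I ended up using is this: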

$variation['DESCRIPTION'] = preg_replace('#\s*([:,.])(?!(?<=\d.)\d)(?!(?<=ό,)τι)\s*#ui', '$1 ', $variation['DESCRIPTION']);
$variation['DESCRIPTION'] = preg_replace('#^\s*(<br />)*\s*|\s*(<br />)*\s*$#', '', $variation['DESCRIPTION']);
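
As a rough illustration of why the order of the two calls matters, the sketch below wraps them in one function and runs it on two made-up strings. The function name `clean_description` and the inputs are assumptions for the example; the two patterns are copied from the lines above.

```php
<?php
// Hypothetical wrapper around the two preg_replace() calls above.
function clean_description(string $text): string
{
    // Step 1: no space before and one space after . , :
    // (decimals, thousands separators and "ό,τι" are left alone).
    $text = preg_replace('#\s*([:,.])(?!(?<=\d.)\d)(?!(?<=ό,)τι)\s*#ui', '$1 ', $text);

    // Step 2: strip <br /> tags and whitespace at both ends, which also
    // removes the space that step 1 adds after a final full stop.
    return preg_replace('#^\s*(<br />)*\s*|\s*(<br />)*\s*$#', '', $text);
}

// Made-up inputs for illustration.
echo clean_description('<br />Side length 50.5 cm ,value 4,500 .Made of rubber.<br /> '), PHP_EOL;
// prints: Side length 50.5 cm, value 4,500. Made of rubber.

echo clean_description('To be continued ...'), PHP_EOL;
// prints: To be continued. . .
```

The second call shows the case that is still open: the three dots are split apart instead of being kept together.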

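So the only thing left is to change this code so that it handles the ellipsis the way I described above.

Any help with this last requirement would be very much appreciated! TIA

【Comments】:

    Tags: php regex regex-negation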


    【Solution 1】:

    You can add two lookaheads that contain lookbehinds:

    \s*(\.{2,}|[:,.](?!(?<=ό,)τι)(?!(?<=\d.)\d))(?!\s*<br\s*/>)\s*
    

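    See the regex demo. Note that I also added \s* to the last lookahead and swapped it with the consuming \s*, so that the match fails if a <br/> follows zero or more whitespace characters after : , or .

    Details

    • \s* - zero or more whitespace characters
    • (\.{2,}|[:,.]) - Group 1: two or more dots, or a single : , or . character
    • (?!(?<=ό,)τι) - fails the match if the next two characters are τι preceded by ό,
    • (?!(?<=\d.)\d) - fails the match if the next character is a digit that is preceded by a digit and any one character (note that . is enough here, because [:,.] has already matched the allowed/required character; we only need to "skip" over the matched character)
    • (?!\s*<br\s*/>) - a negative lookahead that fails the match if zero or more whitespace characters, <br, zero or more whitespace characters, and /> appear immediately to the right of the current position
    • \s* - zero or more whitespace characters.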
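
A minimal sketch of how this pattern might be plugged into preg_replace() is shown below. The `~` delimiters, the `u` flag, the helper name `normalize_with_lookarounds` and the sample strings are assumptions made for the example; only the pattern body and the '$1 ' replacement come from the thread.

```php
<?php
// Hypothetical helper; the pattern body is the one shown above.
function normalize_with_lookarounds(string $text): string
{
    $pattern = '~\s*(\.{2,}|[:,.](?!(?<=ό,)τι)(?!(?<=\d.)\d))(?!\s*<br\s*/>)\s*~u';
    return preg_replace($pattern, '$1 ', $text);
}

// Made-up sample strings covering the cases from the question.
$samples = [
    'Side length 50.5 cm',         // decimal: left unchanged
    'Value 4,500',                 // thousands separator: left unchanged
    'ό,τι',                        // fixed Greek phrase: left unchanged
    'all the time ,being unique',  // becomes "all the time, being unique"
    'some text ...more',           // becomes "some text... more"
    'Wool.<br />Next',             // punctuation right before <br />: left unchanged
];

foreach ($samples as $sample) {
    echo normalize_with_lookarounds($sample), PHP_EOL;
}
```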

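    【Comments】:

    【Solution 2】:

    If Wiktor's lookaround-heavy pattern is too hard for you to conceptualize/maintain/adapt, then perhaps a match-and-ignore technique will be easier for you. Admittedly, Wiktor's pattern is optimized for performance.

    Pattern: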

    ~                        #starting pattern delimiter 
    \s*                      #zero or more whitespaces
    (?:                      #start non-capturing group #1
      (?:                    #start non-capturing group #2
        \.\d+                #match float expression not requiring leading digits
        |                    #or
        \d{1,3}(?:,\d{3})+   #match number containing thousands separators
        |                    #or
        ό,τι                 #match literal greek phrase
        |                    #or
        <br\s*/>             #match html break tag
      )                      #end non-capturing group #2
      (*SKIP)(*FAIL)         #discard anything matched by group #2
      |                      #or
      (                      #start capture group #1
        \.{3}                #match three dots as ellipsis
        |                    #or
        [:,.]                #match literal colon, comma, or dot
      )                      #end capture group #1
    )                        #end non-capturing group #1
    \s*                      #zero or more whitespaces
    ~                        #ending pattern delimiter
    

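    When you want to extend the pattern with more disqualifying rules, just add another pipe followed by a subpattern that matches the unwanted substring.

    To make sure that three qualifying dots are matched as an ellipsis, match them before checking for the single characters.

    Code: (Demo)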
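
As a hedged sketch of that extension step, the function below adds one extra alternative to the skip list. The clock-time rule `\d{1,2}:\d{2}`, the helper name `normalize_with_skip` and the sample strings are made up for illustration and are not part of the original answer; the rest of the pattern is the one from this answer.

```php
<?php
// Pattern from this answer with one extra "skip" alternative appended.
// The \d{1,2}:\d{2} rule (clock times such as 9:30) is a made-up example.
function normalize_with_skip(string $text): string
{
    $pattern = '~\s*(?:(?:\.\d+|\d{1,3}(?:,\d{3})+|ό,τι|<br\s*/>|\d{1,2}:\d{2})(*SKIP)(*FAIL)|(\.{3}|[:,.]))\s*~';
    return preg_replace($pattern, '$1 ', $text);
}

// Made-up sample strings.
echo normalize_with_skip('Open 9:30 ,closed Sunday'), PHP_EOL; // Open 9:30, closed Sunday
echo normalize_with_skip('Value 4,500'), PHP_EOL;              // Value 4,500
echo normalize_with_skip('be ...more'), PHP_EOL;               // be... more
```

Without the added alternative, the colon in 9:30 would be treated like any other colon and a space would be inserted after it.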

    $text = <<<TEXT
    Composition:80% Polyamide,   15% Elastane, 5% Wool.
    Side length 50.5 cm <---- THIS SHOULDN'T BE MATCHED
    Value 4,500 <---- THIS SHOULDN'T BE MATCHED EITHER
    
    What about $1,234,567.89?
    Or....1mm one tenth of a millimeter?
    
    ό,τι<---- THIS IS A FIXED PHRASE IN GREEK AND THEREFORE SHOULDN'T BE MATCHED
    Comfort and timeless design characterize the Puma Smash V2 made of suede leather. They can be worn all the time ,being a unique choice for those who want to stand out .Made of rubber.<br />- Softfoam floor<br />- Binding with laces
    
    Specs:<br />&bull; Something<br /><br />&bull; Something else<br />&bull; One more
    
    Children's Form Champion<br /><br />Children's set that will give a comfortable feeling for endless hours of play.<br />It consists of a cardigan and trousers ,made of soft fabric and have rib cuffs and legs for a better fit.<br /><br />&bull; Normal fit<br /><br />&bull; Cardigan  :Rib cuffs, zippers throughout length, high neck, Champion logo <br /> <br />&bull; Pants: Elastic waist with drawstring, ribbed legs, Champion logo. Don't worry,there'll be ...more!
    TEXT;
    
    echo preg_replace(
             '~\s*(?:(?:\.\d+|\d{1,3}(?:,\d{3})+|ό,τι|<br\s*/>)(*SKIP)(*FAIL)|(\.{3}|[:,.]))\s*~',
             '$1 ',
             $text
         );
    

    Output:

    Composition: 80% Polyamide, 15% Elastane, 5% Wool. Side length 50.5 cm <---- THIS SHOULDN'T BE MATCHED
    Value 4,500 <---- THIS SHOULDN'T BE MATCHED EITHER
    
    What about $1,234,567.89?
    Or... .1mm one tenth of a millimeter?
    
    ό,τι<---- THIS IS A FIXED PHRASE IN GREEK AND THEREFORE SHOULDN'T BE MATCHED
    Comfort and timeless design characterize the Puma Smash V2 made of suede leather. They can be worn all the time, being a unique choice for those who want to stand out. Made of rubber. <br />- Softfoam floor<br />- Binding with laces
    
    Specs: <br />&bull; Something<br /><br />&bull; Something else<br />&bull; One more
    
    Children's Form Champion<br /><br />Children's set that will give a comfortable feeling for endless hours of play. <br />It consists of a cardigan and trousers, made of soft fabric and have rib cuffs and legs for a better fit. <br /><br />&bull; Normal fit<br /><br />&bull; Cardigan: Rib cuffs, zippers throughout length, high neck, Champion logo <br /> <br />&bull; Pants: Elastic waist with drawstring, ribbed legs, Champion logo. Don't worry, there'll be... more!
    

    【Comments】:
