Regex to find words with letters and numbers separated or not by symbols (answers)

【Question title】: Regex to find words with letters and numbers separated or not by symbols
【Posted】: 2011-04-21 09:03:35
【Question】:

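I need to build a regex that matches words with these patterns:

Letters and numbers:

A35, 35A, B503X, 1ABC5

Letters and numbers separated by "-", "/", "\":

AB-10, 10-AB, A10-BA, BA-A10, etc...

I wrote this regex for it: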

\b[A-Za-z]+(?=[(?<!\-|\\|\/)\d]+)[(?<!\-|\\|\/)\w]+\b|\b[0-9]+(?=[(?<!\-|\\|\/)A-Za-z]+)[(?<!\-|\\|\/)\w]+\b

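It partly works, but it also matches words that contain only letters or only digits separated by symbols. Examples:

10-10, open-office, etc.

And I don't want those to match.

I suspect my regex is very repetitive and somewhat ugly, but it is what I have right now.

Can anyone help me?

I am using java/groovy.

Thanks in advance.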

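【Question comments】:

For the future, you can play around with this tool; it has been a lifesaver for me: regexpal.com
What is the difference between what you do and do not want to match? Must every group contain both letters and digits?
In this string: "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR", the regex needs to match these words: "10B, A10, UCS5000, DV-3000, 300-BR". Letters and digits in the same word, separated OR NOT by the symbols -, /, \.
What about leading and/or trailing symbols, e.g. -x4, 4x-, 4-x-, -4-x or -4-x-?
@fethz Please answer user unknown's question; we really need that answer to work out a correct solution.

Tags: java regex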

【Solution 1】:

Interesting challenge. Here is a Java program with a regex that picks out the kind of "words" you are after:

import java.util.regex.*;
public class TEST {
    public static void main(String[] args) {
        String s = "A35, 35A, B503X, 1ABC5 " +
            "AB-10, 10-AB, A10-BA, BA-A10, etc... " +
            "10-10, open-office, etc.";
        Pattern regex = Pattern.compile(
            "# Match special word having one letter and one digit (min).\n" +
            "\\b                       # Start on a word boundary; the word has\n" +
            "(?=[-/\\\\A-Za-z]*[0-9])  # at least one number and\n" +
            "(?=[-/\\\\0-9]*[A-Za-z])  # at least one letter.\n" +
            "[A-Za-z0-9]+              # Match first part of word.\n" +
            "(?:                       # Optional extra word parts\n" +
            "  [-/\\\\]                # separated by -, / or backslash\n" +
            "  [A-Za-z0-9]+            # Match extra word part.\n" +
            ")*                        # Zero or more extra word parts.\n" +
            "\\b                       # End on a word boundary",
            Pattern.COMMENTS);
        Matcher regexMatcher = regex.matcher(s);
        while (regexMatcher.find()) {
            System.out.print(regexMatcher.group() + ", ");
        } 
    }
}

Here is the correct output:

A35, 35A, B503X, 1ABC5, AB-10, 10-AB, A10-BA, BA-A10,

Note that the only "ugly" complex regexes are the ones that are not properly formatted and commented!
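
For anyone who wants the same pattern without Pattern.COMMENTS, here is a minimal sketch that collapses it onto one line and runs it against the test string given in the question comments. The class name CompactWordRegex and the helper findWords are illustrative names, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CompactWordRegex {
    // Same pattern as the commented version, written on one line.
    private static final Pattern WORD = Pattern.compile(
        "\\b(?=[-/\\\\A-Za-z]*[0-9])(?=[-/\\\\0-9]*[A-Za-z])"
        + "[A-Za-z0-9]+(?:[-/\\\\][A-Za-z0-9]+)*\\b");

    // Hypothetical helper: collects every match into a list.
    public static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findWords(
            "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR"));
    }
}
```

On that input it should return the five words the asker listed (10B, A10, UCS5000, DV-3000, 300-BR) and skip 10, 10-10, open-office and code.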

【Comments】:

Awesome! This is exactly what I needed! Thanks, ridgerunner!

【Solution 2】:

Just use this:

([a-zA-Z]+[-\/\\]?[0-9]+|[0-9]+[-\/\\]?[a-zA-Z]+)

In Java, the \\ and \/ have to be escaped again inside the string literal:

([a-zA-Z]+[-\\\/\\\\]?[0-9]+|[0-9]+[-\\\/\\\\]?[a-zA-Z]+)

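【Comments】:

This regex will also match words that have only letters or only digits.
This almost works. If I have a case like DV5-500, this regex matches only DV5. I will edit my question to make the possibilities clearer.
You don't need to escape the forward slash.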
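
A small sketch to make the last two comments concrete: in a Java string literal only the backslash needs doubling, the forward slash can be written as-is, and the pattern stops after DV5 when given DV5-500. The class name SeparatorDemo and the helper findAll are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SeparatorDemo {
    // The regex from this answer as a Java string literal: "\\\\" is one
    // literal backslash in the character class, "/" needs no escaping.
    private static final Pattern WORD = Pattern.compile(
        "([a-zA-Z]+[-/\\\\]?[0-9]+|[0-9]+[-/\\\\]?[a-zA-Z]+)");

    // Hypothetical helper: collects every match into a list.
    public static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Only one letters/digits pair is allowed, so "-500" is left behind.
        System.out.println(findAll("DV5-500"));
        System.out.println(findAll(
            "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR"));
    }
}
```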

【Solution 3】:

Please excuse me for writing my solution in Python; I don't know Java well enough to write it in Java.

pat = re.compile(r'(?=(?:([A-Z])|[0-9])'  ## This part verifies that
                 r'[^ ]*'                 ## there is at least one
                 r'(?(1)\d|[A-Z]))'       ## letter and one digit.
                 r'('
                 r'(?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])'  # start of second group
                 r'[A-Z0-9-/\\]*'
                 r'[A-Z0-9](?= |\Z|,)'                # end of second group
                 r')',
                 re.IGNORECASE)  # group 2 captures the string

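My solution captures the desired string in the second group: ((?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])[A-Z0-9-/\\]*[A-Z0-9](?= |\Z|,))

The part before it verifies that the captured string contains at least one letter and one digit:

(?(1)\d|[A-Z]) is a conditional regex meaning "if group(1) caught something, then there must be a digit here, otherwise there must be a letter".

Group (1) is the ([A-Z]) in (?=(?:([A-Z])|[0-9])

(?:([A-Z])|[0-9]) is a non-capturing group that matches either a letter (captured) or a digit, so when it matches a letter, group(1) is not empty.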

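The re.IGNORECASE flag lets the pattern handle strings with upper-case or lower-case letters.

In the second group I had to write (?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9]) because lookbehind assertions of non-fixed length are not allowed. This part matches one character, which cannot be '-', and which must either follow a space or comma or sit at the start of the string.

Conversely, (?= |\Z|,) means "followed by a space, the end of the string, or a comma".

This regex assumes that the characters '-', '/' and '\' cannot be the first or last character of a captured string. Is that right?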

import re

pat = re.compile(r'(?=(?:([A-Z])|[0-9])'  ## (from here)  This part verifies that
                 r'[^ ]*'                  #              there is at least one
                 r'(?(1)\d|[A-Z]))'       ## (to here)    letter and one digit.
                 r'((?:(?<=[ ,])[A-Z0-9]|\A[A-Z0-9])'
                 r'[A-Z0-9-/\\]*'
                 r'[A-Z0-9](?= |\Z|,))',
                 re.IGNORECASE)  # group 2 captures the string

ch = "ALPHA13 10 ZZ 10-10 U-R open-office ,10B a10 UCS5000 -TR54 code vg4- DV-3000 SEA 300-BR  gt4/ui bn\\3K"

print([mat.group(2) for mat in pat.finditer(ch)])

s = ("A35, 35A, B503X,1ABC5 "
     "AB-10, 10-AB, A10-BA, BA-A10, etc... "
     "10-10, open-office, etc.")

print([mat.group(2) for mat in pat.finditer(s)])

Result

['ALPHA13', '10B', 'a10', 'UCS5000', 'DV-3000', '300-BR', 'gt4/ui', 'bn\\3K']
['A35', '35A', 'B503X', '1ABC5', 'AB-10', '10-AB', 'A10-BA', 'BA-A10']
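
Since the question is about Java: java.util.regex has no conditional construct such as (?(1)...), so the pattern above cannot be pasted into Java unchanged. Below is a rough sketch of the same idea, not an exact port. It assumes two plain lookaheads are an acceptable substitute for the conditional and that words are delimited only by spaces and commas. The class name DelimitedWordFinder and the helper find are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DelimitedWordFinder {
    private static final Pattern WORD = Pattern.compile(
        "(?<![^ ,])"               // preceded by space, comma, or start of input
        + "(?=[^ ,]*[A-Za-z])"     // the token holds at least one letter
        + "(?=[^ ,]*[0-9])"        // and at least one digit
        + "[A-Za-z0-9]"            // first character: letter or digit
        + "(?:[A-Za-z0-9/\\\\-]*[A-Za-z0-9])?" // middle may hold - / \, last may not
        + "(?![^ ,])");            // followed by space, comma, or end of input

    // Hypothetical helper: collects every match into a list.
    public static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(find("ALPHA13 10 ZZ 10-10 U-R open-office ,10B a10 "
            + "UCS5000 -TR54 code vg4- DV-3000 SEA 300-BR  gt4/ui bn\\3K"));
    }
}
```

Like the Python version, this rejects tokens that start or end with a symbol (-TR54, vg4-).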

【Comments】:

【Solution 4】:

My first pass:

(^|\s)(?!\d+[-/\\]?\d+(\s|$))(?![A-Z]+[-/\\]?[A-Z]+(\s|$))([A-Z0-9]+[-/\\]?[A-Z0-9]+)(\s|$)

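Sorry, it is not in Java format (you will need to escape the \ in \s and so on). Also, you can't use \b, because a word boundary is anything that is not alphanumeric or an underscore, so I used \s and the start and end of the string instead.

This is still a bit raw.

Edit

Version 2 is slightly better, but its performance could be improved by using possessive quantifiers. It matches ABC76, AB-32, 3434-F, etc., but not ABC or 19\23, etc.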

((?<=^)|(?<=\s))(?!\d+[-/\\]?\d+(\s|$))(?![A-Z]+[-/\\]?[A-Z]+(\s|$))([A-Z0-9]+[-/\\]?[A-Z0-9]+)((?=$)|(?=\s))
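
As a sketch of what the Java form of version 2 could look like, here is the same pattern with every backslash doubled for the string literal. The Pattern.CASE_INSENSITIVE flag is an assumption added here so that lower-case words such as open-office are rejected by the [A-Z] lookahead. The class name WhitespaceBoundedWords and the helper find are illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhitespaceBoundedWords {
    // Version 2 from above, escaped for a Java string literal.
    // CASE_INSENSITIVE is an assumption, not part of the original answer.
    private static final Pattern WORD = Pattern.compile(
        "((?<=^)|(?<=\\s))"
        + "(?!\\d+[-/\\\\]?\\d+(\\s|$))"        // reject digits-only words
        + "(?![A-Z]+[-/\\\\]?[A-Z]+(\\s|$))"    // reject letters-only words
        + "([A-Z0-9]+[-/\\\\]?[A-Z0-9]+)"
        + "((?=$)|(?=\\s))",
        Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: the lookarounds are zero-width, so group()
    // is the word itself.
    public static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(find("ABC76 AB-32 3434-F ABC 19\\23"));
    }
}
```

Note that the pattern allows at most one separator per word and expects whitespace around each word, so a word followed directly by a comma would not be matched.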

【Comments】:

【Solution 5】:

A condition of the form (A OR NOT A) can be dropped, so the symbols can simply be ignored.

for (String word : "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR".split (" ")) {
    if (word.matches ("(.*[A-Za-z].*[0-9])|(.*[0-9].*[A-Za-z].*)")) {
        System.out.println (word);  // do something with the matching word
    }
}

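You didn't mention -x4, 4x-, 4-x-, -4-x or -4-x-; I expect them all to match.

My expression only looks for something-alpha-something-digits-something, where "something" may be alpha, digits or symbols, and for the reverse: something-digits-something-alpha-something. If other characters can occur, such as !#$~()[]{} and so on, it gets longer.

Tested with Scala: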

scala> for (word <- "10 10-10 open-office 10B A10 UCS5000 code DV-3000 300-BR".split (" ")
     | if word.matches ("(.*[A-Za-z].*[0-9])|(.*[0-9].*[A-Za-z].*)")) yield word          
res89: Array[java.lang.String] = Array(10B, A10, UCS5000, DV-3000, 300-BR)

Slightly modified to extract the matches:

String s = "A35, 35A, B53X, 1AC5, AB-10, 10-AB, A10-BA, BA-A10, etc. -4x, 4x- -4-x- 10-10, oe-oe, etc";
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile ("\\b([^ ,]*[A-Za-z][^ ,]*[0-9])[^ ,]*|([^ ,]*[0-9][^ ,]*[A-Za-z][^ ,]*)\\b");
java.util.regex.Matcher matcher = pattern.matcher (s);
while (matcher.find ()) { System.out.print (matcher.group () + "|"); }

But I still have a bug that I haven't tracked down:

A35|35A|B53X|1AC5|AB-10|10-AB|A10-BA|BA-A10|-4x|4x|-4-x|

4x should be 4x-, and -4-x should be -4-x-.
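
The truncation appears to come from \b: a trailing '-' followed by a space is not a word boundary, so the engine backtracks to the last letter or digit. Also, as the pattern is written, the leading \b belongs only to the first alternative and the trailing \b only to the second. Here is a minimal sketch, assuming words are delimited by spaces and commas only, that swaps \b for lookarounds on those delimiters. The class name TokenWordFinder and the helper find are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TokenWordFinder {
    private static final Pattern WORD = Pattern.compile(
        "(?<![^ ,])"             // token start: nothing but space/comma before
        + "(?=[^ ,]*[A-Za-z])"   // at least one letter somewhere in the token
        + "(?=[^ ,]*[0-9])"      // at least one digit somewhere in the token
        + "[^ ,]+"               // the whole token, symbols included
        + "(?![^ ,])");          // token end: nothing but space/comma after

    // Hypothetical helper: collects every match into a list.
    public static List<String> find(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String s = "A35, 35A, B53X, 1AC5, AB-10, 10-AB, A10-BA, BA-A10, "
            + "etc. -4x, 4x- -4-x- 10-10, oe-oe, etc";
        for (String word : find(s)) {
            System.out.print(word + "|");
        }
    }
}
```

With this variant the trailing symbols are kept, so 4x- and -4-x- come back whole.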

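【Comments】:

This is an interesting solution, but I can't split the original string on spaces (that is a rule of the solution I am developing). Thanks!
The split was only there to test the example. You asked about matching words, not about extracting matching words.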