Using Java Pattern and Matcher, how to get the first matching tag content: answers

[Question title]: Using java Pattern and Matcher, how to get first matching tag content
[Posted]: 2013-11-25 16:57:49
[Question description]:

I have content in a SoapMessage that looks like this:

<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
<soap:Header>
    <Action xmlns="http://www.w3.org/2005/08/addressing">http://service.xxx.dk/DialogModtag</Action>
    <MessageID xmlns="http://www.w3.org/2005/08/addressing">urn:uuid:382b4943-26e8-4698-a275-c3149d2d889e</MessageID>
    <To xmlns="http://www.w3.org/2005/08/addressing">http://xxx.dk/12345678</To>
    <RelatesTo xmlns="http://www.w3.org/2005/08/addressing">uuid:cb2320dc-c8ab-4880-94cb-2ab68129216f</RelatesTo>
</soap:Header>
<soap:Body xmlns:wsu="http://docs.oasis-open.org/wss/2004/01/oasis-200401-wss-wssecurity-utility-1.0.xsd" wsu:Id="id-2515">
    Some content ...
</soap:Body>

I am trying to extract the content of the Action tag inside the soap:Header tag using code like this:

Pattern PATTERN_SOAP_ACTION = 
    Pattern.compile(".*Header.*Action.*>(.*)<.*Action.*Header.*", Pattern.DOTALL);

String text = readFile("c:\\temp\\DialogUdenBilag.xml");
Matcher matcherSoapAction = PATTERN_SOAP_ACTION.matcher(text);
if (matcherSoapAction.matches()) { System.out.println(matcherSoapAction.group(1)); }
else { System.out.println("SoapAction not found"); }

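This seems to work fine for small SOAP messages. But when the soap:Body grows beyond 1 MB, the matches() call takes several minutes to complete.

Any ideas on how to make my regex pattern more CPU-friendly?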

[Comments]:

Possible answer: RegEx match open tags except XHTML self-contained tags

Tags: java regex cpu matcher

[Solution 1]:

Solution

For a more CPU-friendly solution, use an XML parser.

 import java.io.FileInputStream;
 import javax.xml.stream.XMLInputFactory;
 import javax.xml.stream.XMLStreamConstants;
 import javax.xml.stream.XMLStreamReader;

 XMLInputFactory factory = XMLInputFactory.newInstance();
 XMLStreamReader reader = factory.createXMLStreamReader(new FileInputStream("c:\\temp\\DialogUdenBilag.xml"));

 boolean found = false;
 boolean inHeader = false;
 String actionContent = "";

 // Pull parser events one at a time; stop as soon as the Action text is found
 while (!found && reader.hasNext()) {
    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
        String localName = reader.getLocalName();

        // Remember that the Header element has been entered
        if ("Header".equalsIgnoreCase(localName)) {
            inHeader = true;
        }

        if (inHeader && "Action".equalsIgnoreCase(localName)) {

            // Advance until the first text event or the end of the Action element
            int evt = reader.next();
            do {
               if (evt == XMLStreamConstants.CHARACTERS) {
                   actionContent = reader.getText().trim();
                   found = true;
                   break;
               }

               evt = reader.next();
            } while (evt != XMLStreamConstants.END_ELEMENT);

        }
    }
 }

 if (found) {
     System.out.println(actionContent);
 } else {
     System.out.println("SoapAction not found");
 }

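Discussion

This snippet is a bit long, but it gets the answer without reading the whole XML document. The loop stops as soon as it finds the Action tag inside the SOAP header, and the text content of that tag is the result.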
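For experimenting without a file on disk, the same idea can be wrapped in a small helper. This is only a sketch under a few assumptions: the class name `StaxActionExample` and the method `findAction` are made up for illustration, the XML is passed in as a String, `getElementText()` is used in place of the manual event loop, and the search gives up at the end of the header so the body is never scanned.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of the original answer.
class StaxActionExample {

    /** Returns the text of the first Action element inside Header, or null if there is none. */
    static String findAction(String xml) {
        try {
            XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            boolean inHeader = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("Header".equals(name)) {
                        inHeader = true;
                    } else if (inHeader && "Action".equals(name)) {
                        // Reads the text up to the matching end tag
                        return reader.getElementText().trim();
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "Header".equals(reader.getLocalName())) {
                    // Header is finished: stop here instead of reading the body
                    return null;
                }
            }
            return null;
        } catch (XMLStreamException e) {
            // Wrapped so callers in this sketch need no checked exception handling
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
          + "<soap:Header>"
          + "<Action xmlns=\"http://www.w3.org/2005/08/addressing\">"
          + "http://service.xxx.dk/DialogModtag</Action>"
          + "</soap:Header>"
          + "<soap:Body>Some content ...</soap:Body>"
          + "</soap:Envelope>";
        System.out.println(findAction(xml));
    }
}
```

Because the parser is pulled one event at a time, the cost depends on where the header ends, not on how large the body is.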

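[Comments]:

You are absolutely right! I will use your solution; it is faster even for small SOAP actions. Thank you very much for your help.

[Solution 2]:

Parsing XML with regular expressions is evil and may incur the Wrath of the One whose Name cannot be expressed in the Basic Multilingual Plane. If you need to parse XML, use an actual XML parser; that is what it is for. Cases like this are also what XPath expressions are for: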

javax.xml.xpath.XPath xpath = javax.xml.xpath.XPathFactory.newInstance().newXPath();
xpath.setNamespaceContext(new NamespaceContextMap(
    "s", "http://schemas.xmlsoap.org/soap/envelope/",
    "a", "http://www.w3.org/2005/08/addressing"));
javax.xml.xpath.XPathExpression expression = xpath.compile("//s:Header/a:Action");
String result = expression.evaluate(new org.xml.sax.InputSource(new FileReader("c:\\temp\\DialogUdenBilag.xml")));

(Note that NamespaceContextMap is not a standard class; for an implementation, see here.)
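Since NamespaceContextMap is not in the JDK, here is a minimal sketch of a substitute built on the standard `javax.xml.namespace.NamespaceContext` interface. The names `XPathActionExample`, `SimpleNamespaceContext` and `findAction` are hypothetical, and the sketch skips parts of the interface contract (for example the reserved `xml` and `xmlns` prefixes), so treat it as an illustration and not as the implementation the answer links to.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of the original answer.
class XPathActionExample {

    /** Hypothetical stand-in for NamespaceContextMap: maps prefixes to namespace URIs. */
    static final class SimpleNamespaceContext implements NamespaceContext {
        private final Map<String, String> prefixToUri;

        SimpleNamespaceContext(Map<String, String> prefixToUri) {
            this.prefixToUri = prefixToUri;
        }

        @Override
        public String getNamespaceURI(String prefix) {
            String uri = prefixToUri.get(prefix);
            return uri != null ? uri : XMLConstants.NULL_NS_URI;
        }

        @Override
        public String getPrefix(String namespaceURI) {
            for (Map.Entry<String, String> entry : prefixToUri.entrySet()) {
                if (entry.getValue().equals(namespaceURI)) {
                    return entry.getKey();
                }
            }
            return null;
        }

        @Override
        public Iterator<String> getPrefixes(String namespaceURI) {
            String prefix = getPrefix(namespaceURI);
            return prefix == null
                ? Collections.<String>emptyIterator()
                : Collections.singletonList(prefix).iterator();
        }
    }

    /** Returns the Action text from the SOAP header, or "" if the node is absent. */
    static String findAction(String xml) {
        Map<String, String> namespaces = new HashMap<>();
        namespaces.put("s", "http://schemas.xmlsoap.org/soap/envelope/");
        namespaces.put("a", "http://www.w3.org/2005/08/addressing");

        XPath xpath = XPathFactory.newInstance().newXPath();
        xpath.setNamespaceContext(new SimpleNamespaceContext(namespaces));
        try {
            return xpath.evaluate("//s:Header/a:Action",
                                  new InputSource(new StringReader(xml)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">"
          + "<soap:Header>"
          + "<Action xmlns=\"http://www.w3.org/2005/08/addressing\">"
          + "http://service.xxx.dk/DialogModtag</Action>"
          + "</soap:Header>"
          + "<soap:Body>Some content ...</soap:Body>"
          + "</soap:Envelope>";
        System.out.println(findAction(xml));
    }
}
```

The prefixes `s` and `a` only need to match the ones used in the XPath expression; they do not have to match the prefixes in the document.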

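As for your regex: it is written so that it needlessly matches the entire input string, and it does a lot of greedy (maximal) matching instead of reluctant (minimal) matching. If you use an expression that focuses more tightly on the relevant part of the document (for example, "<((?:\\w+:)?)?Header\\b[^>]*>.*?<((?:\\w+:)?)Action\\b[^>]*>(.*?)</\\2Action>.*?</\\1Header>") and call Matcher.find() to do a substring match, you will use far less CPU. That said, parsing XML with regular expressions is bad practice; you really should use an XML parser!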
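To make the regex suggestion concrete, here is a sketch of that pattern used with `find()`. The class name `RegexFindExample` and the method `findAction` are hypothetical, and `Pattern.DOTALL` is an assumption added here so that `.*?` can cross the line breaks in a multi-line message; the answer itself does not say which flags to use.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class, not part of the original answer.
class RegexFindExample {

    // Groups 1 and 2 capture the optional namespace prefixes so that the closing
    // tags can be matched with backreferences; group 3 is the Action text.
    // DOTALL is an assumption of this sketch (see the note above).
    static final Pattern ACTION_IN_HEADER = Pattern.compile(
        "<((?:\\w+:)?)?Header\\b[^>]*>.*?<((?:\\w+:)?)Action\\b[^>]*>(.*?)</\\2Action>.*?</\\1Header>",
        Pattern.DOTALL);

    /** Returns the Action text from the header, or null if the pattern does not match. */
    static String findAction(String text) {
        Matcher matcher = ACTION_IN_HEADER.matcher(text);
        // find() looks for a matching substring; matches() would require the
        // pattern to cover the entire input
        return matcher.find() ? matcher.group(3).trim() : null;
    }

    public static void main(String[] args) {
        String text =
            "<soap:Envelope xmlns:soap=\"http://schemas.xmlsoap.org/soap/envelope/\">\n"
          + "<soap:Header>\n"
          + "    <Action xmlns=\"http://www.w3.org/2005/08/addressing\">"
          + "http://service.xxx.dk/DialogModtag</Action>\n"
          + "</soap:Header>\n"
          + "<soap:Body>Some content ...</soap:Body>\n"
          + "</soap:Envelope>";
        System.out.println(findAction(text));
    }
}
```

Because the match ends at the closing Header tag, a successful `find()` never has to look at the body. When the header contains no Action element the reluctant quantifiers can still scan far into the input, which is one more reason to prefer a parser.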

[Comments]: