Substring extraction between XML tags in a Java string: answers

【Question title】: Substring Extract between an xml string java
【Posted】: 2018-06-20 23:55:42
【Question】:

I have a large string that represents XML. I am trying to extract the data of one node as follows:

        String textToExtract = "<FnAnno>\r\n" + 
                "   <PropDesc F_ANNOTATEDID=\"{60431964-0000-C411-9979-E6A21CEE873F}\" F_BACKCOLOR=\"0\" F_BORDER_BACKMODE=\"2\" F_BORDER_COLOR=\"0\" F_BORDER_STYLE=\"0\" F_BORDER_WIDTH=\"1\" F_CLASSID=\"{5CF11941-018F-11D0-A87A-00A0246922A5}\" F_CLASSNAME=\"Text\" F_CREATOR=\"req92333\" F_ENTRYDATE=\"2018-06-19T13:15:43.0000000-05:00\" F_FONT_BOLD=\"true\" F_FONT_ITALIC=\"false\" F_FONT_NAME=\"arial\" F_FONT_SIZE=\"12\" F_FONT_STRIKETHROUGH=\"false\" F_FONT_UNDERLINE=\"false\" F_FORECOLOR=\"0\" F_HASBORDER=\"true\" F_HEIGHT=\"0\" F_ID=\"{60431964-0000-C411-9979-E6A21CEE873F}\" F_LEFT=\"3.430379746835443\" F_MODIFYDATE=\"2018-06-19T13:15:49.0000000-05:00\" F_MULTIPAGETIFFPAGENUMBER=\"1\" F_NAME=\"-1-1\" F_PAGENUMBER=\"1\" F_TEXT_BACKMODE=\"2\" F_TOOLTIP=\"0043007200650061007400650064002000420079003A002000720065007100390032003300330033002C0020002000430072006500610074006500640020004F006E003A002000320030003100380020004A0075006E0065002000310039002C002000310033003A00310035003A00340033002C0020005500540043002D0035\" F_TOOLTIPTRANSFERENCODING=\"hex\" F_TOP=\"1.3291139240506329\" F_WIDTH=\"0\">\r\n" + 
                "       <F_CUSTOM_BYTES/>\r\n" + 
                "       <F_POINTS/>\r\n" + 
                "       <F_TEXT Encoding=\"unicode\">005400680069007300200069007300200061002000740065007300740020000A00280041006200680069006c0061007300680020004d007500740068007500720061006a00200036002f00310039002f00320030003100380029</F_TEXT>\r\n" + 
                "   </PropDesc>\r\n" + 
                "</FnAnno>";
String      extractedString =textToExtract.substring(textToExtract.indexOf("=\"unicode\">"),textToExtract.indexOf("</F_TEXT>")).replaceFirst("=\"unicode\">", "");

The result is 005400680069007300200069007300200061002000740065007300740020000A00280041006200680069006c0061007300680020004d007500740068007500720061006a00200036002f00310039002f00320030003100380029

To make this more efficient, I would like to extract the substring with Pattern and Matcher. Here is the code I am struggling with:

    Pattern pattern = Pattern.compile("\\bEncoding=.*?\\.*F_TEXT\\b");
    Matcher matcher = pattern.matcher(textToExtract);
    while (matcher.find()){
        extractedString = (matcher.group());
    }

The result of the above is Encoding="unicode">005400680069007 which I then have to truncate again.

How do I get only the data between <F_TEXT Encoding=\"unicode\"> and </F_TEXT>? I had trouble with regular expressions in school and still do now at work :( I guess I need a lot of practice.

Thanks.

【Comments】:

Don't parse XML with regex. Use an XML parser. To "improve efficiency", use SAX. See: The Java™ Tutorials - Parsing an XML File Using SAX

Tags: java regex string pattern-matching

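【Solution 1】:

If you are always going to retrieve the data between the same XML tags, you don't need to worry about parsing it into a data structure. You have the right idea. If speed is what you are after, just grab the string between the tags you know will be there.

Your approach wastes some cycles, though.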

textToExtract.substring(textToExtract.indexOf("=\"unicode\">"),textToExtract.indexOf("</F_TEXT>")).replaceFirst("=\"unicode\">", "");

Let's break it down:

// loops through the array until "=\"unicode\">" is found
int startIndex = textToExtract.indexOf("=\"unicode\">");
// loops through the array again, until "</F_TEXT>" is found
int endIndex = textToExtract.indexOf("</F_TEXT>");
//loop through the array, copying the bytes to a new array to form a new String
String substr = textToExtract.substring(startIndex,endIndex);
//loop through the array to find and replace "=\"unicode\">" with nothing
String data = substr.replaceFirst("=\"unicode\">", "");

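You are looping over the same array many times.

Once you know where the start point is, there is no need to search from the beginning again. Search onward from that start point instead. Then, once you have both the start and the end of your substring, you can simply take it.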

// we know what precedes the substring we want
String anchor = "<F_TEXT Encoding=\"unicode\">";
// so we use it to get the start point, looping once, up to that point
int start = textToExtract.indexOf(anchor)+anchor.length();
// we know the end point won't be before the start point, so start where it left off
int end = start;
// count each character from that point until the next XML tag starts
while (textToExtract.charAt(end) != '<') { end++; }
// now we have what we need to simply get the substring
String data = textToExtract.substring(start,end);
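
The hand-written loop above assumes that the anchor is present and that a '<' follows it; if either assumption fails it reads past the data or throws. Below is a minimal sketch of the same "search onward from the start point" idea using `String.indexOf(String, int)`, which resumes the search at a given index, and returning null when a marker is missing. The class and method names (`TagText`, `between`) are made up for illustration and are not part of the original answer.

```java
final class TagText {
    private TagText() {}

    /**
     * Returns the text between the first occurrence of open and the next
     * occurrence of close after it, or null if either marker is missing.
     */
    static String between(String text, String open, String close) {
        int anchorAt = text.indexOf(open);
        if (anchorAt < 0) {
            return null; // opening marker not found
        }
        int start = anchorAt + open.length();
        // Resume searching at start instead of rescanning from index 0.
        int end = text.indexOf(close, start);
        if (end < 0) {
            return null; // closing marker not found
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String xml = "<FnAnno><PropDesc>"
                + "<F_TEXT Encoding=\"unicode\">00540068</F_TEXT>"
                + "</PropDesc></FnAnno>";
        // prints 00540068
        System.out.println(between(xml, "<F_TEXT Encoding=\"unicode\">", "</F_TEXT>"));
    }
}
```

Searching for the full closing tag rather than the next '<' is a design choice: it costs a slightly longer comparison but does not depend on the element having no child markup.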

This improves performance by roughly 60%.
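
A figure like this depends on the JVM, the input, and how the timing is done, so it is worth measuring on your own data. The following is a rough sketch of such a check using `System.nanoTime()`; the class name `ExtractTiming`, the method names, the shortened sample XML, and the round count are all assumptions made for illustration. It is not a rigorous benchmark (a harness such as JMH would be the more careful choice), and it prints whatever your machine measures rather than any fixed number.

```java
final class ExtractTiming {
    private static final String ANCHOR = "<F_TEXT Encoding=\"unicode\">";

    // The question's approach: two searches from index 0, a copy,
    // then replaceFirst, which treats its first argument as a regex.
    static String chained(String text) {
        return text.substring(text.indexOf("=\"unicode\">"), text.indexOf("</F_TEXT>"))
                   .replaceFirst("=\"unicode\">", "");
    }

    // The answer's approach: one search for the anchor, then scan to the next '<'.
    static String singlePass(String text) {
        int start = text.indexOf(ANCHOR) + ANCHOR.length();
        int end = start;
        while (text.charAt(end) != '<') {
            end++;
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String xml = "<FnAnno>\r\n   <PropDesc F_NAME=\"-1-1\">\r\n"
                + "       <F_TEXT Encoding=\"unicode\">0054006800690073</F_TEXT>\r\n"
                + "   </PropDesc>\r\n</FnAnno>";
        int rounds = 200_000;
        long sink = 0; // accumulate results so the calls are not optimized away

        // Warm-up so both methods are JIT-compiled before timing starts.
        for (int i = 0; i < rounds; i++) {
            sink += chained(xml).length() + singlePass(xml).length();
        }

        long t0 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += chained(xml).length();
        }
        long t1 = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            sink += singlePass(xml).length();
        }
        long t2 = System.nanoTime();

        System.out.println("chained:     " + (t1 - t0) / rounds + " ns/call");
        System.out.println("single pass: " + (t2 - t1) / rounds + " ns/call");
        System.out.println("(checksum " + sink + ")");
    }
}
```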

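Edit: for completeness, let's address the regex.

Regex is great, and fun in scripts, but it is inefficient for something like this. If you can avoid regex, do so. I tend to use it only for "quick and dirty" jobs: quick in terms of coding time, not execution time. Read up on how regex engines work. It is really interesting, and you will see why it should be a last resort.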

    /* this pattern will look for the opening XML tag.
    ** then, it will match [^<]+
    ** [...] will match a single character that matches SOMETHING inside the "character class."
    ** [^...] will match a single character that is NOT something inside the character class.
    ** [^<]+ will match as many characters as it can that are not '<', so it stops where the next tag begins.
    ** putting this expression inside parentheses tells the engine we want to capture it to be referenced later.
    ** '<' at the end just ensures we capture up until that point.
    */
    // create the pattern
    Pattern pattern = Pattern.compile("<F_TEXT Encoding=\"unicode\">([^<]+)<");
    // get a matcher for it
    Matcher matcher = pattern.matcher(textToExtract);
    // if we find a match
    if (matcher.find()) {
        // we can use group(1) to refer to our first capture group
        // group(0) will always return the full string matched, but we don't want the tags.
        String data= matcher.group(1);

    }
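
For reference, here is a self-contained sketch of the same capture-group idea with the imports it needs and the pattern compiled once into a constant, so repeated calls do not pay the compilation cost each time. The class name `FTextRegex` and the method `extract` are illustrative only. This variant captures with `[^<]*`, so the match stops at the first '<' without backtracking, and it returns null when the tag is absent.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FTextRegex {
    // Compiled once and reused; Pattern instances are immutable and thread-safe.
    private static final Pattern F_TEXT =
            Pattern.compile("<F_TEXT Encoding=\"unicode\">([^<]*)</F_TEXT>");

    private FTextRegex() {}

    /** Returns the element text captured by group 1, or null if nothing matches. */
    static String extract(String xml) {
        Matcher m = F_TEXT.matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String xml = "<PropDesc>\r\n"
                + "   <F_TEXT Encoding=\"unicode\">00540068</F_TEXT>\r\n"
                + "</PropDesc>";
        // prints 00540068
        System.out.println(extract(xml));
    }
}
```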

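【Comments】:

Breaking it down with substring is very creative, and I can see regex being used for smaller inputs. Cheers!
After running the nano test provided by Andreas, comparing string matching with SAX parsing, here are the results. I ran the test 10 times: String Match - getAnnotIdFromF_TEXT - Time to convert: 331956; SAX parser - getAnnotIdFromF_TEXT_XML - Time to convert: 23137415

【Solution 2】:

Don't use regex to parse XML. Use an XML parser.

To "improve efficiency", use SAX, e.g. like this: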

String textToExtract = "<FnAnno>\r\n" + 
                       "   <PropDesc F_ANNOTATEDID=\"{60431964-0000-C411-9979-E6A21CEE873F}\" F_BACKCOLOR=\"0\" F_BORDER_BACKMODE=\"2\" F_BORDER_COLOR=\"0\" F_BORDER_STYLE=\"0\" F_BORDER_WIDTH=\"1\" F_CLASSID=\"{5CF11941-018F-11D0-A87A-00A0246922A5}\" F_CLASSNAME=\"Text\" F_CREATOR=\"req92333\" F_ENTRYDATE=\"2018-06-19T13:15:43.0000000-05:00\" F_FONT_BOLD=\"true\" F_FONT_ITALIC=\"false\" F_FONT_NAME=\"arial\" F_FONT_SIZE=\"12\" F_FONT_STRIKETHROUGH=\"false\" F_FONT_UNDERLINE=\"false\" F_FORECOLOR=\"0\" F_HASBORDER=\"true\" F_HEIGHT=\"0\" F_ID=\"{60431964-0000-C411-9979-E6A21CEE873F}\" F_LEFT=\"3.430379746835443\" F_MODIFYDATE=\"2018-06-19T13:15:49.0000000-05:00\" F_MULTIPAGETIFFPAGENUMBER=\"1\" F_NAME=\"-1-1\" F_PAGENUMBER=\"1\" F_TEXT_BACKMODE=\"2\" F_TOOLTIP=\"0043007200650061007400650064002000420079003A002000720065007100390032003300330033002C0020002000430072006500610074006500640020004F006E003A002000320030003100380020004A0075006E0065002000310039002C002000310033003A00310035003A00340033002C0020005500540043002D0035\" F_TOOLTIPTRANSFERENCODING=\"hex\" F_TOP=\"1.3291139240506329\" F_WIDTH=\"0\">\r\n" + 
                       "       <F_CUSTOM_BYTES/>\r\n" + 
                       "       <F_POINTS/>\r\n" + 
                       "       <F_TEXT Encoding=\"unicode\">005400680069007300200069007300200061002000740065007300740020000A00280041006200680069006c0061007300680020004d007500740068007500720061006a00200036002f00310039002f00320030003100380029</F_TEXT>\r\n" + 
                       "   </PropDesc>\r\n" + 
                       "</FnAnno>";

StringBuilder buf = new StringBuilder();

SAXParserFactory factory = SAXParserFactory.newInstance();
SAXParser parser = factory.newSAXParser();
parser.parse(new InputSource(new StringReader(textToExtract)), new DefaultHandler() {
    private boolean captureText;
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        this.captureText = qName.equals("F_TEXT");
    }
    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        this.captureText = false;
    }
    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (this.captureText)
            buf.append(ch, start, length);
    }
});

System.out.println(buf.toString());

Output

005400680069007300200069007300200061002000740065007300740020000A00280041006200680069006c0061007300680020004d007500740068007500720061006a00200036002f00310039002f00320030003100380029
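
The snippet above omits its imports and the checked exceptions thrown by the parser. A minimal compilable sketch of the same handler, wrapped in a helper method, might look like this; the class name `FTextSax`, the method `extract`, and the choice to rethrow failures as `IllegalStateException` are assumptions made for illustration, not part of the original answer.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class FTextSax {
    private FTextSax() {}

    /** Returns the character data found inside F_TEXT elements. */
    static String extract(String xml) {
        StringBuilder buf = new StringBuilder();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                private boolean captureText;

                @Override
                public void startElement(String uri, String localName, String qName, Attributes attributes) {
                    // Only capture character data while inside an F_TEXT element.
                    captureText = qName.equals("F_TEXT");
                }

                @Override
                public void endElement(String uri, String localName, String qName) {
                    captureText = false;
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    // May be called several times for one element, hence the buffer.
                    if (captureText) {
                        buf.append(ch, start, length);
                    }
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return buf.toString();
    }

    public static void main(String[] args) {
        String xml = "<FnAnno>\r\n   <PropDesc F_NAME=\"-1-1\">\r\n"
                + "       <F_POINTS/>\r\n"
                + "       <F_TEXT Encoding=\"unicode\">00540068</F_TEXT>\r\n"
                + "   </PropDesc>\r\n</FnAnno>";
        // prints 00540068
        System.out.println(extract(xml));
    }
}
```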

【Comments】:

I can't wait until we get raw string literals, so that stuff like this will look nicer :)