[Posted]: 2018-06-20 23:55:42
[Question]:
I have a large string that represents XML. I am trying to extract the data of one node as follows:
String textToExtract = "<FnAnno>\r\n" +
" <PropDesc F_ANNOTATEDID=\"{60431964-0000-C411-9979-E6A21CEE873F}\" F_BACKCOLOR=\"0\" F_BORDER_BACKMODE=\"2\" F_BORDER_COLOR=\"0\" F_BORDER_STYLE=\"0\" F_BORDER_WIDTH=\"1\" F_CLASSID=\"{5CF11941-018F-11D0-A87A-00A0246922A5}\" F_CLASSNAME=\"Text\" F_CREATOR=\"req92333\" F_ENTRYDATE=\"2018-06-19T13:15:43.0000000-05:00\" F_FONT_BOLD=\"true\" F_FONT_ITALIC=\"false\" F_FONT_NAME=\"arial\" F_FONT_SIZE=\"12\" F_FONT_STRIKETHROUGH=\"false\" F_FONT_UNDERLINE=\"false\" F_FORECOLOR=\"0\" F_HASBORDER=\"true\" F_HEIGHT=\"0\" F_ID=\"{60431964-0000-C411-9979-E6A21CEE873F}\" F_LEFT=\"3.430379746835443\" F_MODIFYDATE=\"2018-06-19T13:15:49.0000000-05:00\" F_MULTIPAGETIFFPAGENUMBER=\"1\" F_NAME=\"-1-1\" F_PAGENUMBER=\"1\" F_TEXT_BACKMODE=\"2\" F_TOOLTIP=\"0043007200650061007400650064002000420079003A002000720065007100390032003300330033002C0020002000430072006500610074006500640020004F006E003A002000320030003100380020004A0075006E0065002000310039002C002000310033003A00310035003A00340033002C0020005500540043002D0035\" F_TOOLTIPTRANSFERENCODING=\"hex\" F_TOP=\"1.3291139240506329\" F_WIDTH=\"0\">\r\n" +
" <F_CUSTOM_BYTES/>\r\n" +
" <F_POINTS/>\r\n" +
" <F_TEXT Encoding=\"unicode\">005400680069007300200069007300200061002000740065007300740020000A00280041006200680069006c0061007300680020004d007500740068007500720061006a00200036002f00310039002f00320030003100380029</F_TEXT>\r\n" +
" </PropDesc>\r\n" +
"</FnAnno>";
String extractedString =textToExtract.substring(textToExtract.indexOf("=\"unicode\">"),textToExtract.indexOf("</F_TEXT>")).replaceFirst("=\"unicode\">", "");
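The result is 005400680069007300200069007300200061002000740065007300740020000A00280041006200680069006c0061007300680020004d007500740068007500720061006a00200036002f00310039002f00320030003100380029
To make this more efficient, I want to use Pattern and Matcher to extract the substring. Below is the code I am struggling with: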
Pattern pattern = Pattern.compile("\\bEncoding=.*?\\.*F_TEXT\\b");
Matcher matcher = pattern.matcher(textToExtract);
while (matcher.find()){
extractedString = (matcher.group());
}
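The result of the above is Encoding="unicode">005400680069007, which I would have to truncate again.
How can I get only the data between <F_TEXT Encoding=\"unicode\"> and </F_TEXT>? I had trouble with regular expressions in school, and I still do now at work :( I guess I need a lot of practice.
Thanks.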
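For the regex route, one possible approach is a capturing group, so that the match can span the tags while only the text between them is returned. The sketch below is not a definitive solution: it assumes the F_TEXT element appears once, carries exactly the Encoding="unicode" attribute, and contains no nested markup. The class name FTextRegex and the shortened sample string are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FTextRegex {
    // Group 1 captures (reluctantly) everything between the opening and closing F_TEXT tags.
    private static final Pattern F_TEXT = Pattern.compile(
            "<F_TEXT\\s+Encoding=\"unicode\">(.*?)</F_TEXT>", Pattern.DOTALL);

    /** Returns the text inside the first F_TEXT element, or null if there is no match. */
    static String extract(String xml) {
        Matcher matcher = F_TEXT.matcher(xml);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Shortened stand-in for the textToExtract string from the question.
        String xml = "<FnAnno>\r\n"
                + " <PropDesc F_NAME=\"-1-1\">\r\n"
                + "  <F_TEXT Encoding=\"unicode\">00540068</F_TEXT>\r\n"
                + " </PropDesc>\r\n"
                + "</FnAnno>";
        System.out.println(extract(xml));
    }
}
```

Calling matcher.group(1) instead of matcher.group() is what avoids the second truncation step: group() returns the whole match including the tag text, while group(1) returns only the parenthesized part.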
[Comments]:
-
Don't parse XML with a regular expression. Use an XML parser. To be "more efficient", use SAX. See: The Java™ Tutorials - Parsing an XML File Using SAX
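A minimal sketch of what the SAX suggestion could look like with the JDK's built-in parser, assuming the XML is held in a String as in the question and only the text of F_TEXT elements is wanted. The class name FTextSax and the shortened sample input are hypothetical; wrapping the checked exceptions in IllegalStateException is just one way to keep the example short.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class FTextSax {
    /** Returns the concatenated character data of all F_TEXT elements in the given XML. */
    static String extract(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inFText; // true while the parser is inside an F_TEXT element

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                if ("F_TEXT".equals(qName)) inFText = true;
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // May be called several times per element, so append rather than assign.
                if (inFText) text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("F_TEXT".equals(qName)) inFText = false;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Shortened stand-in for the textToExtract string from the question.
        String xml = "<FnAnno><PropDesc F_NAME=\"-1-1\"><F_POINTS/>"
                + "<F_TEXT Encoding=\"unicode\">00540068</F_TEXT></PropDesc></FnAnno>";
        System.out.println(extract(xml));
    }
}
```

Unlike the substring and regex approaches, this does not depend on attribute order, quoting style, or whitespace inside the opening tag.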
Tags: java regex string pattern-matching