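【Posted】: 2017-11-21 21:12:01
【Question】:
I need to match plain text against HTML content, and once a match is found I need to extract the matched HTML content (without changing it, because I need exactly the same HTML). I can do the matching with Java's regex utilities in many scenarios, but it fails in the scenarios below.
Below is the sample code I use to match the text against the HTML string: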
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static void main(String[] args) {
        String text = "A crusader for the rights of the weaker sections of the Association's (ADA's),choice as the presidential candidate is being seen as a political masterstroke.";
        // Every space becomes ".*". The parentheses are not escaped, so the regex
        // engine reads them as a capturing group, and the trailing ".*" is greedy.
        String regex = "A crusader for the rights of the weaker sections of the Association's (ADA's) ".replaceAll(" ", ".*");
        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(text);
        // Check all occurrences
        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end());
            System.out.println(" Found: " + matcher.group());
        }
    }
}
It fails on edge cases such as the following.
Case 1:
Original text = "A crusader for the rights of the weaker sections of the Association's (ADA's),choice as the presidential candidate is being seen as a political masterstroke.";
Text to match = "A crusader for the rights of the weaker sections of the Association's (ADA's)"
Expected output: "A crusader for the rights of the weaker sections of the Association's (ADA's)"
Case 2:
Original text:
“<ul>
<li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
<li>Aliquam tincidunt mauris eu risus.</li>
<li>Vestibulum auctor dapibus neque.</li>
see (<a href=\"https://www.webpagefx.com/web-design/html-ipsum/">HTML Content Sample </a>.)
</ul>”
Text to match: “see (HTML Content Sample.)”
Expected output: “see (<a href=\"https://www.webpagefx.com/web-design/html-ipsum/">HTML Content Sample </a>.)”
Case 3:
Original text = "Initial history includes the following:</p>\n<p>Documentation of <li>Aliquam tincidunt mauris eu risus.</li>"
Text to match = "Initial history includes the following: Documentation of"
Expected output of the match: "Initial history includes the following :</p>\n<p>Documentation of"
【Comments】:
-
First of all, some characters are reserved in regular expressions, such as the dot and parentheses (). How are you handling those?
-
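I know this isn't very helpful, but for the reasons given here, I personally would not recommend regex for HTML manipulation. If you absolutely must use regex, there may also be some answers there that help. stackoverflow.com/questions/1732348/…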
-
@YCF_L For the parentheses (), I replace them with spaces.
-
@YCF_L Any thoughts on the problem statement above?
-
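Well, this is not a simple problem @pankajdesai, because in many cases you have to deal not only with parentheses but also with the dot and other characters that are reserved in regex :)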
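-
One way to act on the comments above, offered as a rough sketch rather than a definitive answer: instead of replacing spaces with ".*", split the plain text into tokens, escape each token with Pattern.quote so that dots and parentheses are matched literally, and join the tokens with a "gap" that only allows whitespace and HTML tags. The class name HtmlTextMatcher and the helpers buildPattern and findInHtml are hypothetical names made up for this illustration, and the tag pattern <[^>]*> is a simplifying assumption that will not cope with every kind of HTML (comments, ">" inside attribute values, entities such as &nbsp;).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of any library.
final class HtmlTextMatcher {
    // Assumption: a "gap" is one whitespace character or one simple HTML tag.
    private static final String GAP = "(?:\\s|<[^>]*>)";
    // A token is a run of word characters or a single punctuation character.
    private static final Pattern TOKEN = Pattern.compile("\\w+|[^\\w\\s]");

    // Builds a regex in which every token is quoted (so '.', '(' and ')' are
    // literal) and tokens are separated by gaps instead of ".*".
    static Pattern buildPattern(String plainText) {
        StringBuilder regex = new StringBuilder();
        Matcher tokens = TOKEN.matcher(plainText);
        int previousEnd = -1;
        while (tokens.find()) {
            if (previousEnd >= 0) {
                // Whitespace in the plain text requires at least one gap in
                // the HTML; adjacent tokens may still be split by a tag.
                boolean hadSpace = tokens.start() > previousEnd;
                regex.append(GAP).append(hadSpace ? "+" : "*");
            }
            regex.append(Pattern.quote(tokens.group()));
            previousEnd = tokens.end();
        }
        return Pattern.compile(regex.toString());
    }

    // Returns the matching stretch of the HTML unchanged, or null if none.
    static String findInHtml(String html, String plainText) {
        Matcher matcher = buildPattern(plainText).matcher(html);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        String html = "<li>Vestibulum auctor dapibus neque.</li>\n"
                + "see (<a href=\"https://www.webpagefx.com/web-design/html-ipsum/\">"
                + "HTML Content Sample </a>.)\n</ul>";
        System.out.println(findInHtml(html, "see (HTML Content Sample.)"));

        String html3 = "Initial history includes the following:</p>\n<p>Documentation of "
                + "<li>Aliquam tincidunt mauris eu risus.</li>";
        System.out.println(findInHtml(html3,
                "Initial history includes the following: Documentation of"));
    }
}
```

Because the match is taken with matcher.group() on the original string, the returned HTML is exactly the substring of the input, tags included. The gap is deliberately narrow: unlike ".*", it cannot swallow ordinary text, so the match stops at the last token of the plain text instead of running to the end of the input. For anything beyond simple markup, an HTML parser is likely a safer basis than regex, as the linked discussion suggests.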