【Question title】: Matching plain text to HTML content
【Posted】: 2017-11-21 21:12:01
【Question】:

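I need to match plain text against HTML content. Once a match is found, I need to extract the matched HTML without altering it, because I need exactly the same HTML content. I can do the matching with Java's regex utilities in many scenarios, but it fails in the scenarios below.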

Below is the sample code I use to match the text against an HTML string:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchDemo {
    public static void main(String[] args) {

        String text = "A crusader for the rights of the weaker sections of the Association's (ADA's),choice as the presidential candidate is being seen as a political masterstroke.";
        // Every space becomes ".*"; the ( ) are left unescaped, so the regex treats them as a group
        String regex = "A crusader for the rights of the weaker sections of the Association's (ADA's) ".replaceAll(" ", ".*");

        Pattern pattern = Pattern.compile(regex);
        Matcher matcher = pattern.matcher(text);
        // Check all occurrences
        while (matcher.find()) {

            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end());
            System.out.println(" Found: " + matcher.group());

        }
    }
}

It fails in the following edge cases.

Case 1:

Original text = "A crusader for the rights of the weaker sections of the Association's (ADA's),choice as the presidential candidate is being seen as a political masterstroke."

Text to match = "A crusader for the rights of the weaker sections of the Association's (ADA's)"

Expected output: “A crusader for the rights of the weaker sections of the Association's (ADA's)”

Case 2:

Original text:

“<ul>
   <li>Lorem ipsum dolor sit amet, consectetuer adipiscing elit.</li>
   <li>Aliquam tincidunt mauris eu risus.</li>
   <li>Vestibulum auctor dapibus neque.</li>
see (<a href=\"https://www.webpagefx.com/web-design/html-ipsum/">HTML Content Sample </a>.)
</ul>”

Text to match: “see (HTML Content Sample.)”

Expected output: “see (<a href=\"https://www.webpagefx.com/web-design/html-ipsum/">HTML Content Sample </a>.)”

Case 3:

Original text = "Initial history includes the following:</p>\n<p>Documentation of <li>Aliquam tincidunt mauris eu risus.</li>"

Text to match = "Initial history includes the following: Documentation of"

Expected output: “Initial history includes the following:</p>\n<p>Documentation of”

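【Comments】:

  • First of all, some characters are reserved in regular expressions, such as the dot and parentheses (). How do you handle those?
  • I know this isn't very helpful, but for the reasons given here I personally wouldn't recommend regex for HTML manipulation. If you absolutely must use regex, there may also be some answers there that help you. stackoverflow.com/questions/1732348/…
  • @YCF_L For the parentheses (), I replace them with spaces.
  • @YCF_L Any thoughts on the problem statement above?
  • Well, this is not a simple problem, @pankajdesai, because in many cases you have to handle not only parentheses but also dots and other reserved regex characters :)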
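
As a small illustration of the reserved-character point raised in these comments, the sketch below compares an unquoted search string with one passed through `Pattern.quote`. The class name `QuoteDemo` and the helper `firstMatch` are made up for this example, and it only covers literal matching, not the HTML cases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteDemo {
    // Hypothetical helper: returns the first match of regex in text, or null if there is none
    static String firstMatch(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String text = "the Association's (ADA's),choice";
        // Unquoted: ( ) act as a capturing group, so the parentheses themselves are not matched
        System.out.println(firstMatch("(ADA's)", text));                // prints ADA's
        // Quoted: every character of the search string is taken literally
        System.out.println(firstMatch(Pattern.quote("(ADA's)"), text)); // prints (ADA's)
    }
}
```

Quoting the text piece by piece, rather than replacing parentheses with spaces, keeps the matched substring identical to the original.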

Tags: java regex string


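【Answer 1】:

I recently came up with a regular expression that matches HTML tags, with support for quoted attributes and for escaped quotes inside quoted attributes. It looks something like this:
<([^'">]|"([^\\"]|\\"?)+"|'([^\\']|\\'?)+')+>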
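
To see how this expression behaves, here is a small sketch that writes it as a Java string literal and lists the tags it finds. The class name `TagRegexDemo`, the helper `findTags`, and the sample input are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagRegexDemo {
    // The tag expression above, with backslashes and double quotes escaped for Java
    static final Pattern TAG = Pattern.compile(
            "<([^'\">]|\"([^\\\\\"]|\\\\\"?)+\"|'([^\\\\']|\\\\'?)+')+>");

    // Hypothetical helper: collects every substring that the tag expression matches
    static List<String> findTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    public static void main(String[] args) {
        // The '>' inside the quoted title value does not end the tag
        System.out.println(findTags("<a title=\"x > y\" href='p.html'>link</a>"));
    }
}
```

Because each quoted value needs at least one character (`+`), an empty attribute such as `href=""` is not matched by this expression as written.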

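I think the easiest way to search HTML for plain text while preserving the HTML is to modify the plain text so that it ignores tags at word boundaries, à la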

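import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HtmlSearch {
    // Usage: htmlSearch("ab cd").matcher("<b>ab</b> <i>cd</i>").matches();
    public static Pattern htmlSearch(String plain) {
        // One HTML tag; quoted attribute values may contain '>' and escaped quotes
        String tag = "<(?:[^'\">]"
                   + "|\"(?:[^\\\\\"]|\\\\\"?)+\""
                   + "|'(?:[^\\\\']|\\\\'?)+')+>";
        String tags = "(?:" + tag + ")*";
        // Wherever (one or more) spaces are found, allow whitespace, &nbsp; and tags
        String gap = "(?:\\s|&nbsp;|" + tag + ")*";

        StringBuilder regex = new StringBuilder();
        // Walk the plain text token by token: whitespace run, word, number or single symbol
        Matcher m = Pattern.compile("(\\s+)|[A-Za-z]+|\\d+|[^A-Za-z\\d\\s]").matcher(plain);
        while (m.find()) {
            if (m.group(1) != null) {
                regex.append(gap);
                continue;
            }
            String literal;
            switch (m.group()) {
                // Handle special characters, which may appear as entities in the HTML
                case "<":  literal = "(?:<|&lt;|&#60;)"; break;
                case ">":  literal = "(?:>|&gt;|&#62;)"; break;
                case "&":  literal = "(?:&|&amp;|&#38;)"; break;
                case "'":  literal = "(?:'|&apos;|&#39;)"; break;
                case "\"": literal = "(?:\"|&quot;|&#34;)"; break;
                // Quote everything else so regex metacharacters are matched literally
                default:   literal = Pattern.quote(m.group());
            }
            // Check for tags before and after every word, number and symbol
            regex.append(tags).append(literal).append(tags);
        }
        return Pattern.compile(regex.toString());
    }
}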
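
To show how a pattern of this kind hands back the HTML untouched, here is a reduced sketch applied to case 3 from the question. It covers only the whitespace step, and it uses a simplified tag pattern `<[^>]*>` rather than the answer's tag expression. The class `ExtractHtmlDemo` and the method `extract` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExtractHtmlDemo {
    // Assumption: a simplified gap that accepts whitespace, &nbsp; or a tag with no '>' inside it
    static final String GAP = "(?:\\s|&nbsp;|<[^>]*>)*";

    // Returns the matching slice of html exactly as it appears, or null when nothing matches
    static String extract(String plain, String html) {
        StringBuilder regex = new StringBuilder();
        for (String word : plain.trim().split("\\s+")) {
            if (regex.length() > 0) {
                regex.append(GAP);              // whitespace in the plain text may hide tags
            }
            regex.append(Pattern.quote(word));  // words are matched literally
        }
        Matcher m = Pattern.compile(regex.toString()).matcher(html);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "Initial history includes the following:</p>\n<p>Documentation of "
                + "<li>Aliquam tincidunt mauris eu risus.</li>";
        System.out.println(
                extract("Initial history includes the following: Documentation of", html));
    }
}
```

Since `find()` and `group()` return a slice of the input string, the extracted HTML is never rebuilt or normalized. Tags that sit where the plain text has no whitespace, as in case 2, are outside what this reduced version handles.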
