How to remove text between <script></script> tags

【Question title】: How to remove text between <script></script> tags
【Posted】: 2015-12-26 21:26:52
【Question description】:

I want to remove the content between <script></script> tags. I am checking for the pattern manually and iterating with a while loop, but I get a StringIndexOutOfBoundsException at this line:

String script = source.substring(startIndex,endIndex-startIndex);

Here is the full method:

public static String getHtmlWithoutScript(String source) {
    String START_PATTERN = "<script>";
    String END_PATTERN = " </script>";
    while (source.contains(START_PATTERN)) {
        int startIndex=source.lastIndexOf(START_PATTERN);
        int endIndex=source.indexOf(END_PATTERN,startIndex);

        String script=source.substring(startIndex,endIndex);
        source.replace(script,"");
    }
    return source;
}

Am I doing something wrong here? I am getting endIndex=-1. Can anyone help me work out why my code is breaking?

【Comments】:

Tags: java html html-parsing

【Solution 1】:

String text = "<script>This is dummy text to remove </script> dont remove this";
StringBuilder sb = new StringBuilder(text);
String startTag = "<script>";
String endTag = "</script>";

// Remove the text between the script tags
sb.replace(text.indexOf(startTag) + startTag.length(), text.indexOf(endTag), "");

System.out.println(sb.toString());

If you also want to remove the script tags themselves, add the following line:

sb.toString().replace(startTag, "").replace(endTag, "")

Update:

If you don't want to use StringBuilder, you can do this:

String text = "<script>This is dummy text to remove </script> dont remove this";
String startTag = "<script>";
String endTag = "</script>";

// Remove the text between the script tags
String textToRemove = text.substring(text.indexOf(startTag) + startTag.length(), text.indexOf(endTag));
text = text.replace(textToRemove, "");

System.out.println(text);
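
Both snippets above handle a single script block. The following is a minimal sketch of a loop that removes several blocks, tags included, assuming the tags appear literally as <script> and </script> with no attributes. The class name ScriptStripper and the method name stripScripts are hypothetical. It also avoids two problems visible in the method from the question: END_PATTERN there begins with a space, so indexOf returns -1 whenever no space precedes the closing tag, and the result of source.replace(...) is discarded, so source never changes.

```java
final class ScriptStripper {
    private static final String START_TAG = "<script>";
    private static final String END_TAG = "</script>";

    // Removes every <script>...</script> block, including the tags themselves.
    static String stripScripts(String source) {
        StringBuilder sb = new StringBuilder(source);
        int start = sb.indexOf(START_TAG);
        while (start != -1) {
            int end = sb.indexOf(END_TAG, start + START_TAG.length());
            if (end == -1) {
                break; // unterminated block: leave the remainder untouched
            }
            sb.delete(start, end + END_TAG.length());
            // Continue searching from where the removed block used to begin.
            start = sb.indexOf(START_TAG, start);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p>a</p><script>var x = 1;</script><p>b</p><script>f();</script>";
        System.out.println(stripScripts(html)); // prints <p>a</p><p>b</p>
    }
}
```

Checking for -1 before deleting is what prevents the out-of-bounds exception when a closing tag is missing.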

【Comments】:

【Solution 2】:

You can use a regular expression to remove the script tag content:

public String removeScriptContent(String html) {
    if (html != null) {
        String re = "<script>(.*)</script>";

        Pattern pattern = Pattern.compile(re);
        Matcher matcher = pattern.matcher(html);
        if (matcher.find()) {
            return html.replace(matcher.group(1), "");
        }
    }
    return null;
}

You have to add these two imports:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

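【Comments】:

Did you actually test this?
I would make it lazy.

【Solution 3】:

I know I may be late to this, but I would like to offer a regular expression solution that has actually been tested.

What you need to keep in mind is that regular expression engines are greedy by default. A search pattern such as <script>(.*)</script> will therefore match everything from the first <script> up to the last </script> it can reach, which may be at the end of the line or the end of the file depending on the regex options in use.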

To match exactly what you want, you can use a "lazy" search.

Lazy search: <script>(.*?)<\/script>

Now you will get accurate results.
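
To make the difference concrete, here is a small sketch that applies both patterns to the same input. The class name LazyScriptRegex and the helper strip are hypothetical, and Pattern.DOTALL is an added assumption so that scripts spanning several lines are matched too. In a Java string the forward slash needs no escaping, so the pattern is written as </script> rather than <\/script>.

```java
import java.util.regex.Pattern;

final class LazyScriptRegex {
    // Greedy: .* extends to the LAST </script> it can reach.
    static final Pattern GREEDY = Pattern.compile("<script>(.*)</script>", Pattern.DOTALL);
    // Lazy: .*? stops at the FIRST </script> after each <script>.
    static final Pattern LAZY = Pattern.compile("<script>(.*?)</script>", Pattern.DOTALL);

    // Removes every match of the given pattern, tags included.
    static String strip(Pattern pattern, String html) {
        return pattern.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<script>a();</script>keep me<script>b();</script>";
        System.out.println(strip(GREEDY, html)); // prints an empty line: "keep me" was swallowed
        System.out.println(strip(LAZY, html));   // prints keep me
    }
}
```

With the greedy pattern the text between the two script blocks is lost, because a single match spans both blocks; the lazy pattern matches each block separately.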

You can read more about lazy and greedy regular expressions in this answer.

【Comments】: