Java regular expression to remove tags from HTML

【Question title】: Java Regular Expression to Remove tags from html
【Posted】: 2017-08-16 15:45:37
【Question description】:

<table><tr><td>HEADER</td><td>Header Value <supporting value></td></tr><tr><td>SUB</td><td>sub value. write to <test@gmail.com></td></tr><tr><td>START DATE</td><td>11/23/ 2016</td></tr><tr><td>END DATE</td><td>11/23/2016</td></tr></table>

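The text above is my HTML string, and I need to extract the values of HEADER, SUB, START DATE and END DATE. I am using Jsoup to extract the values, but I ran into a problem with the tags that are not HTML elements. The API either skips these elements or adds a closing tag that was not there originally.

So my idea is to replace the non-HTML element tags with &lt; and then use Jsoup to extract the values.

Any suggestions?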

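【Question discussion】:

Obligatory, for tradition's sake: stackoverflow.com/a/1732454/1225328
You are asking for a solution, but you have not defined the problem well. What pattern are you looking for?
@sp00m You cannot use a regular expression to parse an entire HTML document, but in this case, where only a few values following a well-defined pattern are extracted, it is possible.
@WiktorStribiżew That is not quite the same. These are not valid HTML tags.
Not a good idea. See stackoverflow.com/questions/701166/….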

Tags: java regex jsoup

【Solution 1】:

You may want to look at jSoup for parsing HTML documents. You can use its API to extract and manipulate data.

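【Discussion】:

That is not valid HTML.
@shmosel Please elaborate.
<supporting value> is not a valid HTML tag. Neither is the other one.
@shmosel It worked for me when I did a POC.
I did use Jsoup. When I parse the HTML string with Jsoup, the closing tags value> and @gmail.com> are added automatically. So my plan is to find a pattern and replace "" for the non-HTML elements. After that I will parse with Jsoup.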

【Solution 2】:

You can extract the content with this regular expression:

/<td>[^<]*<([^>]*)><\/td>/

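This assumes the markup layout always looks the same.

Although you cannot parse a complete HTML document with a regular expression, because HTML is not a regular language, a partial extraction like this one is in fact possible.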
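As a rough illustration, the pattern above could be applied in Java along these lines. The /.../ delimiters are dropped because java.util.regex takes the pattern as a plain string. The class and method names are hypothetical, and the sketch assumes the markup always has the same layout as the sample in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answer.
final class BracketValueExtractor {

    // Same pattern as in the answer, without the /.../ delimiters.
    private static final Pattern CELL =
            Pattern.compile("<td>[^<]*<([^>]*)></td>");

    // Returns the text between '<' and '>' for every cell that ends with such a token.
    static List<String> extract(String html) {
        List<String> values = new ArrayList<>();
        Matcher m = CELL.matcher(html);
        while (m.find()) {
            values.add(m.group(1));
        }
        return values;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>HEADER</td><td>Header Value <supporting value></td></tr>"
                + "<tr><td>SUB</td><td>sub value. write to <test@gmail.com></td></tr></table>";
        System.out.println(extract(html));
    }
}
```

With the sample HTML from the question, extract should return the two bracketed values, supporting value and test@gmail.com. Cells that do not end with a <...> token, such as the date cells, are skipped.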

【Discussion】:

【Solution 3】:

Found a solution: use the pattern <([^\s>/]+) to collect all the tags from the HTML string.

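Then replace the angle brackets of every tag other than TABLE, TR and TD with "&lt;" and "&gt;". When I parse the resulting text with Jsoup, I get the required values.

Please find the code below: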

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Files and Charsets are assumed to be Guava's; the post does not show its imports
import com.google.common.base.Charsets;
import com.google.common.io.Files;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

public class JsoupParser2 {

public static void main(String[] args) {

    HashSet<String> tagSet = new HashSet<String>();
    HashMap<String,String> notes = new HashMap<String,String>();

    try{

        //Read the html content as String (assumes the whole document is on the first line)
        File testFile = new File("C:\\test.html");
        List<String> content = Files.readLines(testFile,  Charsets.UTF_8);
        String testContent = content.get(0);

        //Collect every tag-like token whose name is not table, tr or td.
        //The whole token is stored so that "<supporting value>" is matched, not just "<supporting>"
        Pattern p = Pattern.compile("</?([^\\s>/]+)[^>]*>");
        Matcher m = p.matcher(testContent);
        while(m.find()) {
            String tag = m.group(1);
            if(!"table".equals(tag) && !"tr".equals(tag) && !"td".equals(tag)){
                tagSet.add(m.group());
            }
        }

        //Escape the non-html tokens. String.replace is literal, so '.' and similar are not treated as regex
        for(String originalString : tagSet){
            String replaceString = "&lt;"+originalString.substring(1, originalString.length()-1)+"&gt;";
            testContent = testContent.replace(originalString, replaceString);
        }

        //Parse the html content
        Document document = Jsoup.parse(testContent, "", Parser.xmlParser());

        //traverse through TR and TD to get to the values
        //store the values in the map
        Elements rows = document.select("tr");
        for (Element row : rows) {
            Elements cells = row.getElementsByTag("td");
            if(!cells.isEmpty()){
                String key = cells.get(0).text().trim();
                //html() keeps the entity-escaped form; text() would return the decoded text
                String value = cells.get(1).html().trim();
                System.out.println("KEY : "+key); System.out.println("VALUE : "+value);
                notes.put(key, value);
                System.out.println("==============================================");
            }
        }

    }catch (IOException e) {
        e.printStackTrace();
    }catch(IndexOutOfBoundsException ioobe){
        //Thrown when the file is empty or a row has fewer than two cells
        System.out.println("ioobe");
    }
}

}
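The escaping step in the code above could also be sketched as a single pass that uses only java.util.regex, which avoids a separate collect-then-replace loop. This is a variant for illustration, not the answer's own method: the class name, the method name and the case-insensitive comparison are assumptions, and the allowed tag names are taken from the table/tr/td check above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answer.
final class NonHtmlTagEscaper {

    // Tag names that are left untouched.
    private static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("table", "tr", "td"));

    // Group 1 is the tag name; an optional leading '/' is skipped.
    private static final Pattern TAG = Pattern.compile("</?([^\\s>/]+)[^>]*>");

    // Replaces '<' and '>' with entities on every token whose name is not allowed.
    static String escape(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String token = m.group();
            String name = m.group(1).toLowerCase(Locale.ROOT);
            if (!ALLOWED.contains(name)) {
                token = "&lt;" + token.substring(1, token.length() - 1) + "&gt;";
            }
            // quoteReplacement keeps '$' and '\' in the token literal.
            m.appendReplacement(out, Matcher.quoteReplacement(token));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The returned string can then be handed to Jsoup.parse as in the code above. Because the whole token is rewritten in place, a token such as <supporting value> is escaped together with the text after its name.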

【Discussion】: