【Question title】: Get all <p> texts after <div> and between <h2> by using Jsoup
【Posted】: 2017-05-20 01:29:59
【Question description】:
<h2><span class="mw-headline" id="The_battle">The battle</span></h2>
<div class="thumb tright"></div>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<p>text I want</p>
<h2>Second Title I want to stop collecting p tags after</h2>

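I am learning Jsoup by trying to scrape all the p tags on a Wikipedia page, grouped by the heading they appear under. With the help of this question, I can scrape all the p tags between h2 elements:
extract unidentified html content from between two tags, using jsoup? regex?

by using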

Elements elements = docx.select("span.mw-headline, h2 ~ p");

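But when there is a <div> between them, I cannot scrape the paragraphs. This is the Wikipedia page I am working with: https://simple.wikipedia.org/wiki/Battle_of_Hastings

How can I get all the p tags between two specific h2 tags? Preferably ordered by id.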

【Comments】:

    Tags: java html web-scraping jsoup wikipedia


    【Solution 1】:

    Try this selector: Elements elements = doc.select("span.mw-headline, h2 ~ div, h2 ~ p");

    Sample code:

    package jsoupex;
    
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    
    import java.io.IOException;
    
    /**
     * Example program that prints each section headline of a Wikipedia page,
     * followed by the text of the div and p elements that come after an h2.
     */
    public class stackoverflw {
        public static void main(String[] args) throws IOException {
    
            // The URL must not contain a trailing space.
            String url = "https://simple.wikipedia.org/wiki/Battle_of_Hastings";
            // printf substitutes %s; println with "+" would print the placeholder literally.
            System.out.printf("Fetching %s...%n", url);
    
            Document doc = Jsoup.connect(url).get();
            Elements elements = doc.select("span.mw-headline, h2 ~ div, h2 ~ p");
    
            for (Element elem : elements) {
                if ( elem.hasClass("mw-headline")) {
                    System.out.println("************************");
                }
                System.out.println(elem.text());
                if ( elem.hasClass("mw-headline")) {
                    System.out.println("************************");
                } else {
                    System.out.println("");
                }           
            }
        }
    }
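
The selector above matches every div and p that follows any h2, so it does not stop at the next heading by itself. A minimal sketch of one alternative, assuming the section id sits on a span inside the h2 (as in the question's snippet) and that the wanted paragraphs are direct siblings of that h2: walk the siblings and stop at the next h2. The class name SectionParagraphs and the method paragraphsAfter are hypothetical names chosen for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SectionParagraphs {

    /** Text of every p sibling between the h2 that holds headlineId and the next h2. */
    public static List<String> paragraphsAfter(Document doc, String headlineId) {
        List<String> texts = new ArrayList<>();
        Element heading = doc.getElementById(headlineId);
        // Assumption: the id is on an element inside the h2, so climb up to the h2 itself.
        while (heading != null && !heading.tagName().equals("h2")) {
            heading = heading.parent();
        }
        if (heading == null) {
            return texts; // id not found, or not inside an h2
        }
        for (Element sib = heading.nextElementSibling(); sib != null; sib = sib.nextElementSibling()) {
            if (sib.tagName().equals("h2")) {
                break; // the next section starts here
            }
            if (sib.tagName().equals("p")) {
                texts.add(sib.text()); // div and other siblings are skipped
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        String html =
                "<h2><span class=\"mw-headline\" id=\"The_battle\">The battle</span></h2>"
                + "<div class=\"thumb tright\"></div>"
                + "<p>text I want</p>"
                + "<p>text I want</p>"
                + "<h2>Second Title</h2>"
                + "<p>text I do not want</p>";
        Document doc = Jsoup.parse(html);
        for (String text : paragraphsAfter(doc, "The_battle")) {
            System.out.println(text);
        }
    }
}
```

On a live page the same method could be called on the Document returned by Jsoup.connect(url).get(). If the page wraps its headings or paragraphs in extra container elements, the sibling walk would need to start from that container instead.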
    

    【Comments】:

      【Solution 2】:
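      import org.jsoup.Jsoup;
      import org.jsoup.nodes.Document;
      import org.jsoup.nodes.Element;
      import org.jsoup.nodes.TextNode;
      import org.jsoup.parser.Parser;
      import org.jsoup.select.Elements;
      
      import java.util.ArrayList;
      import java.util.List;
      
      // Wrapper class added so the snippet compiles; the name is arbitrary.
      public class TextNodeCollector {
          public static void main(String[] args) {
              String entity =
                      "<h2><span class=\"mw-headline\" id=\"The_battle\">The battle</span></h2>" +
                      "<div class=\"thumb tright\"></div>" +
                      "<p>text I want</p>" +
                      "<p>text I want</p>" +
                      "<p>text I want</p>" +
                      "<p>text I want</p>" +
                      "<h2>Second Title I want to stop collecting p tags after</h2>";
      
              // The XML parser keeps the fragment as-is instead of wrapping it in html/body.
              Document element = Jsoup.parse(entity, "", Parser.xmlParser());
              element.outputSettings().prettyPrint(false);
              element.outputSettings().outline(false);
              // Note: the result contains the heading texts as well as the paragraph texts.
              List<TextNode> text = getAllTextNodes(element);
              for (TextNode t : text) {
                  System.out.println(t.text());
              }
          }
      
          // Collects the direct text nodes of every element, in document order.
          private static List<TextNode> getAllTextNodes(Element newElementValue) {
              List<TextNode> textNodes = new ArrayList<>();
              Elements elements = newElementValue.getAllElements();
              for (Element e : elements) {
                  textNodes.addAll(e.textNodes());
              }
              return textNodes;
          }
      }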
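
The flat list of text nodes above does not record which heading each text belongs to. A rough sketch of one way to keep that association, assuming each section heading is an h2 that may contain a span.mw-headline with an id, as in the question's snippet: iterate over h2 and p elements in document order and start a new bucket at every h2. ParagraphsByHeading and group are hypothetical names used only here, and unlike a sibling walk this also picks up p elements nested inside other containers.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ParagraphsByHeading {

    /** Maps each h2 (keyed by its headline id, or its text if there is no id) to the p texts after it. */
    public static Map<String, List<String>> group(Document doc) {
        Map<String, List<String>> sections = new LinkedHashMap<>(); // keeps page order
        List<String> current = null;
        // select() returns matches in document order, so each p lands under the latest h2.
        for (Element el : doc.select("h2, p")) {
            if (el.tagName().equals("h2")) {
                Element headline = el.select("span.mw-headline").first();
                String key = (headline != null && !headline.id().isEmpty())
                        ? headline.id()
                        : el.text();
                current = new ArrayList<>();
                sections.put(key, current);
            } else if (current != null) {
                current.add(el.text()); // paragraphs before the first h2 are ignored
            }
        }
        return sections;
    }

    public static void main(String[] args) {
        String html =
                "<h2><span class=\"mw-headline\" id=\"The_battle\">The battle</span></h2>"
                + "<div class=\"thumb tright\"></div>"
                + "<p>text I want</p>"
                + "<h2>Second Title</h2>"
                + "<p>another paragraph</p>";
        Map<String, List<String>> sections = group(Jsoup.parse(html));
        for (Map.Entry<String, List<String>> entry : sections.entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

The LinkedHashMap preserves the order in which headings appear on the page. To order the sections by id instead, as the question asks, the result could be copied into a java.util.TreeMap, which sorts its keys lexicographically.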
      

      【Comments】:
