【Question title】: Using JSoup to aggregate data
【Posted】: 2013-11-13 22:56:38
【Question】:

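I'm trying to use JSoup to scrape some content from http://dictionary.reference.com/browse/quick. If you go to that page, you'll see that they organize their data by presenting each "word type" (adjective, verb, noun) of the word quick as its own section, and each section contains a list of 1+ definitions.

To make things a little more complicated, every word in every definition is a link to another dictionary.com page: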

quick
    adjective
        1. done, proceeding, or occurring with promptness or rapidity...
        2. that is over or completed within a short interval of time
        ...
        14. Archaic.
            a. endowed with life
            b. having a high degree of vigor, energy, ...
    noun
        1. living persons; the quick and the dead
        2. the tender, sensitive flesh of the living body...
        ...
    adverb
        ...

What I want to do is use JSoup to fetch each word type and its respective definitions as a list of strings, like so:

public class Metadata {
    // Ex: "adjective", "noun", etc.
    private String wordType;

    // Ex: String #1: "1. done, proceeding, or occurring with promptness or rapidity..."
    //     String #2: "that is over or completed within a short interval of time..."
    private List<String> definitions;
}

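So the page would effectively consist of a List<Metadata>, where each Metadata element is a word type paired with its 1+ definitions.

I was able to find the list of word types with a very simple API call: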

// Contains 1 Element for each word type, like "adjective", "noun", etc.
Document doc = Jsoup.connect("http://dictionary.reference.com/browse/quick").get();
Elements wordTypes = doc.select("div.body div.pbk span.pg");

But I'm struggling to figure out what other doc.select(...) calls are needed, and what else I have to do, to obtain each Metadata instance.

【Comments】:

    Tags: java css-selectors web-crawler jsoup


    【Solution 1】:

    If you look at the HTML that Jsoup retrieves from that page, you'll see something like

      <div class="body"> 
         <div class="pbk"> 
          <span class="pg">adjective </span> 
          <div class="luna-Ent">
           <span class="dnindex">1.</span>
           <div class="dndata">
            done, proceeding, or occurring with promptness or rapidity, as an action, process, etc.; prompt; immediate: 
            <span class="ital-inline">a quick response.</span> 
           </div>
          </div>
          <div class="luna-Ent">
           <span class="dnindex">2.</span>
           <div class="dndata">
            that is over or completed within a short interval of time: 
            <span class="ital-inline">a quick shower.</span> 
           </div>
          </div>
    ...
         <div class="pbk"> 
          <span class="pg">adverb </span> 
          <div class="luna-Ent">
           <span class="dnindex">19.</span>
           <div class="dndata">
            <a style="font-style:normal; font-weight:normal;" href="/browse/quickly">quickly</a>.
           </div>
          </div> 
         </div> 
    

    So each section

    adjective
        1. done, proceeding, or occurring with promptness or rapidity...
        2. that is over or completed within a short interval of time
        ...
        14. Archaic.
            a. endowed with life
            b. having a high degree of vigor, energy, ...
    noun
        1. living persons; the quick and the dead
        2. the tender, sensitive flesh of the living body...
        ...
    adverb
        ...
    

    is inside a <div class="pbk">, which contains a <span class="pg">adjective </span> holding the section name, plus the definitions in <div class="luna-Ent"> divs. So you can try something like

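    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    // Fetch and parse the page
    Document doc = Jsoup.connect("http://dictionary.reference.com/browse/quick").get();
    
    // Each div.pbk is one word-type section
    Elements sections = doc.select("div.body div.pbk");
    for (Element element : sections) {
        // The section name ("adjective", "noun", ...) is in the element with class "pg"
        String elementType = element.getElementsByClass("pg").text();
        System.out.println("--------------------");
        System.out.println(elementType);
    
        // Each element with class "luna-Ent" is one numbered definition
        for (Element definition : element.getElementsByClass("luna-Ent")) {
            System.out.println(definition.text());
        }
    }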
    

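    This code selects all the sections, uses element.getElementsByClass("pg") to find each section's name, and finds the definitions by relying on the fact that they sit in divs with the class luna-Ent, via element.getElementsByClass("luna-Ent"). (If you want to skip the leading index numbers such as 1., you can select the dndata class instead of luna-Ent.)

    Output: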

    --------------------
    adjective
    1. done, proceeding, or occurring with promptness or rapidity, as an action, process, etc.; prompt; immediate: a quick response.
    2. that is over or completed within a short interval of time: a quick shower.
    3. moving, or able to move, with speed: a quick fox; a quick train.
    4. swift or rapid, as motion: a quick flick of the wrist.
    5. easily provoked or excited; hasty: a quick temper.
    6. keenly responsive; lively; acute: a quick wit.
    7. acting with swiftness or rapidity: a quick worker.
    8. prompt or swift to do something: quick to respond.
    9. prompt to perceive; sensitive: a quick eye.
    10. prompt to understand, learn, etc.; of ready intelligence: a quick student.
    11. (of a bend or curve) sharp: a quick bend in the road.
    12. consisting of living plants: a quick pot of flowers.
    13. brisk, as fire, flames, heat, etc.
    14. Archaic. a. endowed with life. b. having a high degree of vigor, energy, or activity.
    --------------------
    noun
    15. living persons: the quick and the dead.
    16. the tender, sensitive flesh of the living body, especially that under the nails: nails bitten down to the quick.
    17. the vital or most important part.
    18. Chiefly British. a. a line of shrubs or plants, especially of hawthorn, forming a hedge. b. a single shrub or plant in such a hedge.
    --------------------
    adverb
    19. quickly.
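
The loop above only prints the sections. To get the List<Metadata> the question asks for, the same traversal could collect its results instead. The following is a minimal sketch, not the answer author's code: the `MetadataExtractor` wrapper, the `Metadata` constructor and getters, the `extract` method, and the inline HTML fragment (modeled on the markup shown earlier so that no network access is needed) are all assumptions added for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class MetadataExtractor {

    /** Pairs one word type with its definitions, as described in the question. */
    public static class Metadata {
        private final String wordType;
        private final List<String> definitions;

        public Metadata(String wordType, List<String> definitions) {
            this.wordType = wordType;
            this.definitions = definitions;
        }

        public String getWordType() { return wordType; }
        public List<String> getDefinitions() { return definitions; }
    }

    /** Builds one Metadata per div.pbk section of the given document. */
    public static List<Metadata> extract(Document doc) {
        List<Metadata> result = new ArrayList<>();
        for (Element section : doc.select("div.body div.pbk")) {
            // span.pg holds the section name, e.g. "adjective"
            String wordType = section.select("span.pg").text().trim();
            List<String> definitions = new ArrayList<>();
            // each div.luna-Ent holds one numbered definition
            for (Element entry : section.select("div.luna-Ent")) {
                definitions.add(entry.text());
            }
            result.add(new Metadata(wordType, definitions));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical inline fragment; a live page would come from Jsoup.connect(...).get()
        String html = "<div class=\"body\">"
            + "<div class=\"pbk\"><span class=\"pg\">adjective </span> "
            + "<div class=\"luna-Ent\"><span class=\"dnindex\">1.</span> "
            + "<div class=\"dndata\">done, proceeding, or occurring with promptness or rapidity</div></div></div>"
            + "<div class=\"pbk\"><span class=\"pg\">adverb </span> "
            + "<div class=\"luna-Ent\"><span class=\"dnindex\">19.</span> "
            + "<div class=\"dndata\"><a href=\"/browse/quickly\">quickly</a>.</div></div></div>"
            + "</div>";
        for (Metadata m : extract(Jsoup.parse(html))) {
            System.out.println(m.getWordType() + " -> " + m.getDefinitions());
        }
    }
}
```

Selecting `div.luna-Ent` keeps the index number in each string, which matches the "1. done, proceeding..." format in the question; switching the inner selector to the dndata class would drop it.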
    

    【Comments】:

      【Solution 2】:

      Here you go. By the way, to test CSS selectors you can open the console in Chrome Developer Tools and try a query like this directly on their site: jQuery('div.body div.pbk div.luna-Ent > .dndata')

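      Document doc = Jsoup.connect("http://dictionary.reference.com/browse/quick").get();
      // Each div.pbk is one word-type section
      Elements wordTypes = doc.select("div.body div.pbk");
      
      for (Element wordType : wordTypes) {
          // span.pg holds the section name ("adjective", "noun", ...)
          Elements typeOfSpeech = wordType.select("span.pg");
      
          System.out.println("typeOfSpeech: " + typeOfSpeech.text());
      
          // .dndata children of div.luna-Ent hold the definition text without the index number
          Elements elements = wordType.select("div.luna-Ent > .dndata");
      
          // (i + 1) renumbers the definitions starting from 1 within each section
          for (int i = 0; i < elements.size(); i++) {
              Element element = elements.get(i);
              System.out.println((i + 1) + ". " + element.text());
          }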
      }
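
The selector can also be checked offline from Java instead of the browser console. The sketch below is an illustration only: the `SelectorCheck` class, the `textOf` helper, and the inline fragment (one section with one entry, modeled on the markup in Solution 1) are assumptions, not part of the original answer. It compares what `div.luna-Ent` and `div.luna-Ent > .dndata` return for the same entry.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class SelectorCheck {

    // Hypothetical fragment modeled on the markup shown in Solution 1.
    static final String HTML =
          "<div class=\"body\"><div class=\"pbk\">"
        + "<span class=\"pg\">adjective </span> "
        + "<div class=\"luna-Ent\"><span class=\"dnindex\">1.</span> "
        + "<div class=\"dndata\">done, proceeding, or occurring with promptness or rapidity</div>"
        + "</div></div></div>";

    /** Returns the combined text of every element matched by the CSS query. */
    static String textOf(String cssQuery) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(cssQuery).text();
    }

    public static void main(String[] args) {
        // Whole entry: index number plus definition text.
        System.out.println(textOf("div.body div.pbk div.luna-Ent"));
        // Direct .dndata child only: definition text without the index number.
        System.out.println(textOf("div.body div.pbk div.luna-Ent > .dndata"));
    }
}
```

Parsing a saved or inline fragment this way keeps the check repeatable even if the live page changes its markup.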
      

      【Comments】:
