【Question title】: JSoup parsing HTML table in div
【Posted】: 2016-05-10 08:25:42
【Question】:

I am trying to scrape the following site:

http://services2.hdb.gov.sg/webapp/BB33RTIS/BB33SSearchWidget

I connect to the site and parse the HTML table as follows:

Document doc = Jsoup
        .connect("http://services2.hdb.gov.sg/webapp/BB33RTIS/BB33SSearchWidget")
        .data("FLAT_TYPE", "02")
        .data("NME_NEWTOWN", "BD      Bedok")
        .data("NME_STREET", "")
        .data("NUM_BLK_FROM", "")
        .data("NUM_BLK_TO", "")
        .data("dteRange", "12")
        .data("DTE_APPROVAL_FROM", "May 2015")
        .data("DTE_APPROVAL_TO", "May 2016")
        .data("AMT_RESALE_PRICE_FROM", "")
        .data("AMT_RESALE_PRICE_TO", "")
        .data("Process", "continue")
        .cookies(cookies)
        .timeout(0)
        .post();

Element table = doc.getElementsByTag("table").first();

I also tried the following, but the table is still null:

Element tableBody = doc.select("div[class=content]").select("table").first();

The table is always null. Could someone tell me what I am doing wrong? Thanks in advance.

【Comments】:

  • Since your last post, a script has been added to the site to block bots. See @Martic's post for a working solution.
  • @nyname00 Interesting. :) Martic's solution worked for me. Thanks.

Tags: html web-crawler jsoup


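【Solution 1】:

You have to add one more parameter to the request: a hidden form field that the search page sends along with the visible fields. Fetch the page first, read that field's name and value, and include it in the POST together with the cookies from the first response.

Working code: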

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import java.io.IOException;

    try {

        String url = "https://services2.hdb.gov.sg/webapp/BB33RTIS/BB33SSearchWidget";
        String userAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko)" +
                " Chrome/33.0.1750.152 Safari/537.36";

        // Step 1: GET the search form to obtain the session cookies and the hidden field.
        Connection.Response response = Jsoup
                .connect(url)
                .userAgent(userAgent)
                .ignoreHttpErrors(true)
                .method(Connection.Method.GET)
                .execute();

        Document responseDocument = Jsoup.parse(response.body());

        // The last hidden input inside div.row carries the extra parameter the server expects.
        Element rtisEnqFlagID = responseDocument.select("div.row input[type=hidden]").last();
        String name = rtisEnqFlagID.attr("name");
        String value = rtisEnqFlagID.attr("value");

        // Step 2: POST the search, sending the hidden field and the cookies from step 1.
        Document document = Jsoup.connect(url)
                .userAgent(userAgent)
                .data("FLAT_TYPE", "02")
                .data("NME_NEWTOWN", "BD      Bedok")
                .data("NME_STREET", "")
                .data("NUM_BLK_FROM", "")
                .data("NUM_BLK_TO", "")
                .data("dteRange", "12")
                .data("DTE_APPROVAL_FROM", "May 2015")
                .data("DTE_APPROVAL_TO", "May 2016")
                .data("AMT_RESALE_PRICE_FROM", "")
                .data("AMT_RESALE_PRICE_TO", "")
                .data("Process", "continue")
                .data(name, value)
                .cookies(response.cookies())
                .post();

        // Every result is rendered as its own table inside div.content.
        Elements tableBody = document.select("div.content table");

        for (Element table : tableBody)
            System.out.println(table);

    } catch (IOException e) {
        e.printStackTrace();
    }

Output:

<table style="margin-bottom: .5em; width: 100%;"> 
 <tbody>
  <tr> 
   <th width="46%" style="text-align: left;"><span>Block</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>514</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Storey</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>07 to 09</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Floor Area (sqm)/Flat Model</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>45.00 <br>Improved</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Lease Commence Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>1979</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Price</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>$240,000.00</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Registration Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>Jun 2015</span></td> 
  </tr> 
 </tbody>
</table>
<table style="margin-bottom: .5em; width: 100%;"> 
 <tbody>
  <tr> 
   <th width="46%" style="text-align: left;"><span>Block</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>101</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Storey</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>07 to 09</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Floor Area (sqm)/Flat Model</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>45.00 <br>Improved</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Lease Commence Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>1978</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Price</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>$240,000.00</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Registration Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>Nov 2015</span></td> 
  </tr> 
 </tbody>
</table>
<table style="margin-bottom: .5em; width: 100%;"> 
 <tbody>
  <tr> 
   <th width="46%" style="text-align: left;"><span>Block</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>113</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Storey</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>10 to 12</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Floor Area (sqm)/Flat Model</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>44.00 <br>Improved</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Lease Commence Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>1978</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Price</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>$244,000.00</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Registration Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>Mar 2016</span></td> 
  </tr> 
 </tbody>
</table>
<table style="margin-bottom: .5em; width: 100%;"> 
 <tbody>
  <tr> 
   <th width="46%" style="text-align: left;"><span>Block</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>535</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Storey</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>01 to 03</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Floor Area (sqm)/Flat Model</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>45.00 <br>Improved</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Lease Commence Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>1986</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Price</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>$250,000.00</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Registration Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>Jan 2016</span></td> 
  </tr> 
 </tbody>
</table>
<table style="margin-bottom: .5em; width: 100%;"> 
 <tbody>
  <tr> 
   <th width="46%" style="text-align: left;"><span>Block</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>534</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Storey</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>04 to 06</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Floor Area (sqm)/Flat Model</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>45.00 <br>Improved</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Lease Commence Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>1986</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Price</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>$248,000.00</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Registration Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>Nov 2015</span></td> 
  </tr> 
 </tbody>
</table>
<table style="margin-bottom: .5em; width: 100%;"> 
 <tbody>
  <tr> 
   <th width="46%" style="text-align: left;"><span>Block</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>535</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Storey</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>10 to 12</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Floor Area (sqm)/Flat Model</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>45.00 <br>Improved</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Lease Commence Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>1986</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Price</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>$230,000.00</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Registration Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>Nov 2015</span></td> 
  </tr> 
 </tbody>
</table>
<table style="margin-bottom: .5em; width: 100%;"> 
 <tbody>
  <tr> 
   <th width="46%" style="text-align: left;"><span>Block</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>535</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Storey</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>04 to 06</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Floor Area (sqm)/Flat Model</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>45.00 <br>Improved</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Lease Commence Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>1986</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Price</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>$246,500.00</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Registration Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>Oct 2015</span></td> 
  </tr> 
 </tbody>
</table>
<table style="margin-bottom: .5em; width: 100%;"> 
 <tbody>
  <tr> 
   <th width="46%" style="text-align: left;"><span>Block</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>541</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Storey</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>10 to 12</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Floor Area (sqm)/Flat Model</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>45.00 <br>Improved</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Lease Commence Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>1985</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Price</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>$238,000.00</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Registration Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>Jul 2015</span></td> 
  </tr> 
 </tbody>
</table>
<table style="margin-bottom: .5em; width: 100%;"> 
 <tbody>
  <tr> 
   <th width="46%" style="text-align: left;"><span>Block</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>620</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Storey</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>07 to 09</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Floor Area (sqm)/Flat Model</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>45.00 <br>Improved</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Lease Commence Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>1986</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Price</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>$250,000.00</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Registration Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>Mar 2016</span></td> 
  </tr> 
 </tbody>
</table>
<table style="margin-bottom: .5em; width: 100%;"> 
 <tbody>
  <tr> 
   <th width="46%" style="text-align: left;"><span>Block</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>618</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Storey</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>04 to 06</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Floor Area (sqm)/Flat Model</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>45.00 <br>Improved</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Lease Commence Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>1986</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Price</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>$250,000.00</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Registration Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>Feb 2016</span></td> 
  </tr> 
 </tbody>
</table>
<table style="margin-bottom: .5em; width: 100%;"> 
 <tbody>
  <tr> 
   <th width="46%" style="text-align: left;"><span>Block</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>620</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Storey</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>01 to 03</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Floor Area (sqm)/Flat Model</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>45.00 <br>Improved</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Lease Commence Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>1986</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Price</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>$245,000.00</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Registration Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>May 2015</span></td> 
  </tr> 
 </tbody>
</table>
<table style="margin-bottom: .5em; width: 100%;"> 
 <tbody>
  <tr> 
   <th width="46%" style="text-align: left;"><span>Block</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>38</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Storey</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>07 to 09</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Floor Area (sqm)/Flat Model</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>44.00 <br>Improved</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Lease Commence Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>1978</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Price</span> </th> 
   <td width="54%" style="vertical-align: middle;"><span>$253,000.00</span></td> 
  </tr> 
  <tr> 
   <th width="46%" align="left" style="text-align: left;"><span>Resale Registration Date</span></th> 
   <td width="54%" style="vertical-align: middle;"><span>May 2015</span></td> 
  </tr> 
 </tbody>
</table>
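
Each result arrives as a separate two-column table, with the label in a th cell and the value in a td cell. If structured data is more convenient than raw HTML, the rows can be collected into label/value pairs. The following is only a minimal sketch: it assumes jsoup 1.11.1 or later on the classpath (for `selectFirst`), and the class name `ResaleTableParser` and the inline sample HTML are illustrative, not part of the answer above.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns every table under div.content into an ordered label -> value map.
final class ResaleTableParser {

    static List<Map<String, String>> parse(String html) {
        Document document = Jsoup.parse(html);
        List<Map<String, String>> records = new ArrayList<>();
        for (Element table : document.select("div.content table")) {
            // LinkedHashMap keeps the rows in the order they appear on the page.
            Map<String, String> record = new LinkedHashMap<>();
            for (Element row : table.select("tr")) {
                Element label = row.selectFirst("th");
                Element value = row.selectFirst("td");
                // Skip rows that do not have both a header cell and a data cell.
                if (label != null && value != null) {
                    record.put(label.text(), value.text());
                }
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        // Small sample shaped like the output above; real input would be the POST response.
        String html = "<div class=\"content\"><table><tbody>"
                + "<tr><th><span>Block</span> </th><td><span>514</span></td></tr>"
                + "<tr><th><span>Resale Price</span> </th><td><span>$240,000.00</span></td></tr>"
                + "</tbody></table></div>";
        for (Map<String, String> record : parse(html)) {
            System.out.println(record);
        }
    }
}
```

With the working code above, `parse(document.html())` would be one way to feed the fetched page into this helper; if the page contains no matching tables, the returned list is simply empty.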

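【Comments】:

  • Hi Martic, I tried connecting to the site with the code above, but it no longer seems to work. Has the site changed again? Sorry, I could not work out how to connect.
  • Thanks a lot, Martic. Could you let me know how to fix this? I opened the site in Chrome developer tools but could not find what had changed. Thanks again.
  • Could you tell me how to connect to this site? I tried, but now I get an error: org.jsoup.HttpStatusException: HTTP error fetching URL. Status=405.
  • I tried adding the following headers to the request, but no luck: .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8") .header("Accept-Encoding", "gzip, deflate, br") .header("Accept-Language", "en-US,en;q=0.8,si;q=0.6") .header("Cache-Control", "max-age=0") .header("Connection", "keep-alive")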

【Solution 2】:

The site now uses HTTPS. Change your URL to
String url = "https://services2.hdb.gov.sg/webapp/BB33RTIS/BB33SSearchWidget"; (https instead of http) and it will work.
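
If the URL comes from configuration or older code rather than a literal, the scheme can be switched in one place before it is passed to Jsoup.connect. This is only a rough sketch using java.net.URI from the standard library; `HttpsUpgrade` and `toHttps` are hypothetical names introduced here for illustration.

```java
import java.net.URI;

// Hypothetical helper: rewrites an http:// URL to https:// and leaves everything else untouched.
final class HttpsUpgrade {

    static String toHttps(String url) {
        // URI.create throws IllegalArgumentException if the string is not a valid URI.
        String scheme = URI.create(url).getScheme();
        if (!"http".equalsIgnoreCase(scheme)) {
            return url; // already https, or not an HTTP URL at all
        }
        // Replace only the scheme; host, port, path and query are kept exactly as written.
        return "https" + url.substring(scheme.length());
    }

    public static void main(String[] args) {
        System.out.println(toHttps("http://services2.hdb.gov.sg/webapp/BB33RTIS/BB33SSearchWidget"));
    }
}
```

An explicit port in the URL is left as is, so a URL that pins a plain-HTTP port would still need to be adjusted by hand.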

【Comments】:

  • Thanks, TDG. I changed the URL to https, but it still did not work, so I also had to add these headers: .header("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8") .header("Accept-Encoding", "gzip, deflate, br") .header("Accept-Language", "en-US,en;q=0.8,si;q=0.6") .header("Cache-Control", "max-age=0") .header("Connection", "keep-alive") Now it works fine.
  • Glad I could help. I used Martic's code as is, changing only http to https, and it worked fine without any of the headers you had to add.