【Question Title】: Scraping XML with JSoup
【Posted】: 2013-01-26 14:32:18
【Question】:

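I am trying to scrape the RSS feed located here.

For now I am just getting a feel for JSoup, so the code below is only a proof of concept (or at least an attempt at one).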

    public static void grabShakers(String url) throws IOException {

        doc = Jsoup.connect(url).get();

        desc = doc.select("title");
        links = doc.select("link");
        price = doc.select("span.price");

    }

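It grabs the title of every item perfectly. The output for the links is just ten repeated closing link tags, and it never finds any prices. I thought CDATA might be the problem, so I converted the doc to HTML, removed the comments with .replace, and converted it back to a Document for parsing, but to no avail. Any insight would be greatly appreciated.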

The following is the code I use to print each element:

    for (Element src : price) {
        System.out.println(src);
    }

【Comments】:

    Tags: java jsoup scrape


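    【Solution 1】:

    There are two problems with this feed:

    1. The parsed document contains only <link />..actual link.. rather than a complete link element that wraps the URL
    2. The description (which contains the price tag) is escaped HTML, so it is not parsed as markup

    Solution: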
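    Both problems can be reproduced on a small inline fragment without touching the network. This is a minimal sketch, assuming jsoup's default HTML parser, which treats link as an empty element. The class name, the method names, and the SAMPLE string are made up for illustration and are not data from the real feed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class HtmlParserFeedDemo {

    // Hypothetical one-item fragment standing in for the feed (not real feed data).
    static final String SAMPLE =
        "<item>"
        + "<link>http://example.com/item-1</link>"
        + "<description>&lt;span class=\"price\"&gt;$1.00&lt;/span&gt;</description>"
        + "</item>";

    /** Text inside the first link element, as seen by the default HTML parser. */
    static String linkText(String feed) {
        Element link = Jsoup.parse(feed).select("link").first();
        return link.text();
    }

    /** Text of the node that directly follows the first link element. */
    static String textAfterLink(String feed) {
        Node next = Jsoup.parse(feed).select("link").first().nextSibling();
        return next instanceof TextNode ? ((TextNode) next).text().trim() : "";
    }

    /** Number of span.price elements found without unescaping the description. */
    static int priceSpanCount(String feed) {
        return Jsoup.parse(feed).select("span.price").size();
    }

    public static void main(String[] args) {
        System.out.println("link text:       '" + linkText(SAMPLE) + "'");
        System.out.println("text after link: '" + textAfterLink(SAMPLE) + "'");
        System.out.println("span.price hits: " + priceSpanCount(SAMPLE));
    }
}
```

    The link element itself comes back empty, the URL sits in the text node after it, and span.price matches nothing because the description holds its markup only as escaped text.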

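        // Required imports:
        // import org.apache.commons.lang3.StringEscapeUtils;
        // import org.jsoup.Jsoup;
        // import org.jsoup.nodes.Document;
        // import org.jsoup.nodes.Element;

        final String url = "http://www.amazon.com/gp/rss/movers-and-shakers/appliances/ref=zg_bsms_appliances_rsslink";
        Document doc = Jsoup.connect(url).get();
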
        for( Element item : doc.select("item") ) // Select all items
        {
            final String title = item.select("title").first().text(); // select the 'title' of the item
            final String link = item.select("link").first().nextSibling().toString().trim(); // select 'link' (-1-)
    
            final Document descr = Jsoup.parse(StringEscapeUtils.unescapeHtml4(item.select("description").first().toString()));
            final String price = descr.select("span.price").first().text(); // select 'price' (-2-)
    
            // Output - Example
            System.out.println(title);
            System.out.println(link);
            System.out.println(price);
            System.out.println();
        }
    

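    Note 1: Workaround for the link. Select the (empty) link tag and take the text of its next node (a TextNode holding the actual link).

    Note 2: Workaround for the price. Select the description tag, unescape the HTML, parse it, and select the price. For the unescaping I used StringEscapeUtils.unescapeHtml4() from Apache Commons Lang.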
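
    An alternative that the answer above does not use is to parse the feed with jsoup's XML parser (Parser.xmlParser()). The link element then keeps its text, and text() on the description returns the markup with entities already decoded, so no separate unescaping library is needed. The following is only a sketch under those assumptions. The class name, the extractItems helper, and the SAMPLE string are hypothetical, and the sample assumes the description is entity-escaped rather than wrapped in CDATA.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

class XmlParserFeedSketch {

    // Hypothetical inline sample standing in for the real feed (not real feed data).
    static final String SAMPLE =
        "<rss><channel><item>"
        + "<title>#1: Sample item</title>"
        + "<link>http://example.com/item-1</link>"
        + "<description>&lt;span class=\"price\"&gt;$1.00&lt;/span&gt;</description>"
        + "</item></channel></rss>";

    /** Returns one {title, link, price} array per item element in the feed. */
    static List<String[]> extractItems(String xml) {
        // The XML parser does not apply HTML rules, so <link> keeps its text child.
        Document feed = Jsoup.parse(xml, "", Parser.xmlParser());
        List<String[]> rows = new ArrayList<>();
        for (Element item : feed.select("item")) {
            String title = item.select("title").text();
            String link = item.select("link").text().trim();
            // text() yields the description with entities decoded; parse that as HTML.
            Document descr = Jsoup.parse(item.select("description").text());
            Element priceEl = descr.select("span.price").first();
            String price = priceEl != null ? priceEl.text() : ""; // empty if no price span
            rows.add(new String[] { title, link, price });
        }
        return rows;
    }

    public static void main(String[] args) {
        for (String[] row : extractItems(SAMPLE)) {
            System.out.println(row[0]);
            System.out.println(row[1]);
            System.out.println(row[2]);
            System.out.println();
        }
    }
}
```

    For a live feed, the XML could be fetched first and its text passed to extractItems. The null check on the price element guards against items whose description has no span.price.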

    Output:
    (using the link above)

    #1: Epicurean Gourmet Series 20-Inch-by-15-Inch Cutting Board with Cascade Effect, Nutmeg with Natural Core
    http://www.amazon.com/Epicurean-Gourmet-20-Inch-15-Inch-Cutting/dp/B003MU9PLU/ref=pd_zg_rss_ms_la_appliances_1
    $72.95
    
    #2: GE 45600 Z-Wave Basic Handheld Remote
    http://www.amazon.com/GE-45600-Z-Wave-Handheld-Remote/dp/B0013V6RW0/ref=pd_zg_rss_ms_la_appliances_2
    $3.00
    
    #3: First Alert RD1 Radon Gas Test Kit
    http://www.amazon.com/First-Alert-RD1-Radon-Test/dp/B00002N83E/ref=pd_zg_rss_ms_la_appliances_3
    $10.60
    
    #4: Presto 04820 PopLite Hot Air Popper, White
    http://www.amazon.com/Presto-04820-PopLite-Popper-White/dp/B00006IUWA/ref=pd_zg_rss_ms_la_appliances_4
    $9.99
    
    #5: New 20 oz Espresso Coffee Milk Frothing Pitcher, Stainless Steel, 18/8 gauge
    http://www.amazon.com/Espresso-Coffee-Frothing-Pitcher-Stainless/dp/B000FNK3Z4/ref=pd_zg_rss_ms_la_appliances_5
    $8.19
    
    #6: PUR 18 Cup Dispenser with One Pitcher Filter DS-1800Z
    http://www.amazon.com/PUR-Dispenser-Pitcher-Filter-DS-1800Z/dp/B0006MQCA4/ref=pd_zg_rss_ms_la_appliances_6
    $22.17
    
    #7: Hamilton Beach 70610 500-Watt Food Processor, White
    http://www.amazon.com/Hamilton-Beach-70610-500-Watt-Processor/dp/B000SAOF5S/ref=pd_zg_rss_ms_la_appliances_7
    $21.95
    
    #8: West Bend 77203 Electric Can Opener, Metallic
    http://www.amazon.com/West-Bend-77203-Electric-Metallic/dp/B00030J1U2/ref=pd_zg_rss_ms_la_appliances_8
    $35.79
    
    #9: Custom Leathercraft 2077L Black Ski Glove, Large
    http://www.amazon.com/Custom-Leathercraft-2077L-Black-Glove/dp/B00499BS9A/ref=pd_zg_rss_ms_la_appliances_9
    $8.83
    
    #10: Cuisinart CPC-600 1000-Watt 6-Quart Electric Pressure Cooker, Brushed Stainless and Matte Black
    http://www.amazon.com/Cuisinart-CPC-600-1000-Watt-Electric-Stainless/dp/B000MPA044/ref=pd_zg_rss_ms_la_appliances_10
    $64.95
    

    【Discussion】:
