How can I get the image src, title, and description from this HTML using Cheerio?

【Question title】: How can I get image src, title and the description from this html using cheerio?
【Posted】: 2017-09-24 17:56:52
【Question description】:

I am trying to extract some content from a website using Node.js with Cheerio. I want to extract the following:

The "This is my sample title text" text.
The "Here will be my description content" text.
The image source.

Here is the HTML:

     <body>
     <div class="detail_loop">
         <img class="imfast" data-original="http://www.example.com/wp-content/uploads/2017/03/imageurl-250x150.jpg" title=""
              align="left" width="250" height="150"
              src="http://www.example.com/wp-content/uploads/2017/03/imageurl-250x150.jpg" style="display: block;">
         <h2>
             <a href="http://www.example.com/2017/04/576487/" rel="bookmark">This is my titile text</a>
         </h2>
         Here will be my description content.
         <div class="clear"></div>
         <div class="send_loop" style="display: none;">
             <a href="http://www.example.com/2017/04/576487//#respond" target="_blank">
                 <div class="send_com">
                     <div class="send_bubb">
                         <div class="count">
                             0
                         </div>
                     </div>
                 </div>
             </a>
             <a href="https://www.facebook.com/sendr.php?u=http://www.example.com/2017/04/576487/" target="_blank">
                 <div class="send_fb">
                     <div class="send_bubb">
                         <div class="count">
                             send
                         </div>
                     </div>
                 </div>
             </a>
             <a href="https://twitter.com/send?url=http://www.example.com/2017/04/576487/&amp;text=this is sample title;hashtags=example"
                target="_blank">
                 <div class="send_tt">
                     <div class="send_bubb">
                         <div class="count">
                             Tweet
                         </div>
                     </div>
                 </div>
             </a>
             <div class="clear"></div>
         </div>
         <div class="clear"></div>
         <div class="detail_loop_dvd"></div>
         <div class="clear"></div>
     </div>
    </body>

【Discussion】:

Tags: javascript html node.js cheerio scraper

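【Solution 1】:

What is your goal? You can of course simply pass the data in directly: cheerio.load('<html><body>…</html>')

Sample code

Note: .text() returns the text of all child nodes (other elements and so on), so the filter returns true only for text nodes. [#20832910]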
    const cheerio = require('cheerio');
    const fs = require('fs');

    /**
     * Given data saved in file 'index.html' in current path
     */
    fs.readFile('index.html', {encoding: 'utf-8'}, (err, data) => {
        if (err) {
            console.log(err);
            return;
        }

        const $ = cheerio.load(data);

        /**
         * Print what you desire
         */
        console.log($('h2 a').text()); // Title text
        console.log($('div.detail_loop').contents().filter(function() {
            return this.type === 'text';
        }).text()); // Description content (without child nodes--only text)
        console.log($('img').attr('src')); // Image source
    });

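【Discussion】:

The string method .replace(/(^\s+|\s+$)/g, '') may help trim all whitespace before and after the string you call it on (from the start of the line to the first non-whitespace character...)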
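
A minimal sketch of that suggestion, assuming the description comes back padded with newlines and indentation (the sample string below is made up to mimic that). It also shows the built-in String.prototype.trim(), which should give the same result for this input.

```javascript
// Hypothetical raw value, shaped like the text node in the question's HTML.
const raw = '\n         Here will be my description content.\n         ';

// The regular expression from the comment: strip leading and trailing whitespace.
const viaRegex = raw.replace(/(^\s+|\s+$)/g, '');

// The built-in alternative.
const viaTrim = raw.trim();

console.log(JSON.stringify(viaRegex)); // "Here will be my description content."
console.log(viaRegex === viaTrim);     // true
```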