Scrapy-like tool for Node.js? [closed]

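【Question title】: Scrapy like tool for Nodejs? [closed]
【Posted】: 2014-10-30 11:44:54
【Question description】:

I would like to know whether there is something like Scrapy for nodejs. If not, what do you think of simply downloading the page and parsing it with cheerio? Is there a better way?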

【Question discussion】:

Tags: javascript node.js web-scraping scrapy cheerio

【Solution 1】:

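Scrapy is a Python crawling framework built around asynchronous IO (it runs on Twisted). Node has nothing quite like it largely because all IO in Node is already asynchronous (unless you explicitly need it not to be).

Here is what a Scrapy script might look like in Node; note that the URLs are processed concurrently.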

const cheerio = require('cheerio');
const axios = require('axios');

const startUrls = ['http://www.google.com/', 'http://www.amazon.com/', 'http://www.wikipedia.com/']

// this might be called a "middleware" in scrapy.
const get = async url => {
  const response = await axios.get(url)
  return cheerio.load(response.data)
}

// this too.
const output = item => {
  console.log(item)
}

// here is parse which is the initial scrapy callback
const parse = async url => {
  const $ = await get(url)
  output({url, title: $('title').text()})
}

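// and here is the main execution: every request starts at once;
// the catch keeps one failed URL from becoming an unhandled rejection
startUrls.map(url => parse(url).catch(err => console.error(url, err.message)))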
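
Starting every request at once, as the snippet above does, is fine for three URLs but not for thousands; Scrapy caps in-flight requests with its CONCURRENT_REQUESTS setting. A minimal sketch of the same idea in plain Node might look like this. The names `crawlWithLimit`, `limit` and `fetchPage` are hypothetical and not part of any library, and the default fetcher assumes Node 18+, where `fetch` is built in.

```javascript
// Hypothetical helper: run `worker` over `urls` with at most `limit` in flight.
// Results keep the order of `urls`; a failed URL yields { url, error } instead
// of rejecting the whole batch.
async function crawlWithLimit(urls, worker, limit = 2) {
  const results = new Array(urls.length);
  let next = 0;

  // Each runner claims the next unprocessed index until the list is exhausted.
  async function runner() {
    while (next < urls.length) {
      const i = next++;
      try {
        results[i] = { url: urls[i], value: await worker(urls[i]) };
      } catch (error) {
        results[i] = { url: urls[i], error };
      }
    }
  }

  // Start up to `limit` runners and wait for all of them to drain the list.
  const runners = Array.from({ length: Math.min(limit, urls.length) }, runner);
  await Promise.all(runners);
  return results;
}

// Assumed default worker: download the page body with the built-in fetch (Node 18+).
const fetchPage = async url => (await fetch(url)).text();
```

For example, `crawlWithLimit(startUrls, fetchPage, 2)` would download the three start URLs with no more than two requests running at any moment.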

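【Discussion】:

【Solution 2】:

I have not seen a solution for crawling/indexing entire websites that is as powerful as Scrapy in Python, so personally I use Python Scrapy to crawl websites.

But for scraping data from individual pages there is casperjs in nodejs. It is a very cool solution. It also works for AJAX websites, e.g. angular-js pages. Scrapy on its own does not execute JavaScript, so it cannot parse AJAX-rendered pages. So for scraping data from one or a few pages, I prefer CasperJs.

Cheerio is indeed faster than casperjs, but it does not work with AJAX pages, and it does not give you as good a code structure as casperjs. So I prefer casperjs even in cases where the cheerio package would do.

CoffeeScript example: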

casper.start 'https://reports.something.com/login', ->
  this.fill 'form',
    username: params.username
    password: params.password
  , true

casper.thenOpen queryUrl, {method:'POST', data:queryData}, ->
  this.click 'input'

casper.then ->
  get = (number) =>
    value = this.fetchText("tr[bgcolor= '#AFC5E4'] >  td:nth-of-type(#{number})").trim()

【Discussion】:

【Solution 3】:

Exactly the same? No. But both powerful and simple? Yes: crawler. Quick example:

var Crawler = require("crawler");

var c = new Crawler({
    maxConnections : 10,
    // This will be called for each crawled page
    callback : function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            var $ = res.$;
            // $ is Cheerio by default
            //a lean implementation of core jQuery designed specifically for the server
            console.log($("title").text());
        }
        done();
    }
});

// Queue just one URL, with default callback
c.queue('http://www.amazon.com');

// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);

// Queue URLs with custom callbacks & parameters
c.queue([{
    uri: 'http://parishackers.org/',
    jQuery: false,

    // The global callback won't be called
    callback: function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            console.log('Grabbed', res.body.length, 'bytes');
        }
        done();
    }
}]);

// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
    html: '<p>This is a <strong>test</strong></p>'
}]);

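【Discussion】:

【Solution 4】:

Some crawling functionality can be achieved with Google's Puppeteer. According to the documentation:

Most things that you can do manually in the browser can be done using Puppeteer! Here are a few examples to get you started:

Generate screenshots and PDFs of pages.
Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
Automate form submission, UI testing, keyboard input, etc.
Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
Capture a timeline trace of your site to help diagnose performance issues.
Test Chrome Extensions.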
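
As a rough illustration of the SPA-crawling item in the list above, the sketch below loads a page in headless Chrome and reads its title after client-side scripts have run. It assumes the `puppeteer` package is installed; `scrapeTitle` and its `launch` parameter are hypothetical names introduced here so that the browser can be swapped out in tests.

```javascript
// Hypothetical helper: open `url` in a browser and return the rendered <title>.
// `launch` defaults to Puppeteer's launcher; the require is deferred so this
// file still loads where puppeteer is not installed.
async function scrapeTitle(url, launch = () => require('puppeteer').launch()) {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    // 'networkidle0' waits until there have been no network connections for
    // at least 500 ms, giving AJAX-driven pages a chance to finish rendering.
    await page.goto(url, { waitUntil: 'networkidle0' });
    return await page.title();
  } finally {
    // Always close the browser, even if navigation fails.
    await browser.close();
  }
}
```

Calling `scrapeTitle('http://www.amazon.com')` would resolve to the page title once the network has gone quiet; the same pattern extends to `page.evaluate` for pulling other data out of the rendered DOM.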

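【Discussion】:

【Solution 5】:

Just in case you still need an answer: https://www.npmjs.org/package/scrapy. I have never tested it, but I think it could help. Happy scraping.

【Discussion】:

This module cannot be configured. It only returns company names and phone numbers. I found a possible solution, although it does not perform as well as Scrapy: with Cheerio you can manipulate the page just as you would with jQuery.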