Retrieve the HTML content of a page several seconds after it has loaded

【Question title】: Retrieve html content of a page several seconds after it's loaded
【Posted】: 2017-05-08 00:13:19
【Question】:

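I'm writing a Node.js script to automatically retrieve data from an online directory. I've never done this before, so I chose JavaScript because it's the language I use every day.

With a few tips I found on Google, I was able to use request with Cheerio to easily access the page's DOM elements. I found and retrieved all the information I need. The only missing step is getting the link to the next page, but that link is generated 4 seconds after the page loads and contains a hash, so I can't skip this step.

What I'd like to do is get the page's DOM 4-5 seconds after it loads, so that I can retrieve the link.

I've seen a lot of suggestions online to use PhantomJS for this, but after many attempts I couldn't get it working with Node.

Here is my code: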

#!/usr/bin/env node
require('babel-register');
import request from 'request'
import cheerio from 'cheerio'
import phantom from 'node-phantom'

phantom.create(function(err,ph) {

  return ph.createPage(function(err,page) {

    return page.open(url, function(err,status) {

      console.log("opened site? ", status);
      page.includeJs('http://ajax.googleapis.com/ajax/libs/jquery/1.7.2/jquery.min.js', function(err) {

        //jQuery Loaded.
        //Wait for a bit for AJAX content to load on the page. Here, we are waiting 5 seconds.

        setTimeout(function() {

          return page.evaluate(function() {

            var tt = cheerio.load($this.html())
            console.log(tt)

          }, function(err,result) {

            console.log(result);
            ph.exit();

          });

        }, 5000);

      });
    });
  });
});

But I get this error:

return ph.createPage(function (page) { ^

TypeError: ph.createPage is not a function

Is what I'm doing the best way to achieve what I want? If not, what is the simplest way? If so, where does my error come from?

【Question comments】:

Tags: javascript jquery node.js phantomjs cheerio

【Solution 1】:

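If you don't have to use PhantomJS, you can do this with nightmare.

It's a very neat library for solving problems like yours. It uses Electron as its web browser, and you can run it with or without a visible window (you can also open the developer tools, just like in Google Chrome).

It has only one drawback: if you want to run it on a server without a graphical interface, you must install at least a framebuffer.

Nightmare has methods such as wait(cssSelector), which waits until a given element appears on the page.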
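To show why waiting for a condition tends to be more robust than a fixed 4-5 second setTimeout, here is a minimal sketch of the polling idea in plain Node.js. The waitFor helper and the simulated page object are hypothetical names made up for this illustration; this is not Nightmare's actual implementation, only the general technique.

```javascript
// Hypothetical helper (not part of Nightmare): poll `check` until it returns
// a truthy value, or reject once `timeout` milliseconds have passed.
function waitFor(check, { timeout = 5000, interval = 100 } = {}) {
    const deadline = Date.now() + timeout;
    return new Promise((resolve, reject) => {
        const poll = () => {
            let value;
            try {
                value = check();
            } catch (error) {
                reject(error);
                return;
            }
            if (value) {
                resolve(value);
            } else if (Date.now() >= deadline) {
                reject(new Error(`condition not met within ${timeout} ms`));
            } else {
                setTimeout(poll, interval);
            }
        };
        poll();
    });
}

// Simulated page (an assumption for the demo): the "next" link only becomes
// available 300 ms after "load", like the delayed link in the question.
const page = { nextHref: null };
setTimeout(() => { page.nextHref = '/page/2#abc123'; }, 300);

waitFor(() => page.nextHref, { timeout: 2000, interval: 50 })
    .then(href => console.log('next page:', href))
    .catch(error => console.error(error.message));
```

The promise resolves as soon as the value shows up instead of always sleeping for the full delay, and it fails with a clear error if the value never appears.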

Your code would look something like this:

const Nightmare = require('nightmare');
const nightmare = Nightmare({
    show: true, // will show browser window
    openDevTools: true // will open dev tools in browser window 
});

const url = 'http://hakier.pl';
const selector = '#someElementSelectorWhichWillAppearAfterSomeDelay';

nightmare
    .goto(url)
    .wait(selector) // wait until the element exists on the page
    // The evaluate callback runs in the browser scope, not in Node.js, so it
    // has no access to Node.js variables. Anything it needs (here, selector)
    // must be passed in as an extra argument to evaluate().
    .evaluate(selector => {
        return {
            nextPage: document.querySelector(selector).getAttribute('href')
        };
    }, selector)
    .then(extracted => {
        console.log(extracted.nextPage); // the data returned from evaluate
    })
    .catch(error => {
        console.error('Extraction failed:', error);
    });
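The scoping note in the code above (the evaluate callback runs in the browser, not in Node.js) can be illustrated without a browser. The sketch below is a simulation under a stated assumption: that the callback is shipped to the other context as source text, which is roughly how such libraries work. runInFreshScope and demo are hypothetical names for this example, not Nightmare APIs, and real argument passing involves serialization that this sketch skips.

```javascript
// Simulate shipping a callback to another context: only its source text
// travels, so variables captured from the enclosing scope are lost.
function runInFreshScope(fn, ...args) {
    // new Function builds the function in the global scope, so the rebuilt
    // copy cannot see any local variables of the code that defined `fn`.
    const rebuilt = new Function(`return (${fn.toString()});`)();
    return rebuilt(...args);
}

function demo() {
    const selector = 'a.next'; // local variable, standing in for a Node.js variable

    let closureResult;
    try {
        // Relies on the closure: fails once the function is rebuilt elsewhere.
        closureResult = runInFreshScope(() => selector.length);
    } catch (error) {
        closureResult = error.name;
    }

    // Passes the value explicitly as an argument: works.
    const argumentResult = runInFreshScope(sel => sel.length, selector);

    return { closureResult, argumentResult };
}

console.log(demo()); // { closureResult: 'ReferenceError', argumentResult: 6 }
```

This is why selector is passed as the second argument to evaluate() rather than simply referenced inside the callback.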

Happy hacking!

【Comments】:

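I can't do what I want with nightmare on this link: pagesjaunes.fr/annuaire/…. I want to get the href of the element with the classes "link_pagination next pj-lb pj-link", but the link is "#" until you click the tag with those class names. So how can I get the link without changing pages? (The button for the next page is a span.)
You can click it and then read the current page URL: nightmare.goto(url).click('#pagination.next').url().then(url => console.log('newUrl ', url));
Yes, I tried something similar, but I get the same URL with a '#' at the end, so it doesn't work. :/ Thanks for your help, I'm not used to web scraping.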