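Grab the resource contents in CasperJS or PhantomJS: answers

【Question title】: Grab the resource contents in CasperJS or PhantomJS
【Posted】: 2012-07-17 21:57:16
【Question description】:

I see that CasperJS has a "download" function and an "on resource received" callback, but I do not see the contents of a resource in the callback, and I don't want to download the resource to the filesystem.

I want to grab the contents of the resource so that I can use it in my script. Is this possible with CasperJS or PhantomJS?

【Discussion】:

Tags: phantomjs casperjs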

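【Solution 1】:

This problem has been in my way for the last couple of days. The proxy solution wasn't very clean in my environment, so I found out where phantomjs's QTNetworking core puts resources when it caches them.

Long story short, here is my gist. You need the cache.js and mimetype.js files: https://gist.github.com/bshamric/4717583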

//for this to work, you have to call phantomjs with the cache enabled:
//usage:  phantomjs --disk-cache=true test.js

var page = require('webpage').create();
var fs = require('fs');
var cache = require('./cache');
var mimetype = require('./mimetype');

//this is the path that the QTNetwork classes use for caching files for their http client
//the path should be the one that has 16 folders labeled 0,1,2,3,...,F
cache.cachePath = '/Users/brandon/Library/Caches/Ofi Labs/PhantomJS/data7/';

var url = 'http://google.com';
page.viewportSize = { width: 1300, height: 768 };

//when the resource is received, go ahead and include a reference to it in the cache object
page.onResourceReceived = function(response) {
    //I only cache images, but you can change this
    //contentType can be missing on some responses, so check it before calling indexOf
    if(response.contentType && response.contentType.indexOf('image') >= 0)
    {
        cache.includeResource(response);
    }
};

//when the page is done loading, go through each cachedResource and do something with it,
//I'm just saving them to a file
page.onLoadFinished = function(status) {
    //declare the loop variable so it does not leak into the global scope
    for(var index in cache.cachedResources) {
        var resource = cache.cachedResources[index];
        var file = resource.cacheFileNoPath;
        var ext = mimetype.ext[resource.mimetype];
        var finalFile = file.replace("."+cache.cacheExtension,"."+ext);
        fs.write('saved/'+finalFile,resource.getContents(),'b');
    }
};

page.open(url, function () {
    page.render('saved/google.pdf');
    phantom.exit();
});

Then when you call phantomjs, just make sure the cache is enabled:

phantomjs --disk-cache=true test.js

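Some notes: I wrote this to get the images on a page without using a proxy or taking a low-res snapshot. QT compresses certain text resources, so if you use this for text files you will have to deal with the decompression yourself. Also, I ran a quick test to pull in html resources, and it didn't parse the http headers out of the result. Still, this is useful to me and hopefully someone else will find it so. Modify it if you have problems with a specific content type.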
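One detail worth guarding against in the `onResourceReceived` callback: PhantomJS reports each resource more than once, with the response's `stage` set to `"start"` and later `"end"`, and `contentType` is not always present. The sketch below is a hypothetical helper (`shouldCacheResource` is not part of the gist) that accepts only the final notification for a given content-type prefix. It uses a prefix match rather than the `indexOf('image') >= 0` test above, which is an assumption about what you want to match. The sample objects are plain stand-ins for PhantomJS response objects.

```javascript
// Hypothetical helper (not part of the gist): decide whether a response
// object passed to page.onResourceReceived should be handed to the cache.
function shouldCacheResource(response, typePrefix) {
    // Some responses carry no content type, so guard before using it.
    if (!response || typeof response.contentType !== 'string') {
        return false;
    }
    // Each resource is reported at stage "start" and again at stage "end".
    // Only the final notification is accepted, to avoid duplicates.
    if (response.stage !== 'end') {
        return false;
    }
    return response.contentType.toLowerCase().indexOf(typePrefix) === 0;
}

// Plain objects standing in for PhantomJS response objects (made-up URLs).
var samples = [
    { url: 'http://example.com/a.png', stage: 'start', contentType: 'image/png' },
    { url: 'http://example.com/a.png', stage: 'end', contentType: 'image/png' },
    { url: 'http://example.com/app.js', stage: 'end', contentType: 'application/javascript' },
    { url: 'http://example.com/redirect', stage: 'end', contentType: null }
];

var accepted = samples.filter(function (response) {
    return shouldCacheResource(response, 'image/');
});

console.log(accepted.map(function (response) { return response.url; }));
```

In the script above, the body of the callback would then become `if (shouldCacheResource(response, 'image/')) { cache.includeResource(response); }`.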

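【Discussion】:

How did you do the decompression?
Would love to know how you did the decompression. Did you figure it out?
You, sir, are a soldier. Thank you.
Not working any more, phantomjs uses sqlite for caching
Looks good, but in my case I have a dynamic page that is generated by multiple calls to the same url with different POST parameters: first it returns the html container, then a PDF file, then images, ... cache.js getUrlCacheFilename() always seems to return the same cache file name (7/3kjh55ig.d), which does not actually exist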

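【Solution 2】:

I've found that until phantomjs matures a bit, this is a bit of a headache for them, according to issue 158: http://code.google.com/p/phantomjs/issues/detail?id=158

So you want to do it anyway? I opted to go a level higher to accomplish this and grabbed PyMiProxy from https://github.com/allfro/pymiproxy. I downloaded, installed, and set it up, then took their example code and turned it into this in proxy.py: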

from miproxy.proxy import RequestInterceptorPlugin, ResponseInterceptorPlugin, AsyncMitmProxy
from mimetools import Message
from StringIO import StringIO

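class DebugInterceptor(RequestInterceptorPlugin, ResponseInterceptorPlugin):

    def do_request(self, data):
        # Blank out Accept-Encoding so the server replies uncompressed
        data = data.replace('Accept-Encoding: gzip\r\n', 'Accept-Encoding:\r\n', 1)
        return data

    def do_response(self, data):
        #print '<< %s' % repr(data[:100])
        # Drop the status line, then parse the remaining header block
        status_line, headers_alone = data.split('\r\n', 1)
        headers = Message(StringIO(headers_alone))
        # Use a default so responses without a Content-Type do not raise KeyError
        content_type = headers.get('content-type', '')
        print "Content type: %s" % content_type
        if content_type == 'text/x-comma-separated-values':
            # Writes the raw response, headers included, and closes the file
            with open('data.csv', 'wb') as f:
                f.write(data)
        print ''
        return data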

if __name__ == '__main__':
    proxy = AsyncMitmProxy()
    proxy.register_interceptor(DebugInterceptor)
    try:
        proxy.serve_forever()
    except KeyboardInterrupt:
        proxy.server_close()

Then I fire it up:

python proxy.py

Next I run phantomjs with the proxy specified...

phantomjs --ignore-ssl-errors=yes --cookies-file=cookies.txt --proxy=127.0.0.1:8080 --web-security=no myfile.js

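You may want to turn your security settings back on; that was unnecessary for me because I'm scraping just one source. You should now see a bunch of text flowing through your proxy console, and if it lands on something with the mime type "text/x-comma-separated-values" it will be saved as data.csv. This also saves all the headers and everything else, but if you've come this far I'm sure you can figure out how to pop those off.

One other detail: I found that I had to disable gzip encoding. I could use zlib to decompress gzipped data from my own apache webserver, but if it came from IIS or the like the decompression raised errors, and I'm not sure about that part.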
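As a minimal sketch of popping the headers off such a saved response, the function below splits a raw HTTP response string at the first blank line. The name `splitHttpResponse` and the sample response are made up for illustration. It assumes an uncompressed, non-chunked, text body, which matches the setup above where gzip was disabled, and it is written in plain JavaScript so it could be reused from a script rather than inside the Python proxy.

```javascript
// Hypothetical helper: separate the status line, headers and body of a raw
// HTTP response held in a single string.
function splitHttpResponse(raw) {
    // The header block ends at the first empty line (CRLF CRLF).
    var boundary = raw.indexOf('\r\n\r\n');
    var head = boundary === -1 ? raw : raw.slice(0, boundary);
    var body = boundary === -1 ? '' : raw.slice(boundary + 4);

    var lines = head.split('\r\n');
    var statusLine = lines.shift();

    // Header names are case-insensitive, so store them lower-cased.
    // A repeated header name overwrites the earlier value in this sketch.
    var headers = {};
    lines.forEach(function (line) {
        var colon = line.indexOf(':');
        if (colon > 0) {
            var name = line.slice(0, colon).trim().toLowerCase();
            headers[name] = line.slice(colon + 1).trim();
        }
    });

    return { statusLine: statusLine, headers: headers, body: body };
}

// Made-up sample response resembling what the proxy writes to data.csv.
var rawResponse = 'HTTP/1.1 200 OK\r\n' +
    'Content-Type: text/x-comma-separated-values\r\n' +
    'Content-Length: 10\r\n' +
    '\r\n' +
    'a,b\r\n1,2\r\n';

var parsed = splitHttpResponse(rawResponse);
console.log(parsed.statusLine);
console.log(parsed.headers['content-type']);
console.log(JSON.stringify(parsed.body));
```

This does not handle chunked transfer encoding, compressed bodies, or binary payloads; those would need to be processed as bytes rather than as a string.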

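So my power company won't offer me an API? Fine! We'll do it the hard way!

【Discussion】:

【Solution 3】:

I did not realize I could get the source from the document object like this: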

casper.start(url, function() {
    var js = this.evaluate(function() {
        return document; 
    }); 
    this.echo(js.all[0].outerHTML); 
});

More info here.

【Discussion】:

【Solution 4】:

You can use Casper.debugHTML() to print out the contents of an HTML resource:

var casper = require('casper').create();

casper.start('http://google.com/', function() {
    this.debugHTML();
});

casper.run();

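You can also store the HTML contents in a var using casper.getPageContent(): http://casperjs.org/api.html#casper.getPageContent (available in the latest master)

【Discussion】:

Thanks NiKo, I guess I wasn't clear, but I am looking for all the other resources, not the html page. I would like to store the contents of external css or js files in a var. Is that possible?
Just make sure you set the correct protocol (i.e. http vs https). It took me a while to figure out that the site I was trying to open was redirecting from http to https, and that choked casperjs (a bug?)
@iwek See this link for more info on how to save resources to disk: stackoverflow.com/questions/24582307/… (answer by stackoverflow.com/users/1816580/artjom-b)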