【Question title】: Splash + Scrapoxy: x-cache-proxyname header is missing
【Posted】: 2018-04-18 20:00:03
【Question】:

I am using the following infrastructure to scrape a website:

Scrapy <--> Splash <--> Scrapoxy <--> web site

I send requests through the Splash execute endpoint, using a Lua script like this:

function main(splash)
    local host = "..."
    local port = "..."
    local username = "..."
    local password = "..."

    splash:on_request(function (request)
        request:set_proxy{host, port, username=username, password=password}
    end)

    splash:go(splash.args.url)
    return splash:html()
end

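I want to detect bans and remove the banned proxies. According to the Scrapoxy documentation:

Scrapoxy adds an HTTP header x-cache-proxyname to the response.

But I don't see this header in response.headers. The only headers are: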

{b'Content-Type': b'text/html; charset=utf-8',
 b'Date': b'Wed, 18 Apr 2018 19:02:21 GMT',
 b'Server': b'TwistedWeb/16.1.1'}

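What am I doing wrong? Do I need to add something to the Lua script so that the headers are returned correctly?


Update: Actually, this does not seem to be a Splash problem. Scrapoxy does not return x-cache-proxyname even when it is queried directly through HTTPie: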

http -v --proxy=https:http://<user>:<password>@<scrapoxy-server>:8888 https://<site>

GET / HTTP/1.1
User-Agent: HTTPie/0.9.9
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Host: <site>


HTTP/1.1 200 OK
Server: nginx
Date: Thu, 28 Jun 2018 08:14:26 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Set-Cookie: <...>
X-Powered-By: Express
ETag: W/"5a31b-faPJ7bjKH24S/3EvHU/8IoJHyxw"
Vary: Cookie, User-Agent
Content-Security-Policy: default-src https:; child-src https:; connect-src https: wss:; form-action https:; frame-ancestors https: http://webvisor.com; media-src https:; object-src https:; img-src https: data: blob:; script-src https: data: 'unsafe-inline' 'unsafe-eval'; style-src https: 'unsafe-inline'; font-src https: data:; report-uri /ajax/csp-report/
Content-Encoding: gzip

【Comments】:

    Tags: python scrapy scrapy-splash splash-js-render


    【Answer 1】:

    I managed to get x-cache-proxyname with this Lua script:

    function main(splash)
        local host = "..."
        local port = "..."
        local username = "..."
        local password = "..."
        local proxy = ""

        -- Route every request through Scrapoxy.
        splash:on_request(function (request)
            request:set_proxy{host, port, username=username, password=password}
        end)

        -- Remember the proxy name reported by Scrapoxy. The callback fires for
        -- every response, so the last response carrying the header wins.
        splash:on_response_headers(function (response)
            proxy = response.headers["x-cache-proxyname"] or proxy
        end)

        splash.images_enabled = false
        splash:go(splash.args.url)

        -- Expose the proxy name as a header of the Splash result.
        splash:set_result_header("x-cache-proxyname", proxy)
        return splash:html()
    end
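
On the Scrapy side, the header forwarded by splash:set_result_header can then be read from response.headers. The following is a minimal sketch, assuming the headers arrive as a mapping with bytes or str keys, as in the dump shown in the question. get_proxy_name is a hypothetical helper name and is not part of Scrapy, Splash, or Scrapoxy.

```python
from typing import Mapping, Optional


def get_proxy_name(headers: Mapping, header: str = "x-cache-proxyname") -> Optional[str]:
    """Return the proxy name found in `headers`, or None if it is absent or empty.

    Hypothetical helper: accepts bytes or str keys and values, and also
    values wrapped in a list, and matches the header name case-insensitively.
    """
    wanted = header.lower()
    for key, value in headers.items():
        name = key.decode("latin-1") if isinstance(key, bytes) else str(key)
        if name.lower() != wanted:
            continue
        # Some header containers keep several values per name; use the last one.
        if isinstance(value, (list, tuple)):
            if not value:
                return None
            value = value[-1]
        text = value.decode("latin-1") if isinstance(value, bytes) else str(value)
        # The Lua script starts with proxy = "", so an empty value means "unknown".
        return text or None
    return None
```

An empty value is treated as missing because the Lua script initialises proxy to an empty string before any response has been seen.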
    

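    Update: when the target site is served over HTTPS, Scrapoxy cannot modify the response headers, so it cannot add x-cache-proxyname to the response. In that case the proxy only tunnels the encrypted traffic and never sees the headers.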
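
Once the proxy name is known (over plain HTTP, given the limitation above), the ban handling the question asks about could look roughly like the sketch below. Everything here is an assumption to check against the Scrapoxy documentation for your version: the set of status codes treated as a ban, the commander endpoint path /instances/stop, the JSON body {"name": ...}, and the base64-encoded password in the Authorization header. The function names are hypothetical. The sketch only builds the request and does not send it.

```python
import base64
import json
import urllib.request
from typing import Optional

# Assumption: which status codes count as a ban depends on the target site.
BAN_STATUSES = frozenset({403, 429, 503})


def is_banned(status: int) -> bool:
    """Return True if the HTTP status is one we treat as a ban (assumed list)."""
    return status in BAN_STATUSES


def build_stop_request(
    commander_url: str, password: str, status: int, proxy_name: Optional[str]
) -> Optional[urllib.request.Request]:
    """Build a request asking the Scrapoxy commander to stop a banned instance.

    Returns None when the response is not a ban or the proxy name is unknown,
    for example for HTTPS targets where the header is not available.
    """
    if not is_banned(status) or not proxy_name:
        return None
    body = json.dumps({"name": proxy_name}).encode("utf-8")
    # Assumed auth scheme: the commander password, base64-encoded.
    token = base64.b64encode(password.encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=commander_url.rstrip("/") + "/instances/stop",  # assumed endpoint
        data=body,
        method="POST",
        headers={"Authorization": token, "Content-Type": "application/json"},
    )
```

The returned object can be passed to urllib.request.urlopen once the endpoint and credentials have been confirmed. Keeping the construction separate from the network call makes the decision logic easy to test.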

    【Comments】:
