【Posted】:2021-07-24 10:34:38
【Question】:
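I have been scraping websites with curl for years, but I've run into a site that behaves in a way I don't understand. Inspecting the response, I can see that it sits behind Cloudflare, and from what I've found it uses some mechanism that blocks bots but lets real users through.
I don't understand how it does this, because it returns a 403 before anything else has been sent, yet the same request from Chrome works fine.
I have also tested the "Copy as cURL (bash)" command from Chrome's inspector on the command line, with the same result.
This is the code I'm using: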
$headers=array(
'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'accept-language: es-ES,es;q=0.9',
'upgrade-insecure-requests: 1',
//'Referrer Policy: strict-origin-when-cross-origin',
//'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
);
$agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36";
$url="https://www.pccomponentes.com/";
//$agent= 'Mozilla/5.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
$agent = 'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)';
$ch = curl_init();
//curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch, CURLOPT_VERBOSE, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
//curl_setopt($ch, CURLOPT_HEADER, 0);
//curl_setopt($ch, CURLOPT_POST, 0);
//curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
//curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
//curl_setopt($ch, CURLOPT_MAXREDIRS, 20);
curl_setopt($ch, CURLOPT_USERAGENT, $agent);
//curl_setopt($ch, CURLOPT_LOW_SPEED_LIMIT, 1);
//curl_setopt($ch, CURLOPT_LOW_SPEED_TIME, 360);
//curl_setopt($ch, CURLOPT_IGNORE_CONTENT_LENGTH, 1);
//curl_setopt($ch, CURLOPT_TCP_NODELAY, 1);
curl_setopt($ch, CURLOPT_HTTPHEADER,$headers);
curl_setopt($ch, CURLOPT_URL,$url);
$result=curl_exec($ch);
echo "code: ".curl_getinfo($ch,CURLINFO_HTTP_CODE ).PHP_EOL;
//echo $result;
As you can see from the commented-out lines, I have tried many different solutions, user agents and curl options, but I always get a 403.
The curl command line (sh) is:
curl -I -vvv 'https://www.pccomponentes.com/' \
-H 'authority: www.pccomponentes.com' \
-H 'sec-ch-ua: " Not A;Brand";v="99", "Chromium";v="90", "Google Chrome";v="90"' \
-H 'sec-ch-ua-mobile: ?0' \
-H 'upgrade-insecure-requests: 1' \
-H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36' \
-H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
-H 'sec-fetch-site: none' \
-H 'sec-fetch-mode: navigate' \
-H 'sec-fetch-user: ?1' \
-H 'sec-fetch-dest: document' \
-H 'accept-language: es-ES,es;q=0.9' \
--compressed
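To check in Google Chrome, I open an incognito window with no cookies at all, open the inspector, and enter the URL.
The output of the script (the command-line curl gives the same) is: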
* Trying 104.16.162.71:443...
* TCP_NODELAY set
* Connected to www.pccomponentes.com (104.16.162.71) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
* CAfile: /etc/ssl/certs/ca-certificates.crt
CApath: /etc/ssl/certs
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
* subject: C=US; ST=CA; L=San Francisco; O=Cloudflare, Inc.; CN=sni.cloudflaressl.com
* start date: Aug 11 00:00:00 2020 GMT
* expire date: Aug 11 12:00:00 2021 GMT
* subjectAltName: host "www.pccomponentes.com" matched cert's "*.pccomponentes.com"
* issuer: C=US; O=Cloudflare, Inc.; CN=Cloudflare Inc ECC CA-3
* SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0xaaab008552b0)
> GET /listado/ajax?idShops%5B%5D=0&page=0&order=price-desc&gtmTitle=Tarjetas%20Gr%C3%A1ficas&idFamilies%5B%5D=6 HTTP/2
> Host: www.pccomponentes.com
> user-agent: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
> accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
> accept-language: es-ES,es;q=0.9
> upgrade-insecure-requests: 1
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
< HTTP/2 403
< date: Sat, 01 May 2021 09:28:32 GMT
< content-type: text/html; charset=UTF-8
< cf-chl-bypass: 1
< set-cookie: __cfduid=db6d6b293bbc3a77fe7f7b90ec55cebc31619861312; expires=Mon, 31-May-21 09:28:32 GMT; path=/; domain=.pccomponentes.com; HttpOnly; SameSite=Lax
< cache-control: private, max-age=0, no-store, no-cache, must-revalidate, post-check=0, pre-check=0
< expires: Thu, 01 Jan 1970 00:00:01 GMT
< x-frame-options: SAMEORIGIN
< cf-request-id: 09c8db2a8c0000611f910c2000000001
< expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
< server: cloudflare
< cf-ray: 6487faf0d82d611f-BCN
<
* Connection #0 to host www.pccomponentes.com left intact
code: 403
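The response headers above (`server: cloudflare`, a `cf-ray` ID, and a 403 status) suggest the block is issued at Cloudflare's edge rather than by the origin server. A minimal sketch of that check follows. The function name `looks_like_cloudflare_block` is hypothetical, and the choice of headers to look for is my own assumption based on this one log, not a documented Cloudflare contract.

```shell
# Hypothetical helper: reads raw response headers on stdin (for example from
# `curl -sI`) and reports whether they look like a 403 issued by Cloudflare.
# Assumption: "403 + server: cloudflare + cf-ray" is treated as an edge block.
looks_like_cloudflare_block() {
  local headers status
  # Strip carriage returns and lowercase everything so matching is case-insensitive.
  headers=$(tr -d '\r' | tr '[:upper:]' '[:lower:]')
  # The status line looks like "http/2 403" or "http/1.1 403 forbidden".
  status=$(printf '%s\n' "$headers" | awk 'NR==1 {print $2}')
  if [ "$status" = "403" ] \
     && printf '%s\n' "$headers" | grep -q '^server: cloudflare' \
     && printf '%s\n' "$headers" | grep -q '^cf-ray:'; then
    echo "cloudflare-block"
  else
    echo "other (status ${status:-unknown})"
  fi
}
```

It could be used as `curl -sI 'https://www.pccomponentes.com/' | looks_like_cloudflare_block`. It only inspects the first status line, so it is not meant for responses that follow redirects.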
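I have been searching for information about this message:
- old SSL session ID is stale, removing
but with no luck.
What kind of protection is it using? I have read something about JavaScript challenges, but no JS is even loaded, since the response is already a 403. I have also seen comments about a captcha, but that can't happen before anything has been sent. Chrome gets a 200 and curl gets a 403.
I have also tried HTTP/1.1, different encodings, gzip, etc., with no luck at all.
They seem to have changed their system recently. Any hints are welcome.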
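For reference, the HTTP/1.1 and gzip variations mentioned above can be run in one pass with something like the sketch below. The function name `probe_variants` is hypothetical and the set of options is only an illustration of the kind of test described, not an exhaustive list. It uses only standard curl flags (`-s`, `-o`, `-w '%{http_code}'`, `-A`, `--http1.1`, `--http2`, `--compressed`).

```shell
# Hypothetical helper: request the same URL under a few curl variations and
# print the option used together with the HTTP status code it produced.
probe_variants() {
  local url ua opt code
  url=$1
  # Same Chrome user agent string as in the command above.
  ua='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36'
  for opt in --http1.1 --http2 --compressed; do
    # -s silences progress output, -o discards the body, -w prints only the status code.
    code=$(curl -s -o /dev/null -w '%{http_code}' -A "$ua" "$opt" "$url")
    printf '%s %s\n' "$opt" "$code"
  done
}
```

Calling `probe_variants 'https://www.pccomponentes.com/'` prints one line per option. If every line shows the same status, that would suggest the result does not depend on these particular options.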
【Comments】:
- Hi, just to make sure I understand correctly: you get a 200 in Chrome, but when you use "Copy as cURL" in Chrome DevTools and run the resulting curl command, you get a 403?
Tags: php curl cloudflare