PHP curl web crawling suddenly fails: answers

【Question title】: PHP curl web crawling suddenly fails
【Posted】: 2015-02-27 17:13:36
【Question】:

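I used to be able to crawl a newspaper website successfully, but today it failed.

However, I can still open the site in Firefox; the failure only happens with curl. That suggests my IP is still allowed and has not been banned.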

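This is the error the site returns:

Please enable cookies.

Error 1010 Ray ID: 1a17d04d7c4f8888

Access denied

What happened?

The owner of this website (www1.hkej.com) has banned your access based on your browser's signature (1a17d04d7c4f8888-ua45).

CloudFlare Ray ID: 1a17d04d7c4f8888 • Your IP: 2xx.1x.1xx.2xx • Performance & security by CloudFlare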

This is the code I have been using:

// Two separate cookie files are needed because curl overwrites the old file when it stores cookies.
// The cookie files are stored under the Apache folder.
$cookieMain = "cookieHKEJ.txt";
$cookieMobile = "cookieMobile.txt";
$agent = "User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0";

// submit a login
function cLogin($url, $post, $agent, $cookiefile, $referer) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);     // follow redirects automatically (this option is a boolean, not a redirect count)
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);     // return the response as a string instead of printing it
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);        // spoof the user agent so the request looks like it comes from a browser
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookiefile);   // file in which curl STORES cookies
    curl_setopt($ch, CURLOPT_POST, true);               // tell curl that we are posting data
    curl_setopt($ch, CURLOPT_POSTFIELDS, $post);        // the data to post
    curl_setopt($ch, CURLOPT_REFERER, $referer);        // send the given Referer header

    $output = curl_exec($ch);       // execute
    curl_close($ch);

    return $output;
}    

$input = cDisplay("http://www1.hkej.com/dailynews/toc", $agent, $cookieMain);
echo $input;

How can I make curl pass as a browser? Am I missing some parameters?

【Comments】:

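It clearly says access denied. You are not supposed to crawl the site without their permission, and you just got banned.
But I can open "www1.hkej.com/dailynews/toc" in Firefox without any problem. It only happens with curl.
That is because the block is based on the browser signature (which may be built from several parameters). Your default Firefox has a different signature from curl.
So is there a curl parameter I am missing that keeps curl from passing as a browser?
They banned your IP because they do not want you to crawl their site. To crawl the page now, you may need to change your IP.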

Tags: php curl web-crawler cloudflare

【Solution 1】:

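The site owner is using Cloudflare's security features to stop you from crawling their site; your requests have most likely been flagged as coming from a malicious bot. They would do this based on your user agent and IP address.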

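Try changing your IP (if you are a home user, try restarting your router; that sometimes gets you a different IP address). Try using a proxy, and try sending different headers with curl.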
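As a rough sketch of the "different headers" and "proxy" suggestions, the request could be built along the following lines. The helper names `browserLikeHeaders()` and `fetchPage()`, the header values, and the proxy format are illustrative assumptions, not something taken from the thread, and there is no guarantee that this alone satisfies Cloudflare's signature check.

```php
<?php
// Hypothetical helper (not from the thread): header lines roughly like
// those a desktop browser sends with a page request.
function browserLikeHeaders(string $language = 'en-US,en;q=0.5'): array
{
    return [
        'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language: ' . $language,
        'Connection: keep-alive',
    ];
}

// Hypothetical helper (not from the thread): GET a page with those headers,
// optionally through a proxy given as "host:port".
function fetchPage(string $url, string $userAgent, string $cookieFile, ?string $proxy = null)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);        // return the body as a string
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);        // follow redirects
    curl_setopt($ch, CURLOPT_USERAGENT, $userAgent);       // header value only, without a "User-Agent:" label
    curl_setopt($ch, CURLOPT_HTTPHEADER, browserLikeHeaders());
    curl_setopt($ch, CURLOPT_ENCODING, '');                // accept every encoding this curl build supports
    curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);      // write cookies to this file when the handle closes
    curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);     // read cookies from the same file
    if ($proxy !== null) {
        curl_setopt($ch, CURLOPT_PROXY, $proxy);           // route the request through the proxy
    }
    $body = curl_exec($ch);                                // string on success, false on failure
    curl_close($ch);
    return $body;
}

// Example call (commented out so the sketch makes no network request):
// $html = fetchPage("http://www1.hkej.com/dailynews/toc",
//                   "Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0",
//                   "cookieHKEJ.txt");
```

Setting both `CURLOPT_COOKIEJAR` and `CURLOPT_COOKIEFILE` lets cookies received on one request be sent back on the next, which matters when a site answers with "Please enable cookies".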

More importantly, they do not want people crawling their site and affecting their traffic, so you really should ask for permission first.


【Solution 2】:

As I said in my post, I can access the site with Firefox, and my IP is not banned. In the end, it worked after I changed the code from

$agent = "User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0";

to

$agent = $_SERVER['HTTP_USER_AGENT'];

Actually, I do not know why it started failing yesterday with the "User-Agent:" prefix in the string, when it had worked fine before.
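One plausible explanation, which the thread does not confirm: `CURLOPT_USERAGENT` expects only the header value, and curl adds the "User-Agent: " label itself. With the original string, the request would therefore carry `User-Agent: User-Agent: Mozilla/5.0 ...`, a value no real browser sends, and a signature-based rule could start rejecting it. The sketch below uses a hypothetical helper, `normalizeUserAgent()`, to strip the label. It also falls back to a fixed string, because `$_SERVER['HTTP_USER_AGENT']` is only set when the script itself is requested by a browser, not when it runs from the command line or cron.

```php
<?php
// Hypothetical helper (not from the thread): remove a leading "User-Agent:"
// label so that only the header value is handed to CURLOPT_USERAGENT.
function normalizeUserAgent(string $agent): string
{
    return trim((string) preg_replace('/^\s*User-Agent\s*:\s*/i', '', $agent));
}

$original = "User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:33.0) Gecko/20100101 Firefox/33.0";
$clean    = normalizeUserAgent($original);

// curl prepends the header name itself, so these are the header lines
// each value would produce on the wire:
echo "Original value sends: User-Agent: " . $original . PHP_EOL;
echo "Cleaned value sends:  User-Agent: " . $clean . PHP_EOL;

// Use the visiting browser's user agent when there is one, else the cleaned string.
$agent = $_SERVER['HTTP_USER_AGENT'] ?? $clean;
```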

Anyway, thanks.
