【Question title】: Web automation using PHP/cURL
【Posted】: 2012-01-10 19:44:12
【Question】:

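I want to access several pages of this list. My attempt is the code below, which fetches the XML file containing the hotel data for the first page, but I want to reach the pages that hold the rest of the hotels. How can I do that?

As you can imagine, the URL is the same for all the pages.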

<?php

//extract data from the post
extract($_POST);

//set POST variables
$url = 'http://www.turismovenezia.it/index.php';

$fields1 = array(
            'ajax'=>'searchEngineTopdata',
            'next_pair'=>'Dove Allogiare|*',
            'lang'=>'it');


$fields2 = array(

'ajax'=>'xmlSearchEngineResponder',
'xml' => "%3C%3Fxml%20version%3D%221.0%22%3F%3E%3CSearchRequest%20xmlns%3D%22http%3A%2F%2Fwww.liberologico.com%2Fdbsite%2Fjolly-search%22%20xmlns%3Axsi%3D%22http%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema-instance%22%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5B%2A%5D%5D%3E%3C%2FScope%3E%3CFilters%3E%3CFilters%20xsi%3Atype%3D%22FilterSpecType%22%3E%3CField%3Eaptve_territorio%3C%2FField%3E%3CValue%3E%3CSingleValue%3E%3C%21%5BCDATA%5B%2A%5D%5D%3E%3C%2FSingleValue%3E%3C%2FValue%3E%3CMode%3ETHESAURUS%3C%2FMode%3E%3COperation%3ELIKE%3C%2FOperation%3E%3C%2FFilters%3E%3CFilters%20xsi%3Atype%3D%22FilterSpecType%22%3E%3CField%3Efull_text_search%3C%2FField%3E%3CValue%3E%3CSingleValue%3E%3C%21%5BCDATA%5B%2A%5D%5D%3E%3C%2FSingleValue%3E%3C%2FValue%3E%3CMode%3EFREE_TEXT%3C%2FMode%3E%3COperation%3ELIKE%3C%2FOperation%3E%3C%2FFilters%3E%3CFilters%20xsi%3Atype%3D%22FilterSpecType%22%3E%3CField%3Elang%3C%2FField%3E%3CValue%3E%3CSingleValue%3E%3C%21%5BCDATA%5Bit%5D%5D%3E%3C%2FSingleValue%3E%3C%2FValue%3E%3CMode%3EFREE_TEXT%3C%2FMode%3E%3COperation%3EEQUAL%3C%2FOperation%3E%3C%2FFilters%3E%3C%2FFilters%3E%3CSubSearches%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BEventi%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BArte%20%26%20Cultura%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BMare%20%26%20Natura%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BPiatti%20%26%20Prodotti%20tipici%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BRelax%20%26%20Divertimento%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BDove%20Alloggiare%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BDove%20Mangiare%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BInformazioni%20Utili%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3C%2FSubSearches%3E%3C%2FSearch%3E%3CActiveResultSet%3E%3CTab%3E%3C%21%5BCDATA%5BDove%20Alloggiare%5D%5D%3E%
3C%2FTab%3E%3CFirstItem%3E0%3C%2FFirstItem%3E%3CPagerSize%3E10%3C%2FPagerSize%3E%3C%2FActiveResultSet%3E%3C%2FSearchRequest%3E",
'force' => 'false');

//open connection
$ch = curl_init();

//set the url, number of POST vars, POST data
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_POST, true);
curl_setopt($ch,CURLOPT_POSTFIELDS, $fields1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);

//set the url, number of POST vars, POST data
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_POST, true);
curl_setopt($ch,CURLOPT_POSTFIELDS, $fields2);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);

//execute post
$result = curl_exec($ch);

echo $result;

//close connection
curl_close($ch);

Javi

【Comments】:

Tags: php curl scripting


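【Answer 1】:

First of all, why are you configuring cURL twice? Both groups of options are set on the same handle, so the second group overwrites the first and only $fields2 is ever sent.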

//set the url, number of POST vars, POST data
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_POST, true);
curl_setopt($ch,CURLOPT_POSTFIELDS, $fields1);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);

//set the url, number of POST vars, POST data
curl_setopt($ch,CURLOPT_URL,$url);
curl_setopt($ch,CURLOPT_POST, true);
curl_setopt($ch,CURLOPT_POSTFIELDS, $fields2);
curl_setopt($ch,CURLOPT_RETURNTRANSFER,1);
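
One way to avoid the duplicated configuration is to wrap a single request in a function and call it once per payload. This is only a minimal sketch: `buildPostOptions` and `postForm` are hypothetical names that do not appear in the original code, and the error handling is an assumption about what you would want.

```php
<?php
// Hypothetical helper: builds the option set for exactly one POST request.
function buildPostOptions(string $url, array $fields): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $fields,
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
    ];
}

// Hypothetical helper: sends one POST and returns the response body.
function postForm(string $url, array $fields): string
{
    $ch = curl_init();
    curl_setopt_array($ch, buildPostOptions($url, $fields));

    $body = curl_exec($ch);
    if ($body === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException("cURL request failed: $error");
    }

    curl_close($ch);
    return $body;
}
```

With this in place, `$result = postForm($url, $fields2);` would replace the two option groups and the `curl_exec` call.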

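Then, when you make the first call for the first page with $fields2 (the long XML string), the call should return XML (as you said), and that XML response has an ItemCount field that contains the number of hotels.

If you look at the long XML string sent in $fields2, there is a field called "FirstItem", which contains 0 in the first call. That field is your offset: increment it to skip the hotels you have already fetched.

Example: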
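
Reading that count could look roughly like the sketch below. The response layout is not shown in the question, so the sample document is made up, and `extractItemCount` is a hypothetical helper that assumes ItemCount appears as an unprefixed element somewhere in the response.

```php
<?php
// Hypothetical helper: returns the first ItemCount value in the response,
// or null when the response is empty, not well-formed, or has no such element.
function extractItemCount(string $responseXml): ?int
{
    if ($responseXml === '') {
        return null;
    }

    $doc = new DOMDocument();
    // Suppress parser warnings; a failed parse is reported as null instead.
    if (@$doc->loadXML($responseXml) === false) {
        return null;
    }

    $nodes = $doc->getElementsByTagName('ItemCount');
    if ($nodes->length === 0) {
        return null;
    }

    return (int) trim($nodes->item(0)->textContent);
}

// Made-up sample; the real response will contain much more than this.
$sample = '<SearchResponse><ItemCount>5214</ItemCount></SearchResponse>';
$totalHotel = extractItemCount($sample);
```

Returning null instead of 0 keeps "no count found" distinguishable from "zero hotels".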
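
The field is easier to spot once the payload is decoded. A small sketch using the tail of the xml parameter from the question (only the variable names are mine):

```php
<?php
// Tail of the xml parameter from the question, still percent-encoded.
$encodedTail = '%3CFirstItem%3E0%3C%2FFirstItem%3E%3CPagerSize%3E10%3C%2FPagerSize%3E';

// rawurldecode() turns the %XX sequences back into characters.
$decodedTail = rawurldecode($encodedTail);

echo $decodedTail, PHP_EOL;
// prints: <FirstItem>0</FirstItem><PagerSize>10</PagerSize>
```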

$fields2 = array(

'ajax'=>'xmlSearchEngineResponder',
'xml' => "%3C%3Fxml%20version%3D%221.0%22%3F%3E%3CSearchRequest%20xmlns%3D%22http%3A%2F%2Fwww.liberologico.com%2Fdbsite%2Fjolly-search%22%20xmlns%3Axsi%3D%22http%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema-instance%22%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5B%2A%5D%5D%3E%3C%2FScope%3E%3CFilters%3E%3CFilters%20xsi%3Atype%3D%22FilterSpecType%22%3E%3CField%3Eaptve_territorio%3C%2FField%3E%3CValue%3E%3CSingleValue%3E%3C%21%5BCDATA%5B%2A%5D%5D%3E%3C%2FSingleValue%3E%3C%2FValue%3E%3CMode%3ETHESAURUS%3C%2FMode%3E%3COperation%3ELIKE%3C%2FOperation%3E%3C%2FFilters%3E%3CFilters%20xsi%3Atype%3D%22FilterSpecType%22%3E%3CField%3Efull_text_search%3C%2FField%3E%3CValue%3E%3CSingleValue%3E%3C%21%5BCDATA%5B%2A%5D%5D%3E%3C%2FSingleValue%3E%3C%2FValue%3E%3CMode%3EFREE_TEXT%3C%2FMode%3E%3COperation%3ELIKE%3C%2FOperation%3E%3C%2FFilters%3E%3CFilters%20xsi%3Atype%3D%22FilterSpecType%22%3E%3CField%3Elang%3C%2FField%3E%3CValue%3E%3CSingleValue%3E%3C%21%5BCDATA%5Bit%5D%5D%3E%3C%2FSingleValue%3E%3C%2FValue%3E%3CMode%3EFREE_TEXT%3C%2FMode%3E%3COperation%3EEQUAL%3C%2FOperation%3E%3C%2FFilters%3E%3C%2FFilters%3E%3CSubSearches%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BEventi%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BArte%20%26%20Cultura%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BMare%20%26%20Natura%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BPiatti%20%26%20Prodotti%20tipici%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BRelax%20%26%20Divertimento%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BDove%20Alloggiare%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BDove%20Mangiare%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BInformazioni%20Utili%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3C%2FSubSearches%3E%3C%2FSearch%3E%3CActiveResultSet%3E%3CTab%3E%3C%21%5BCDATA%5BDove%20Alloggiare%5D%5D%3E%
3C%2FTab%3E%3CFirstItem%3E0%3C%2FFirstItem%3E%3CPagerSize%3E10%3C%2FPagerSize%3E%3C%2FActiveResultSet%3E%3C%2FSearchRequest%3E",
'force' => 'false');

returns the first 10 results;

$fields2 = array(
'ajax'=>'xmlSearchEngineResponder',
'xml' => "%3C%3Fxml%20version%3D%221.0%22%3F%3E%3CSearchRequest%20xmlns%3D%22http%3A%2F%2Fwww.liberologico.com%2Fdbsite%2Fjolly-search%22%20xmlns%3Axsi%3D%22http%3A%2F%2Fwww.w3.org%2F2001%2FXMLSchema-instance%22%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5B%2A%5D%5D%3E%3C%2FScope%3E%3CFilters%3E%3CFilters%20xsi%3Atype%3D%22FilterSpecType%22%3E%3CField%3Eaptve_territorio%3C%2FField%3E%3CValue%3E%3CSingleValue%3E%3C%21%5BCDATA%5B%2A%5D%5D%3E%3C%2FSingleValue%3E%3C%2FValue%3E%3CMode%3ETHESAURUS%3C%2FMode%3E%3COperation%3ELIKE%3C%2FOperation%3E%3C%2FFilters%3E%3CFilters%20xsi%3Atype%3D%22FilterSpecType%22%3E%3CField%3Efull_text_search%3C%2FField%3E%3CValue%3E%3CSingleValue%3E%3C%21%5BCDATA%5B%2A%5D%5D%3E%3C%2FSingleValue%3E%3C%2FValue%3E%3CMode%3EFREE_TEXT%3C%2FMode%3E%3COperation%3ELIKE%3C%2FOperation%3E%3C%2FFilters%3E%3CFilters%20xsi%3Atype%3D%22FilterSpecType%22%3E%3CField%3Elang%3C%2FField%3E%3CValue%3E%3CSingleValue%3E%3C%21%5BCDATA%5Bit%5D%5D%3E%3C%2FSingleValue%3E%3C%2FValue%3E%3CMode%3EFREE_TEXT%3C%2FMode%3E%3COperation%3EEQUAL%3C%2FOperation%3E%3C%2FFilters%3E%3C%2FFilters%3E%3CSubSearches%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BEventi%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BArte%20%26%20Cultura%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BMare%20%26%20Natura%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BPiatti%20%26%20Prodotti%20tipici%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BRelax%20%26%20Divertimento%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BDove%20Alloggiare%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BDove%20Mangiare%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3CSearch%3E%3CScope%3E%3C%21%5BCDATA%5BInformazioni%20Utili%5D%5D%3E%3C%2FScope%3E%3C%2FSearch%3E%3C%2FSubSearches%3E%3C%2FSearch%3E%3CActiveResultSet%3E%3CTab%3E%3C%21%5BCDATA%5BDove%20Alloggiare%5D%5D%3E%
3C%2FTab%3E%3CFirstItem%3E10%3C%2FFirstItem%3E%3CPagerSize%3E10%3C%2FPagerSize%3E%3C%2FActiveResultSet%3E%3C%2FSearchRequest%3E",
'force' => 'false');

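will return the next 10 results. There is also a field called PagerSize that lets you retrieve more results at once.

So I would make a first call to get the total number of hotels, then loop to fetch all the other pages.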
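
Rather than pasting the whole payload again for every page, the two values can be rewritten in place. A minimal sketch, assuming the payload stays percent-encoded exactly as in the question; `withPaging` is a hypothetical helper name:

```php
<?php
// Hypothetical helper: rewrites FirstItem and PagerSize inside the
// percent-encoded xml payload and leaves everything else untouched.
function withPaging(string $encodedXml, int $firstItem, int $pagerSize): string
{
    $patterns = [
        '/(%3CFirstItem%3E)\d+(%3C%2FFirstItem%3E)/',
        '/(%3CPagerSize%3E)\d+(%3C%2FPagerSize%3E)/',
    ];
    // ${1} and ${2} keep the encoded tags; the braces stop the digits that
    // follow from being read as part of the group number.
    $replacements = [
        '${1}' . $firstItem . '${2}',
        '${1}' . $pagerSize . '${2}',
    ];

    // preg_replace() returns null on a regex error; fall back to the input.
    return preg_replace($patterns, $replacements, $encodedXml) ?? $encodedXml;
}
```

For example, `$fields2['xml'] = withPaging($fields2['xml'], 10, 10);` would turn the first payload above into the second one.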

//do first curl call

//then
$totalHotel = 5214; // example value; read the real total from the first call
$increment = 10; // number of hotels to fetch per call
$nbOfHotelimported = $increment; // the first call already fetched this many
while ($nbOfHotelimported < $totalHotel) {
 // do another curl call
 // with FirstItem set to $nbOfHotelimported
 // and PagerSize set to $increment

 $nbOfHotelimported += $increment;
}
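
The same loop can be written so that the offset arithmetic is separate from the HTTP call, which makes it easy to check without touching the site. This is a sketch under my own assumptions: `remainingOffsets` and `fetchRemainingPages` are hypothetical names, and `$fetchPage` stands for whatever function performs the cURL request for a given FirstItem and PagerSize.

```php
<?php
// Offsets of every page after the first one (the first page starts at 0).
function remainingOffsets(int $totalItems, int $pageSize): array
{
    if ($pageSize < 1) {
        throw new InvalidArgumentException('Page size must be at least 1.');
    }

    $offsets = [];
    for ($offset = $pageSize; $offset < $totalItems; $offset += $pageSize) {
        $offsets[] = $offset;
    }
    return $offsets;
}

// $fetchPage is supplied by the caller and receives (int $firstItem, int $pagerSize).
// The result maps each offset to whatever $fetchPage returned for it.
function fetchRemainingPages(int $totalItems, int $pageSize, callable $fetchPage): array
{
    $pages = [];
    foreach (remainingOffsets($totalItems, $pageSize) as $offset) {
        $pages[$offset] = $fetchPage($offset, $pageSize);
    }
    return $pages;
}
```

With 35 hotels and a page size of 10, the offsets are 10, 20 and 30, so the last call returns the final partial page.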

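【Comments】:

【Answer 2】:

You should request the URLs that the site itself calls via AJAX rather than scraping the initial HTML page. With your browser's developer tools you can watch the AJAX requests that are sent when you ask for another page of results and "replicate" them, which shows you exactly how to build your own requests.

By the way, depending on why and how you do this, it may not be the most ethical or legal thing to do.

【Comments】: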
