Scraping HTML page using XPath and PHP

【Question title】: Scraping HTML page using XPath and PHP
【Posted】: 2017-08-11 14:11:55
【Question】:

I'm trying to scrape an HTML page using this PHP code:

<?php
    ini_set('display_errors', 1);

    $url = 'http://www.cittadellasalute.to.it/index.php?option=com_content&view=article&id=6786:situazione-pazienti-in-pronto-soccorso&catid=165:pronto-soccorso&Itemid=372';


    //#Set CURL parameters: pay attention to the PROXY config !!!!
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_PROXY, '');
    $data = curl_exec($ch);
    curl_close($ch);

    $dom = new DOMDocument();
    @$dom->loadHTML($data);

    $xpath = new DOMXPath($dom);

    $greenWaitingNumber = $xpath->query('/html/body/div/div/div[4]/div[3]/section/p');


    foreach( $greenWaitingNumber as $node )
    {
      echo "Number first green line: " .$node->nodeValue;
      echo '<br>';
      echo '<br>';
    }


?>

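Everything runs fine (no errors, and in my browser console I can see "200" as the return code), but nothing is printed in my HTML page.

The problem may be the XPath /html/body/div/div/div[4]/div[3]/section/p, which should refer to the first green line in the source HTML page; that is the path Firefox Firebug reports for that part of the page.

Any suggestions or examples?

!!! UPDATE !!!

As Santosh Sapkota suggested in his reply, the first problem is that the text in that green box is loaded from an iframe. I found the URL of the HTML page that the iframe loads, so I tried using it in my code, which is now: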

<?php
    ini_set('display_errors', 1);

    $url = 'http://listeps.cittadellasalute.to.it/?id=01090101';


    //#Set CURL parameters: pay attention to the PROXY config !!!!
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, TRUE);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
    curl_setopt($ch, CURLOPT_PROXY, '');
    $data = curl_exec($ch);
    curl_close($ch);

    $dom = new DOMDocument();
    @$dom->loadHTML($data);

    $xpath = new DOMXPath($dom);

    $greenWaitingNumber = $xpath->query('/html/body/div/div/div[4]/div[3]/section/p');


    foreach( $greenWaitingNumber as $node )
    {
      echo "Number first green line: " .$node->nodeValue;
      echo '<br>';
      echo '<br>';
    }


?>

Unfortunately, nothing is printed in my output HTML page this time either.

Any other suggestions or examples?

【Comments】:

Tags: php xpath web-scraping

【Solution 1】:

There must be something wrong with your XPath. Also check whether any of the content is loaded from an iframe.
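
One way to run that check from PHP is to list the `src` of every iframe in the fetched document, since an iframe's content lives in a separate document that has to be requested on its own. The sketch below is only an illustration: the inline markup is a hypothetical stand-in for the downloaded page (the real page's structure is not known here), and it is used in place of the curl call so the example runs without network access.

```php
<?php
// Hypothetical markup standing in for the page fetched with curl;
// the real page's structure will differ.
$html = <<<HTML
<html><body>
  <div id="content">
    <p>Static text</p>
    <iframe src="http://listeps.cittadellasalute.to.it/?id=01090101"></iframe>
  </div>
</body></html>
HTML;

// Collect parser warnings internally instead of silencing them with @.
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Every iframe points at another document that must be fetched separately.
$iframeUrls = [];
foreach ($xpath->query('//iframe/@src') as $attr) {
    $iframeUrls[] = $attr->nodeValue;
}

// The absolute path from the question selects nothing in this sample document.
$matches = $xpath->query('/html/body/div/div/div[4]/div[3]/section/p');

echo 'iframes found: ' . count($iframeUrls) . PHP_EOL;
echo 'nodes matched by the absolute path: ' . $matches->length . PHP_EOL;
```

If the list of iframe URLs is not empty and the query matches no nodes, the text being looked for is probably inside one of those framed documents rather than in the outer page.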

【Comments】:

How do I get the right XPath when there's an iframe?
If you're trying to get the text in the green box, you can clearly see that it is loaded from an iframe.
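
A possible approach to the XPath question: once the iframe's own URL is fetched, the path has to be written against that document, not against the outer page. Browser inspectors report paths for the live DOM, which may differ from the raw HTML that curl receives, so a short relative expression is often easier to get right than a long absolute one. The sketch below is a rough illustration under stated assumptions: the markup and the class names `verde` and `giallo` are invented, and the real iframe document must be inspected to find a suitable anchor.

```php
<?php
// Hypothetical markup standing in for the iframe document;
// the class names "verde" and "giallo" are invented for this example.
$html = <<<HTML
<html><body>
  <div class="wrapper">
    <section class="verde"><p>12</p></section>
    <section class="giallo"><p>5</p></section>
  </div>
</body></html>
HTML;

libxml_use_internal_errors(true);   // keep parser warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Absolute path copied from the outer page: it does not exist in this document.
$absolute = $xpath->query('/html/body/div/div/div[4]/div[3]/section/p');

// Relative path anchored on an attribute: independent of the nesting depth.
$relative = $xpath->query('//section[@class="verde"]/p');

if ($absolute->length === 0) {
    echo 'Absolute path matched nothing' . PHP_EOL;
}

foreach ($relative as $node) {
    echo 'Number first green line: ' . trim($node->nodeValue) . PHP_EOL;
}
```

Checking `length` on the result makes an empty match visible instead of leaving a silent `foreach`. If even a relative query finds nothing, it is worth printing the fetched HTML: the values may be filled in by JavaScript, which curl does not execute.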