Xpath scraping further link from subpages? Answers

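【Question title】: Xpath scraping further link from subpages?
【Posted】: 2021-10-15 12:58:11
【Question】:

I finally managed to write a PHP script that scrapes basic elements from another website. It's super simple. This example shows how to get the title and the URL.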

<?php
ini_set('display_errors', 1);

$url = 'http://test123cxqwq12.000webhostapp.com/mainpage.php';

// Download the main page.
$ch = curl_init();
curl_setopt($ch, CURLOPT_AUTOREFERER, true);
curl_setopt($ch, CURLOPT_HEADER, 0);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);

$data = curl_exec($ch);
curl_close($ch);

if ($data === false) {
    die('Request failed');
}

// Parse the HTML; @ suppresses warnings about malformed markup.
$dom = new DOMDocument();
@$dom->loadHTML($data);

$xpath = new DOMXPath($dom);

// Titles and link targets, paired up by position.
$title = $xpath->query('/html/body/a/h1');
$source = $xpath->query('/html/body/a/@href');

for ($i = 0; $i < $source->length; $i++) {
    $new = $source->item($i)->nodeValue;
    $text = $title->item($i)->nodeValue;
    // Print the title as the link text (the original put it in an <img src>, which cannot work for a title).
    echo '<a href="' . htmlspecialchars($new) . '" target="_blank">' . htmlspecialchars($text) . '</a><br>';
}

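Result page: http://test123cxqwq12.000webhostapp.com/scrap.php

Page to scrape: http://test123cxqwq12.000webhostapp.com/mainpage.php

Subpage: http://test123cxqwq12.000webhostapp.com/subpage.php

Now I want to go one step further and get data from the subpage. So instead of taking the source (the link) from the main page as I do now, I want to follow that link and get another source from the subpage (in this example, the google.com link). I'm out of ideas. I'd appreciate some hints: is it possible to do this with XPath the way I'm doing it now?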
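
For illustration, the two-step idea described above could be sketched roughly as follows: read the links from the main page, fetch each linked subpage, and run a second XPath query there. This is only a sketch under assumptions. The helper names `fetchHtml`, `extractValues`, and `scrapeSubpages` are hypothetical, the subpage query `//a/@href` is a guess because the subpage markup is not shown here, and the links on the main page are assumed to be absolute URLs (relative ones would have to be resolved against the main page URL first).

```php
<?php

// Fetch a page body with cURL; returns null if the request fails.
function fetchHtml(string $url): ?string
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    $html = curl_exec($ch);
    return is_string($html) ? $html : null;
}

// Run an XPath query against an HTML string and return the matched node values.
function extractValues(string $html, string $query): array
{
    if (trim($html) === '') {
        return []; // nothing to parse
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect markup warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $values = [];
    $nodes = (new DOMXPath($dom))->query($query);
    if ($nodes !== false) {
        foreach ($nodes as $node) {
            $values[] = trim((string) $node->nodeValue);
        }
    }
    return $values;
}

// For every link on the main page, fetch that subpage and collect the links found there.
function scrapeSubpages(string $mainUrl): array
{
    $result = [];
    $mainHtml = fetchHtml($mainUrl);
    if ($mainHtml === null) {
        return $result;
    }
    foreach (extractValues($mainHtml, '/html/body/a/@href') as $subUrl) {
        $subHtml = fetchHtml($subUrl);
        if ($subHtml === null) {
            continue; // skip subpages that cannot be fetched
        }
        // Assumed query: every link on the subpage.
        $result[$subUrl] = extractValues($subHtml, '//a/@href');
    }
    return $result;
}
```

Calling `scrapeSubpages('http://test123cxqwq12.000webhostapp.com/mainpage.php')` would return an array keyed by subpage URL, each entry holding the links found on that subpage.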

【Comments】:

Tags: php web-scraping xpath

【Solution 1】:

I think the solution is to store the URLs in a database and then apply your cURL and XPath functions to them.

<?php

// Fetch a URL with cURL and return the response body.
function curlGet($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_URL, $url);
    $results = curl_exec($ch);
    curl_close($ch);
    return $results;
}

// Build a DOMXPath object from an HTML string.
function returnXPathObject($item) {
    $xmlPageDom = new DOMDocument();
    @$xmlPageDom->loadHTML($item);
    $xmlPageXPath = new DOMXPath($xmlPageDom);
    return $xmlPageXPath;
}

// $cxn is assumed to be an existing PDO connection; yourDatabaseUrl is a placeholder table name.
$allUrl = $cxn->query("SELECT * FROM yourDatabaseUrl");
// FETCH_COLUMN returns the first column of each row, assumed here to hold the URL.
$allUrl = $allUrl->fetchAll(PDO::FETCH_COLUMN);

foreach ($allUrl as $url) {
    $getDom = curlGet($url);
    $getDomXpath = returnXPathObject($getDom);
    // Same queries as in the question, now run against each stored URL.
    $title = $getDomXpath->query('/html/body/a/h1');
    $source = $getDomXpath->query('/html/body/a/@href');
}

I'm not sure about this answer; it's only a suggestion.
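
If a database is more than the task needs, the same store-then-process idea could also be sketched with an in-memory queue standing in for the URL table. This is an alternative illustration, not the method proposed above. The names `linksIn` and `crawl` are hypothetical, `$fetch` is any callable that maps a URL to its HTML (for real pages, a closure wrapping the cURL call), and links are assumed to be absolute URLs.

```php
<?php

// Return every href found in an HTML string (empty array if there is nothing to parse).
function linksIn(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // collect markup warnings instead of printing them
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    foreach ((new DOMXPath($dom))->query('//a/@href') as $attr) {
        $links[] = trim((string) $attr->nodeValue);
    }
    return $links;
}

// Breadth-first walk: the queue plays the role of the URL table.
// $fetch maps a URL to its HTML string, or null when the page cannot be loaded.
function crawl(string $startUrl, callable $fetch, int $maxDepth = 1): array
{
    $queue = [[$startUrl, 0]];
    $seen = [$startUrl => true]; // avoids fetching the same URL twice
    $found = [];

    while ($queue) {
        [$url, $depth] = array_shift($queue);
        $html = $fetch($url);
        if ($html === null) {
            continue;
        }
        $found[$url] = linksIn($html);
        if ($depth >= $maxDepth) {
            continue; // record the links but do not follow them further
        }
        foreach ($found[$url] as $link) {
            if (!isset($seen[$link])) {
                $seen[$link] = true;
                $queue[] = [$link, $depth + 1];
            }
        }
    }
    return $found;
}
```

With `$maxDepth = 1`, the walk reads the start page and each page it links to, which matches the main page plus subpage case in the question. The depth limit and the `$seen` set keep the walk from looping or wandering across the whole web.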

【Discussion】: