Returning sitemap URLs: answers

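【Question title】: Returning sitemap URLs
【Posted】: 2016-12-12 18:19:31
【Question description】:

I am trying to return all of the URLs listed in a website's sitemap, for example Argos. Once I have those URLs, I need to repeat the process to return any URLs that the resulting URLs may themselves contain. For example:

http://www.argos.co.uk/sitemap.xml returns: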

http://www.argos.co.uk/product.xml
http://www.argos.co.uk/product2.xml
http://www.argos.co.uk/catalogue.xml
http://www.argos.co.uk/buyers_guides.xml
http://www.argos.co.uk/features_and_articles.xml
http://www.argos.co.uk/static_pages.xml
http://www.argos.co.uk/store_pages.xml

http://www.argos.co.uk/product.xml then contains links of its own, which I also need (the process then repeats until it reaches a page that contains no further XML URLs).

What I have so far:

var urls = require('sitemap-urls'); // package that extracts the URLs listed in a sitemap
var cheerio = require('cheerio');
var request = require('request');

// Fetch the top-level sitemap and extract every URL it lists
request('http://www.argos.co.uk/sitemap.xml', function (error, response, html) {
  if (error) {
    return console.error(error);
  }
  var results = urls.extractUrls(html);

  // If results were returned, loop over each URL until the end of the array
  if (results) {
    for (var i = 0; i < results.length; i++) {
      var sitemap = results[i];
      console.log(sitemap);

      // Need to repeat the URL extraction process for each URL returned
    }
  }
});

There is probably a simple solution that I am overlooking. Any help would be much appreciated, thanks.

【Question discussion】:

Tags: javascript node.js xml request sitemap

【Solution 1】:

I think what you are looking for is a spider (a web crawler):

<?php
function crawl_page($url, $depth = 5)
{
    static $seen = array();
    if (isset($seen[$url]) || $depth === 0) {
        return;
    }

    $seen[$url] = true;

    $dom = new DOMDocument('1.0');
    @$dom->loadHTMLFile($url);

    $anchors = $dom->getElementsByTagName('a');
    foreach ($anchors as $element) {
        $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }
                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= $path;
            }
        }
        crawl_page($href, $depth - 1);
    }
    echo "URL:",$url,PHP_EOL,"CONTENT:",PHP_EOL,$dom->saveHTML(),PHP_EOL,PHP_EOL;
}
crawl_page("http://hobodave.com", 2);

【Discussion】:

Thanks, but it needs to be in JavaScript. Sorry for not making that clear.
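
For reference, the recursive process described in the question could be sketched in JavaScript roughly as follows. This is only a minimal sketch under several assumptions: the names `extractLocs`, `isSitemapUrl`, `collectUrls` and `fetchText` are hypothetical and do not come from the question or from any package; URLs are pulled out of `<loc>` elements with a simple regular expression rather than a real XML parser, so CDATA sections and escaped entities such as `&amp;` are not handled; and a URL is treated as a nested sitemap only if its path ends in `.xml`, so gzipped sitemaps are not followed. The download step is passed in as a function so the traversal does not depend on a particular HTTP library.

```javascript
// Extract the text of every <loc> element from a sitemap XML string.
// Assumption: a regex is good enough here; it is not a full XML parser.
function extractLocs(xml) {
  const locs = [];
  const pattern = /<loc>\s*([^<\s]+)\s*<\/loc>/gi;
  let match;
  while ((match = pattern.exec(xml)) !== null) {
    locs.push(match[1]);
  }
  return locs;
}

// Assumption: nested sitemaps are recognised by a ".xml" ending
// (optionally followed by a query string).
function isSitemapUrl(url) {
  return /\.xml(\?.*)?$/i.test(url);
}

// Recursively walk a sitemap. fetchText is a caller-supplied (hypothetical)
// function: url -> Promise<string>. `seen` stops a sitemap from being
// fetched twice; `pages` accumulates the non-XML URLs that were found.
async function collectUrls(url, fetchText, seen = new Set(), pages = []) {
  if (seen.has(url)) {
    return pages;
  }
  seen.add(url);

  const xml = await fetchText(url);
  for (const loc of extractLocs(xml)) {
    if (isSitemapUrl(loc)) {
      // Another sitemap: repeat the same process on it.
      await collectUrls(loc, fetchText, seen, pages);
    } else {
      // An ordinary page URL: keep it.
      pages.push(loc);
    }
  }
  return pages;
}
```

On a Node.js version where `fetch` is available globally, `fetchText` could be something like `(u) => fetch(u).then((r) => r.text())`, and the walk would start with `collectUrls('http://www.argos.co.uk/sitemap.xml', fetchText).then(console.log)`. A failed download rejects the whole promise in this sketch, and the sitemaps are fetched one after another rather than in parallel, which is slower but gentler on the target server.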