【问题标题】:Returning a sitemap urls返回站点地图网址
【发布时间】:2016-12-12 18:19:31
【问题描述】:

我正在尝试返回网站站点地图中提供的所有 URL,例如 Argos。获得这些 URL 后,我需要重复此过程以返回结果 URL 可能包含的任何 URL。例如:

http://www.argos.co.uk/sitemap.xml 返回:

http://www.argos. co.uk/product.xml
http://www.argos. co.uk/product2.xml
http://www.argos. co.uk/catalogue.xml
http://www.argos. co.uk/buyers_guides.xml
http://www.argos. co.uk/features_and_articles.xml
http://www.argos. co.uk/static_pages.xml
http://www.argos. co.uk/store_pages.xml

http://www.argos.co.uk/product.xml 然后包含我需要的它自己的链接(然后重复此过程,直到到达一个不包含更多可用 xml URL 的页面)

到目前为止:

var urls = require('sitemap-urls'); //package to return xml links from sitemap
var cheerio = require('cheerio');
var request = require('request')

// Returns all xml urls located within page source
request('http://www.argos.co.uk/sitemap.xml', function (error, response, html) {
  var sitemap = html;
  var results = urls.extractUrls(sitemap);

// If results returned, loop to make sitemap equal each url until array end
    if(results) {
    for(i = 0; i < results.length; i++) {
        sitemap = results[i]
        console.log(sitemap)

    // Need to repeat url return process for each url returned


    }
  }                                                                                         
});

可能有一个我忽略的简单解决方案,任何帮助将不胜感激,谢谢。

【问题讨论】:

    标签: javascript node.js xml request sitemap


    【解决方案1】:

    我想你要找的是蜘蛛

    <?php
    function crawl_page($url, $depth = 5)
    {
        static $seen = array();
        if (isset($seen[$url]) || $depth === 0) {
            return;
        }
    
        $seen[$url] = true;
    
        $dom = new DOMDocument('1.0');
        @$dom->loadHTMLFile($url);
    
        $anchors = $dom->getElementsByTagName('a');
        foreach ($anchors as $element) {
            $href = $element->getAttribute('href');
        if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            crawl_page($href, $depth - 1);
        }
        echo "URL:",$url,PHP_EOL,"CONTENT:",PHP_EOL,$dom->saveHTML(),PHP_EOL,PHP_EOL;
    }
    crawl_page("http://hobodave.com", 2);
    

    【讨论】:

    • 谢谢,但它需要在 JavaScript 中。抱歉没有说清楚
    猜你喜欢
    • 2020-03-31
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2012-03-09
    • 2020-04-06
    • 2014-09-24
    相关资源
    最近更新 更多