How to get all links from a website with puppeteer

【Question title】: How to get all links from a website with puppeteer
【Posted】: 2026-01-14 09:50:01
【Question description】:

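I want a way to use puppeteer and a for loop to get all the links on a website and add them to an array. In this case, the links I want are not the links inside HTML tags; they are the links that appear directly in the source code, such as links to JavaScript files, etc. I want something like this: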

const array = [];
// `links` stands for the collection of URLs I am trying to obtain
for (const link of links) {
  // The code should take all the links and add these links to the array
  array.push(link);
}

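But how can I get all the references to JavaScript and style files, and all the URLs in a website's source code? I have only found one post and one question that teach or show how to get links from tags, not all the links from the source code.

Suppose you want to get all the tags on this page, for example:

view-source:https://www.nike.com/

How can I get all the script tags and print them to the console? I wrote view-source:https://nike.com because that is where you can see the script tags. I don't know whether this can be done without displaying the source code. Displaying it and then grabbing the script tags was my idea, but I don't know how to do it.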

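【Question comments】:

A bounty is a way of using reputation to advertise a question, but be warned: you lose the rep immediately, with little chance of getting it back.
Stack Overflow is not a code-writing service. Please first show us your own research, what worked, and what problems you ran into.
By "website", do you mean one specific link (e.g. google.com) or all sub-links too (e.g. google.com and google.com/something, etc.)?
@Tschallacka I don't have any code. I couldn't find anything that explains this, so I asked on Stack Overflow to get an answer; I didn't find what I was looking for.
@ulou I want to get all links and sub-links, including links from CSS and JavaScript files, etc. I want to be able to get every link and sub-link visible in the source code.

Tags: javascript html node.js puppeteer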

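【Solution 1】:

You can get all the links from a URL using only node.js, without puppeteer.

There are two main steps:

1. Get the source code of the URL.
2. Parse the source code for links.

A simple implementation in node.js: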

// get-links.js

///
/// Step 1: Request the URL's HTML source.
///

const axios = require('axios');

// Extract the HTML source from the response, then process it:
axios.get('https://www.nike.com')
    .then(function (response) {
        const htmlSource = response.data;
        getLinksFromHtml(htmlSource);
    })
    .catch(function (error) {
        console.error('Request failed:', error.message);
    });

///
/// Step 2: Find links in the HTML source.
///

// This function takes HTML (as a string) and returns all the links within it.
function getLinksFromHtml(htmlString) {
    // Regular expression that matches the syntax of a link (https://*.com/a/3809435/117030):
    const LINK_REGEX = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/gi;

    // Use the regular expression above to find all the links
    // (match() returns null when nothing matches, so fall back to an empty array):
    const matches = htmlString.match(LINK_REGEX) || [];

    // Output to console:
    console.log(matches);

    // Also return the array of links for further processing:
    return matches;
}

Example usage:

$ node get-links.js
[
    'http://www.w3.org/2000/svg',
    ...
    'https://s3.nikecdn.com/unite/scripts/unite.min.js',
    'https://www.nike.com/android-icon-192x192.png',
    ...
    'https://connect.facebook.net/',
... 658 more items
]
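
Since the question asks specifically about JavaScript and style files, the array printed above could be narrowed down afterwards. The following is a minimal sketch, not part of the original answer: `filterLinksByExtension` is a hypothetical helper name, and the sample array reuses URLs from the output above with one duplicate added for illustration.

```javascript
// Hypothetical helper: keeps only URLs whose path ends in one of the given
// extensions, and removes duplicates while preserving first-seen order.
function filterLinksByExtension(links, extensions) {
  const wanted = new Set(extensions.map((ext) => ext.toLowerCase()));
  const seen = new Set();
  const result = [];
  for (const link of links || []) {
    let pathname;
    try {
      pathname = new URL(link).pathname; // leaves out query string and fragment
    } catch {
      continue; // skip anything that is not a parseable absolute URL
    }
    const dot = pathname.lastIndexOf('.');
    const ext = dot === -1 ? '' : pathname.slice(dot).toLowerCase();
    if (wanted.has(ext) && !seen.has(link)) {
      seen.add(link);
      result.push(link);
    }
  }
  return result;
}

// Sample input: URLs from the output above, with one duplicate added.
const sampleLinks = [
  'http://www.w3.org/2000/svg',
  'https://s3.nikecdn.com/unite/scripts/unite.min.js',
  'https://www.nike.com/android-icon-192x192.png',
  'https://s3.nikecdn.com/unite/scripts/unite.min.js',
  'https://connect.facebook.net/',
];

console.log(filterLinksByExtension(sampleLinks, ['.js', '.css']));
// → [ 'https://s3.nikecdn.com/unite/scripts/unite.min.js' ]
```

Matching on the parsed pathname rather than on the raw string means a URL such as `app.js?v=3` would still count as a `.js` file.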

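Notes:

I used the axios library for simplicity and to avoid "access denied" errors from nike.com. Any other method can be used to fetch the HTML source, for example:
- The native node.js http/https libraries
- Puppeteer (see "Get complete web page source html with puppeteer - but some part always missing")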
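
As a rough illustration of the first alternative, the HTML source could be fetched with the built-in modules along these lines. This is only a sketch under assumptions: `clientFor` and `fetchHtml` are hypothetical names, redirects are not followed, and because the answer chose axios partly to avoid "access denied" errors, this variant may well need extra request headers for nike.com.

```javascript
const http = require('http');
const https = require('https');

// Chooses the built-in client module that matches the URL's protocol.
function clientFor(url) {
  const { protocol } = new URL(url);
  if (protocol === 'https:') return https;
  if (protocol === 'http:') return http;
  throw new Error(`Unsupported protocol: ${protocol}`);
}

// Hypothetical helper: resolves with the response body as a string.
// Redirects are not followed; any non-2xx status rejects the promise.
function fetchHtml(url) {
  return new Promise((resolve, reject) => {
    const request = clientFor(url).get(url, (response) => {
      if (response.statusCode < 200 || response.statusCode >= 300) {
        response.resume(); // discard the body so the socket is released
        reject(new Error(`Request failed with status ${response.statusCode}`));
        return;
      }
      response.setEncoding('utf8');
      let body = '';
      response.on('data', (chunk) => { body += chunk; });
      response.on('end', () => resolve(body));
    });
    request.on('error', reject);
  });
}

// Usage (performs a real network request, so it is left commented out):
// fetchHtml('https://www.nike.com').then(getLinksFromHtml).catch(console.error);
```

The resolved string can be handed to `getLinksFromHtml` from the answer exactly as `response.data` is.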

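【Comments】:

【Solution 2】:

Yes, you can get all the script tags and their links without opening view-source. You need to add the jsdom library as a dependency in your project, then pass the HTML response to a JSDOM instance, as shown below.

The code is as follows: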

const axios = require('axios');
const jsdom = require('jsdom');

// Wrapped in an async function because `await` is not allowed
// at the top level of a CommonJS script
(async () => {
    // make a simple HTTP request using axios or node-fetch, as you wish
    const nikePageResponse = await axios.get('https://www.nike.com');

    // now parse this response into an HTML document using the jsdom library
    const dom = new jsdom.JSDOM(nikePageResponse.data);
    const nikePage = dom.window.document;

    // now get all the script tags by querying this page
    const scriptLinks = [];
    nikePage.querySelectorAll('script[src]').forEach((script) => scriptLinks.push(script.src.trim()));
    console.debug('%o', scriptLinks);
})().catch(console.error);

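Here I used a CSS selector that matches <script> tags that have a src attribute.

You can write the same code with puppeteer, but launching a browser and everything else and then getting its page source takes some time.

You can use this to find the links, then do whatever you want with them using puppeteer or any other tool.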
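
The same selector idea can be extended beyond script tags, for example to stylesheets, anchors, and images. The sketch below is an assumption-laden illustration rather than part of the answer: `collectAttributeUrls` and `URL_ATTRIBUTES` are hypothetical names, and a small stub stands in for the jsdom document so the snippet runs without any dependency.

```javascript
// Selector -> attribute pairs to harvest. The choice of tags is an assumption;
// add or remove entries as needed.
const URL_ATTRIBUTES = {
  'script[src]': 'src',
  'link[href]': 'href',
  'a[href]': 'href',
  'img[src]': 'src',
};

// Hypothetical helper: collects URL-bearing attributes from a DOM document,
// resolves relative values against baseUrl, and removes duplicates.
function collectAttributeUrls(document, baseUrl) {
  const urls = new Set();
  for (const [selector, attribute] of Object.entries(URL_ATTRIBUTES)) {
    for (const element of document.querySelectorAll(selector)) {
      const raw = (element.getAttribute(attribute) || '').trim();
      if (!raw) continue;
      try {
        urls.add(new URL(raw, baseUrl).href);
      } catch {
        // ignore values that cannot be parsed as URLs
      }
    }
  }
  return [...urls];
}

// Minimal stand-in for a DOM document so the sketch runs without jsdom.
const stubDocument = {
  querySelectorAll(selector) {
    const elements = {
      'script[src]': [{ src: '/static/app.js' }],
      'link[href]': [{ href: 'https://cdn.example.com/site.css' }],
    };
    return (elements[selector] || []).map((attrs) => ({
      getAttribute: (name) => attrs[name] ?? null,
    }));
  },
};

console.log(collectAttributeUrls(stubDocument, 'https://example.com/'));
// → [ 'https://example.com/static/app.js', 'https://cdn.example.com/site.css' ]
```

With the jsdom code from this answer, the call would presumably be `collectAttributeUrls(dom.window.document, 'https://www.nike.com')`.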

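【Comments】:

【Solution 3】:

Although the other answers work in many cases, they will not work for client-side rendered websites. For example, if you just make an Axios request to Reddit, all you will get back is a couple of divs with some metadata. Because Puppeteer actually loads the page and runs all of its JavaScript in a real browser, the website's choice of rendering strategy becomes irrelevant to extracting page data.

Puppeteer has an evaluate method on the page object that lets you run JavaScript directly in the page. Using it, you can easily extract all the links as follows: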

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://example.com');
  
  const pageUrls = await page.evaluate(() => {
    const urlArray = Array.from(document.links).map((link) => link.href);
    const uniqueUrlArray = [...new Set(urlArray)];
    return uniqueUrlArray;
  });

  console.log(pageUrls);
 
  await browser.close();
})();
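
Note that `document.links` only covers `<a>` and `<area>` elements that have an href. Since the question also asks about script and stylesheet URLs, one possible complement is to record every request the page makes while loading, using Puppeteer's `request` page event together with `request.url()` and `request.resourceType()`. This is a sketch, not part of the answer: `collectRequestUrls` is a hypothetical helper, and an `EventEmitter` stands in for the page so the snippet runs without a browser.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: records the URL of every request the page issues,
// grouped by resource type ('script', 'stylesheet', 'image', ...).
// `page` only needs an on('request', handler) method, as a Puppeteer Page has.
function collectRequestUrls(page) {
  const urlsByType = new Map();
  page.on('request', (request) => {
    const type = request.resourceType();
    if (!urlsByType.has(type)) urlsByType.set(type, new Set());
    urlsByType.get(type).add(request.url()); // the Set removes duplicates
  });
  return urlsByType;
}

// Offline demonstration: a fake page and fake requests (illustrative data only).
const fakePage = new EventEmitter();
const collected = collectRequestUrls(fakePage);
const fakeRequest = (url, type) => ({ url: () => url, resourceType: () => type });

fakePage.emit('request', fakeRequest('https://example.com/app.js', 'script'));
fakePage.emit('request', fakeRequest('https://example.com/app.js', 'script'));
fakePage.emit('request', fakeRequest('https://example.com/site.css', 'stylesheet'));

for (const [type, urls] of collected) {
  console.log(type, [...urls]);
}
```

With a real Puppeteer page, `collectRequestUrls(page)` would need to be called before `page.goto(...)` so that the listener is attached before the first request fires.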

【Comments】: