【Question title】: Puppeteer can't find elements when headless is true
【Posted】: 2021-11-26 00:05:27
【Question description】:

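I'm having trouble with Puppeteer: I want to extract a list of items, and it works when headless is false but not when it is true.

First, I want to get these elements before mapping over them.

Here is my script; you should be able to reproduce it, it is really basic.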


const puppeteer = require("puppeteer");
const chalk = require("chalk");

const baseUrl = "https://www.interencheres.com/recherche/lots?search=";

const searchTerm = "Apple";

const searchUrl = baseUrl + searchTerm;

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    ignoreHTTPSErrors: true,
    args: [`--window-size=1920,1080`],
    defaultViewport: {
      width: 1920,
      height: 1080,
    },
  });

  const page = await browser.newPage();

  // Begin navigation
  console.log(chalk.yellow("Beginning navigation."));
  await page.goto(searchUrl);

  // Await List of elements;
  console.log(chalk.yellow("Wait for Network Idle..."));
  await page.waitForNetworkIdle();

  // get Items
  const findElements = await page.evaluate(() => {
    const elements = document.querySelectorAll(".sale-item");
    console.log(elements);
    return elements;
  });

  console.log(findElements);

  console.log(chalk.blue("Waiting..."));
  await page.waitForTimeout(10000);

  await browser.close();
  console.log(chalk.red("Closed."));
})();
Expected results:

{
  '0': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
  '1': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
  '2': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
  '3': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
  '4': { _prevClass: 'sale-item pa-1 col-sm-6 col-md-4 col-lg-3 col-12' },
   .
   .
}

【Discussion】:

    Tags: javascript web-scraping puppeteer headless-browser


    【Solution 1】:

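    For starters, I prefer page.waitForSelector(yourSelector) to page.waitForNetworkIdle(). In most cases it is a more direct guarantee that the data you want is on the page, whereas waiting for network idle can block on all sorts of requests that have nothing to do with the data you are trying to scrape.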
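As a rough sketch of that preference, the helper below waits for the selector and then reads plain values out of the matched elements with page.$$eval, so only serializable data crosses back into Node. The helper name collectClassNames, the 15-second default timeout, and the choice to return an empty array on timeout are assumptions made for illustration, not part of the original answer. It also assumes that Puppeteer's timeout error reports its name as "TimeoutError".

```javascript
// Hypothetical helper: wait for `selector`, then return the class attribute
// of every matching element as an array of strings.
async function collectClassNames(page, selector, timeoutMs = 15000) {
  try {
    // Resolves once at least one matching element is present in the DOM.
    await page.waitForSelector(selector, { timeout: timeoutMs });
  } catch (err) {
    // Assumption: Puppeteer timeouts carry the name "TimeoutError".
    if (err && err.name === "TimeoutError") return [];
    throw err;
  }
  // The callback runs in the browser; only its return value (plain strings)
  // is sent back to Node.
  return page.$$eval(selector, (elements) => elements.map((el) => el.className));
}
```

With a real page this would be called after page.goto(searchUrl), for example as `const classes = await collectClassNames(page, ".sale-item");`.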

    Some websites check request headers to block scrapers. You can try adding a user agent header, as described in the Puppeteer GitHub issue "Different behavior between { headless: false } and { headless: true }" #665:

    const puppeteer = require("puppeteer");
    
    const baseUrl = "https://www.interencheres.com/recherche/lots?search=";
    const searchTerm = "Apple";
    const searchUrl = baseUrl + searchTerm;
    
    let browser;
    (async () => {
      browser = await puppeteer.launch({headless: true});
      const [page] = await browser.pages();
      await page.setUserAgent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36");
      await page.goto(searchUrl);
      await page.waitForSelector(".sale-item");
      const elements = await page.$$(".sale-item");
      console.log(elements.length); // => 48
    })()
      .catch(err => console.error(err))
      .finally(() => browser?.close())
    ;
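The hard-coded user agent in the snippet above pins an old Chrome version. One alternative, sketched here under the assumption that the site only objects to the "HeadlessChrome" token that headless Chrome puts in its default user agent, is to reuse the browser's own string with that token replaced. The helper names stripHeadlessToken and applyCleanUserAgent are hypothetical.

```javascript
// Replace the "HeadlessChrome" product token with "Chrome".
// Strings without the token are returned unchanged.
function stripHeadlessToken(userAgent) {
  return userAgent.replace("HeadlessChrome", "Chrome");
}

// Hypothetical helper: read the browser's default user agent and apply the
// cleaned version to `page`. Call it before page.goto().
async function applyCleanUserAgent(browser, page) {
  const original = await browser.userAgent();
  const cleaned = stripHeadlessToken(original);
  await page.setUserAgent(cleaned);
  return cleaned;
}
```

In the answer's script this would stand in for the page.setUserAgent(...) line, as `await applyCleanUserAgent(browser, page);`. Sites that fingerprint more than the user agent will likely still detect headless mode.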
    

    Using puppeteer-extra, as described in "Why does headless need to be false for Puppeteer to work?", is another option you can try. It also anonymizes the user agent header.
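A minimal sketch of that option follows, assuming the puppeteer, puppeteer-extra and puppeteer-extra-plugin-stealth packages are installed. The wrapper name launchStealthBrowser is hypothetical; the packages are required inside the function so that defining it has no side effects.

```javascript
// Hypothetical wrapper: launch a browser through puppeteer-extra with the
// stealth plugin registered. Intended to be called once per process.
async function launchStealthBrowser(launchOptions = { headless: true }) {
  // puppeteer-extra wraps the regular puppeteer package.
  const puppeteer = require("puppeteer-extra");
  const StealthPlugin = require("puppeteer-extra-plugin-stealth");

  // Register the plugin before launching so its evasions apply to new pages.
  puppeteer.use(StealthPlugin());
  return puppeteer.launch(launchOptions);
}
```

It would replace the puppeteer.launch(...) call in the script above, for example `browser = await launchStealthBrowser();`, with the rest of the script left as it is.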

    【Comments】:

    • You're right @ggorlen, this website can detect my headless user agent. I just checked out Puppeteer Extra and Stealth. Thanks for the suggestion about the waitForSelector() function!