【Question title】: Problem with puppeteer web scrape
【Posted】: 2020-02-28 09:27:09
【Question】:

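I need to scrape a list of 113 URLs, collect the Title, Image URL, and Content from each of them, and put the results into a JSON / text file to import later.

But I can't seem to get it working properly. The loop now works, in that it navigates to each URL, but the returned result is undefined, and I'm not sure why the scraped data isn't coming through.

Can I get some help?

Edit: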

const puppeteer = require('puppeteer');

let scrape = async (i, url) => {
const browser = await puppeteer.launch({
    headless: false // Show Browser
});

// Load a new page
const page = await browser.newPage();

// Set viewport size
await page.setViewport({ width: 1366, height: 768, deviceScaleFactor: 1 });

// Go to URL
await page.goto(`${url}`, { waitUntil: 'networkidle2' });

// Run the scrape over the page
const results = await page.evaluate(() => {
    // H2 Heading
    let title = document.querySelector('div.wsite-section-elements > h2.wsite-content-title').innerText;
    // Image
    let imageURL = document.querySelector('div.wsite-section-elements > div > div > a> img').getAttribute('src');
    // Paragraph
    let txtContent = document.querySelector('div.wsite-section-elements > div.paragraph').innerText;

});

//Close Browser
await browser.close();

// Return scrape results
return results;
};

(async () => {
// Pages to scrape
let pageURLs = ['https://www.bibleed.com/the-divine-origin-of-the-bible.html','https://www.bibleed.com/the-bible-our-guide.html'];

for(let i = 0; i < pageURLs.length; i++)
{
    await scrape(i, pageURLs[i]).then((value) => {
        console.log(i, ': ', value);
    });
}

// Write to file
//const fs = require('fs');
//fs.writeFileSync('webScrape3.txt', JSON.stringify(result), err => err ? console.log(err): null);
})();

【Comments】:

    Tags: javascript web puppeteer scrape


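    【Solution 1】:

    You created the pamphletData variable inside the for loop, so it is not accessible outside the loop. When you call JSON.stringify(pamphletData) after the loop, you are effectively calling JSON.stringify(undefined).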
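The scoping issue described here can be illustrated with a minimal sketch. The `pages` array and the two function names are hypothetical stand-ins, and the sketch assumes the variable was declared with `let` inside the loop body; the asker's earlier code may have differed.

```javascript
// Hypothetical stand-in data; not taken from the original question.
const pages = ['page-a', 'page-b'];

function brokenVersion() {
  for (let i = 0; i < pages.length; i++) {
    // Declared inside the loop body, so it only exists within each iteration.
    let pamphletData = { index: i, name: pages[i] };
  }
  // pamphletData is not in scope here. typeof is used so that the lookup
  // yields the string 'undefined' instead of throwing a ReferenceError.
  return typeof pamphletData;
}

function fixedVersion() {
  // Declared before the loop, so it is still in scope after the loop ends.
  const pamphletData = [];
  for (let i = 0; i < pages.length; i++) {
    pamphletData.push({ index: i, name: pages[i] });
  }
  return JSON.stringify(pamphletData);
}

console.log(brokenVersion());
console.log(fixedVersion());
```

Declaring the collection before the loop and pushing into it on each iteration keeps every record available when the data is serialized.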

    【Comments】:

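      【Solution 2】:

      This is not a Puppeteer problem. The function you pass to page.evaluate does not return anything from the DOM, so results is undefined. Return the values you collected: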

      const puppeteer = require('puppeteer');
      
      let scrape = async (i, url) => {
      const browser = await puppeteer.launch({
          headless: false // Show Browser
      });
      
      // Load a new page
      const page = await browser.newPage();
      
      // Set viewport size
      await page.setViewport({ width: 1366, height: 768, deviceScaleFactor: 1 });
      
      // Go to URL
      await page.goto(`${url}`, { waitUntil: 'networkidle2' });
      
      // Run the scrape over the page
      const results = await page.evaluate(() => {
          // H2 Heading
          let title = document.querySelector('div.wsite-section-elements > h2.wsite-content-title').innerText;
          // Image
          let imageURL = document.querySelector('div.wsite-section-elements > div > div > a> img').getAttribute('src');
          // Paragraph
          let txtContent = document.querySelector('div.wsite-section-elements > div.paragraph').innerText;
      
          return { title, imageURL, txtContent };
      
      });
      
      //Close Browser
      await browser.close();
      
      // Return scrape results
      return results;
      };
      
      (async () => {
      // Pages to scrape
      let pageURLs = ['https://www.bibleed.com/the-divine-origin-of-the-bible.html','https://www.bibleed.com/the-bible-our-guide.html'];
      
      for(let i = 0; i < pageURLs.length; i++)
      {
          await scrape(i, pageURLs[i]).then((value) => {
              console.log(i, ': ', value);
          });
      }
      
      // Write to file
      //const fs = require('fs');
      //fs.writeFileSync('webScrape3.txt', JSON.stringify(result), err => err ? console.log(err): null);
      })();
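The commented-out file write above refers to a `result` variable that is never defined, and `fs.writeFileSync` does not take a callback. One possible way to collect every page's data and write it out once is sketched below. `scrapeAll`, its `scrapeFn` parameter, and the output file name are hypothetical names introduced for illustration; `scrapeFn` is assumed to be any async `(index, url)` function that resolves to an object, such as the `scrape` function above.

```javascript
const fs = require('fs');

// Hypothetical helper: runs scrapeFn over each URL in sequence, collects the
// results in an array, and writes the array to outFile as JSON.
async function scrapeAll(pageURLs, scrapeFn, outFile) {
  const results = [];
  for (let i = 0; i < pageURLs.length; i++) {
    try {
      const value = await scrapeFn(i, pageURLs[i]);
      results.push({ url: pageURLs[i], ...value });
    } catch (err) {
      // Record the failure and continue, so one bad page does not
      // discard the data already collected from the others.
      results.push({ url: pageURLs[i], error: err.message });
    }
  }
  // writeFileSync is synchronous and throws on failure; it has no callback.
  fs.writeFileSync(outFile, JSON.stringify(results, null, 2));
  return results;
}
```

It could be called as `await scrapeAll(pageURLs, scrape, 'webScrape3.json')` inside the async wrapper. Because `document.querySelector` returns null when a selector matches nothing, a page with different markup would make `scrape` throw; catching the error per URL is one way to keep a run over 113 pages from stopping partway through.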
      

      【Comments】:
