【Question title】: Nesting async functions using NodeJS and Puppeteer
【Posted】: 2018-06-04 18:42:13
【Question description】:

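I'm building a small scraper that collects the links on a search results page and then follows each link to scrape details from the individual result page. So far I have two scrapers: one for the results page and one for a single result's page. Here is the truncated scraper for the results page: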

const puppeteer = require('puppeteer');
var URLList = new Array;
let scrapeResults = async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('www.******.com/search_result');
    await page.waitFor(1000);

    const RESULT_SELECTOR ='#innerLeft ';
    const RESULT_CLASS = 'dspListings2';
    // scrape result page for URLs and put them in global URLList for further processing    
    URLList.push(results);
    browser.close();
};
scrapeResults();

Here is the scraper for a single result page (after following a link):

var details=''; //to be populated by scrapeListings function
const puppeteer = require('puppeteer');
URLList = [url1, url2, url3] // URLList is populated by the scrapeResults() function

URLList.forEach(async (url) => {
  const scrapeResultDetails = async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    await page.goto(url);
    await page.waitFor(1000);

    const RESULT_DETAILS_SELECTOR = '#details_layout > p';
    // scrape for  result details
    // assign result details to global details variable for further processing
    details = resultDetails;
    browser.close();
  };
  scrapeResultDetails();
});

The results page yields a list of URLs, which I then want to pass to the second scraper so that the forEach loop opens each URL in the list and fetches its details.

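The problem is that I can't call the second scraper because it sits inside the first one. Both use async/await, which causes an error. For example, this is what I tried, but it doesn't work: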

const puppeteer = require('puppeteer');
var URLList = new Array;
var details=''; //to be populated by scrapeListings function

let scrapeResults = async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('www.******.com/search_result');
    await page.waitFor(1000);

    const RESULT_SELECTOR ='#innerLeft ';
    const RESULT_CLASS = 'dspListings2';
    // scrape result page for URLs and put them in global URLList for further processing    
    URLList.push(results);

    browser.close();

    URLList.forEach(async (url) => {
      const scrapeResultDetails = async () => {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto(url);
        await page.waitFor(1000);
        const RESULT_DETAILS_SELECTOR = '#details_layout > p';
        // scrape for  result details
        // assign result details to global details variable for further processing
        details = resultDetails;
        browser.close();
      };
      scrapeResultDetails();
    });


};
scrapeResults();

Any ideas? Also, where should I declare the global variables used by the loop?

【Question discussion】:

    Tags: javascript loops web-scraping async-await puppeteer


    【Solution 1】:

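    You need to switch to a for-of loop instead of .forEach. .forEach does not wait for the promises returned by async callbacks, whereas an await inside a for-of loop pauses each iteration until the call finishes. You also missed a few await keywords.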
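The difference between the two loops can be seen without Puppeteer. The following is a minimal sketch, assuming a hypothetical `fakeScrape` helper that stands in for one page visit; it is not part of the original code.

```javascript
// Hypothetical stand-in for one async scraping step (not a Puppeteer API).
// It records its completion in `log` after a short delay.
const fakeScrape = (url, log) =>
  new Promise((resolve) => {
    setTimeout(() => {
      log.push(`done ${url}`);
      resolve(url);
    }, 10);
  });

const urls = ['a', 'b', 'c'];

// .forEach starts every callback and returns right away;
// the promises returned by the async callbacks are ignored.
const withForEach = async () => {
  const log = [];
  urls.forEach(async (url) => {
    await fakeScrape(url, log);
  });
  log.push('loop finished');
  return [...log]; // snapshot taken before any fakeScrape call has completed
};

// for...of inside an async function pauses at each await,
// so the URLs are processed one after another.
const withForOf = async () => {
  const log = [];
  for (const url of urls) {
    await fakeScrape(url, log);
  }
  log.push('loop finished');
  return [...log];
};

withForEach().then((log) => console.log('forEach:', log));
withForOf().then((log) => console.log('for...of:', log));
```

With .forEach the function reaches "loop finished" before any scrape has completed, which is why code placed after the loop cannot rely on its results.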

    • I strongly recommend that you stop using global variables and simply return the data from your functions.

    Please see my comments:

    const puppeteer = require('puppeteer');
    
    var URLList = [];
    
    var details=''; //to be populated by scrapeListings function
    
    // Scrapes a single result page; `url` is passed in by the caller.
    const scrapeResultDetails = async (url) => {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto(url);
        await page.waitFor(1000);
        const RESULT_DETAILS_SELECTOR = '#details_layout > p';
    
        // scrape for result details (elided in the question; it produces `resultDetails`)
        // TODO: Global variables are bad, consider returning details from the function.
        details = resultDetails;
    
        // TODO: `await` was missing here.
        await browser.close();
    };
    
    let scrapeResults = async () => {
        const browser = await puppeteer.launch();
        const page = await browser.newPage();
        await page.goto('www.******.com/search_result');
        await page.waitFor(1000);
    
        const RESULT_SELECTOR ='#innerLeft ';
        const RESULT_CLASS = 'dspListings2';
    
        // scrape result page for URLs (elided in the question; it produces `results`)
        URLList.push(results);
    
        // TODO: `await` was missing here.
        await browser.close();
    
        // TODO: Please use a for-of loop here, you won't have any async problems then.
        for (let url of URLList) {
            // TODO: `details` is going to be populated after each iteration.
            // TODO: Although consider having `const details = await scrapeResultDetails(url);` here.
            await scrapeResultDetails(url);
        }
    };
    
    scrapeResults();
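The TODO comments above suggest returning the details instead of writing to a global. A minimal sketch of that shape follows. The two scraper functions here are hypothetical stubs that return fixed values; real versions would open pages with Puppeteer as in the code above and return what they scraped.

```javascript
// Hypothetical stub for the results-page scraper: returns the list of URLs
// instead of pushing them into a global URLList.
const scrapeResultUrls = async () => ['url1', 'url2', 'url3'];

// Hypothetical stub for the detail-page scraper: returns the details
// for one URL instead of assigning a global `details` variable.
const scrapeResultDetails = async (url) => `details for ${url}`;

// Ties the two steps together; all state lives in local variables.
const scrapeAll = async () => {
  const urls = await scrapeResultUrls();
  const allDetails = [];
  for (const url of urls) {
    // Each detail page is awaited before the next one starts.
    allDetails.push(await scrapeResultDetails(url));
  }
  return allDetails;
};

scrapeAll()
  .then((allDetails) => console.log(allDetails))
  .catch((err) => console.error(err));
```

Because each function returns its result, the question of where to declare globals for the loop goes away: the caller holds the data in local variables.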
