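【Posted】: 2018-06-04 18:42:13
【Question】:
I'm building a small scraper that collects the links on a search results page and then follows each link to scrape the details from the linked page. So far I have two scrapers: one scrapes the results page, and the other scrapes an individual result's page. Here is the truncated scraper for the results page: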
const puppeteer = require('puppeteer');
var URLList = new Array;
let scrapeResults = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('www.******.com/search_result');
  await page.waitFor(1000);
  const RESULT_SELECTOR ='#innerLeft ';
  const RESULT_CLASS = 'dspListings2';
  // scrape result page for URLs and put them in global URLList for further processing
  URLList.push(results);
  browser.close();
};
scrapeResults();
Here is the scraper for an individual result page (the page reached after clicking a link):
var details=''; //to be populated by scrapeListings function
const puppeteer = require('puppeteer');
URLList = [url1, url2, url3] // URLList is populated by the scrapeResults() function
URLList.forEach(async (url) => {
  const scrapeResultDetails = async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto(url);
    await page.waitFor(1000);
    const RESULT_DETAILS_SELECTOR = '#details_layout > p';
    // scrape for result details
    // assign result details to global details variable for further processing
    details = resultDetails;
    browser.close();
  };
  scrapeResultDetails();
});
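The results page returns a list of URLs, which I then want to pass to the second scraper so that the forEach loop opens each URL in the list and fetches its details.
Problem
The problem is that I can't call the second scraper because it is inside the first one. Both use async/await, which causes errors. For example, here is what I tried, but it doesn't work: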
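As a rough sketch of the hand-off described above (not the original code), the first scraper could return its URL list and a small coordinating function could pass it to the second one. Several things here are assumptions: the `browser` argument is any object exposing Puppeteer's `newPage()`, the combined selector `#innerLeft .dspListings2 a` is a guess at how `RESULT_SELECTOR` and `RESULT_CLASS` fit together, and `scrapeAll` is a hypothetical name.

```javascript
// Sketch only: `browser` is passed in, so one browser is shared by both steps.

// Collect the link targets from the search result page.
async function scrapeResults(browser, searchUrl) {
  const page = await browser.newPage();
  try {
    await page.goto(searchUrl);
    // Assumed selector: anchors inside the listing elements.
    return await page.$$eval('#innerLeft .dspListings2 a', (links) =>
      links.map((a) => a.href)
    );
  } finally {
    await page.close();
  }
}

// Collect the paragraph texts from one detail page.
async function scrapeResultDetails(browser, url) {
  const page = await browser.newPage();
  try {
    await page.goto(url);
    return await page.$$eval('#details_layout > p', (paragraphs) =>
      paragraphs.map((p) => p.textContent.trim())
    );
  } finally {
    await page.close();
  }
}

// Hypothetical coordinator: run the two steps in order and return the
// collected data instead of writing to global variables.
async function scrapeAll(browser, searchUrl) {
  const urls = await scrapeResults(browser, searchUrl);
  const details = [];
  for (const url of urls) {
    details.push({ url, paragraphs: await scrapeResultDetails(browser, url) });
  }
  return details;
}
```

With the real library, `browser` would come from `await puppeteer.launch()` and be closed with `await browser.close()` once `scrapeAll` has resolved. Returning the values this way means neither `URLList` nor `details` needs to be global.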
const puppeteer = require('puppeteer');
var URLList = new Array;
var details=''; //to be populated by scrapeListings function
let scrapeResults = async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('www.******.com/search_result');
  await page.waitFor(1000);
  const RESULT_SELECTOR ='#innerLeft ';
  const RESULT_CLASS = 'dspListings2';
  // scrape result page for URLs and put them in global URLList for further processing
  URLList.push(results);
  browser.close();
  URLList.forEach(async (url) => {
    const scrapeResultDetails = async () => {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto(url);
      await page.waitFor(1000);
      const RESULT_DETAILS_SELECTOR = '#details_layout > p';
      // scrape for result details
      // assign result details to global details variable for further processing
      details = resultDetails;
      browser.close();
    };
    scrapeResultDetails();
  });
};
scrapeResults();
Any ideas? Also, where should I declare my global variables for the loop?
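The question does not show the actual error message, so the following is only an illustration of one behavior that may be involved: `Array.prototype.forEach` does not wait for the promises returned by an async callback, whereas a `for...of` loop inside an async function does. `fakeVisit` is a hypothetical stand-in for opening a page, used so the example runs without Puppeteer.

```javascript
// Hypothetical stand-in for a page visit: resolves after a short delay.
function fakeVisit(url, log) {
  return new Promise((resolve) => {
    setTimeout(() => {
      log.push(`done ${url}`);
      resolve(url);
    }, 10);
  });
}

// forEach ignores the promises returned by the async callback, so the
// code after the loop runs before any visit has finished.
function withForEach(urls) {
  const log = [];
  urls.forEach(async (url) => {
    await fakeVisit(url, log);
  });
  log.push('after loop');
  return log;
}

// for...of inside an async function waits for each visit in turn.
async function withForOf(urls) {
  const log = [];
  for (const url of urls) {
    await fakeVisit(url, log);
  }
  log.push('after loop');
  return log;
}
```

In the first version, anything that relies on the visits being finished (such as reading a global `details` variable) would run too early. The second version keeps the visits sequential, which also avoids launching many browsers at once.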
【Discussion】:
Tags: javascript loops web-scraping async-await puppeteer