【Question title】: Puppeteer element is console.log'able but returns undefined in Puppeteer
【Posted】: 2020-05-27 11:28:23
【Question】:

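I am trying to scrape a web page that has an h3 tag nested under an a tag. I can get the a tag just fine, but when I try to read the innerText of the h3, I get an undefined value.

This is my scraping code: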

const puppeteer = require('puppeteer');
const pageURL = "https://producthunt.com";

const webScraping = async pageURL => {
    const browser = await puppeteer.launch({
        headless: false,
        args: ["--no-sandbox"]
    });
    const page = await browser.newPage();
    let dataObj = {};

    try {
        await page.goto(pageURL, { waitUntil: 'networkidle2' });

        const publishedNews = await page.evaluate(() => {
            const newsDOM = document.querySelectorAll("main ul li");

            let newsList = [];
            newsDOM.forEach(linkElement => {
                const text = linkElement.querySelector("a").textContent;
                const innerText = linkElement.querySelector("a").innerText;
                const url = linkElement.querySelector("a").getAttribute('href');

                const title = linkElement.querySelector("h3").innerText;
                console.log(title);

                newsList.push({
                    title,
                    text,
                    url
                });
            });
            return newsList;
        });

        dataObj = {
            amount: publishedNews.length,
            publishedNews
        };

    } catch (e) {
        console.log(e);
    }

    console.log(dataObj);
    browser.close();
    return dataObj;
};

webScraping(pageURL).catch(console.error);

The console.log call works fine, but Puppeteer throws:

Cannot read property 'innerText' of null

【Question comments】:

    Tags: web-scraping web-crawler puppeteer domcrawler


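    【Answer 1】:

    Your code looks fine overall, but it never checks whether the h3 lookup returned null. Some list items have no h3, so querySelector("h3") returns null for them and reading innerText throws. Add an if statement before accessing the innerText property, or use the code I've left below.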

    const puppeteer = require('puppeteer');
    const pageURL = "https://producthunt.com";
    
    const webScraping = async pageURL => {
        const browser = await puppeteer.launch({
            headless: false,
            args: ["--no-sandbox"]
        });
        const page = await browser.newPage();
        let dataObj = {};
    
        try {
            await page.goto(pageURL, { waitUntil: 'networkidle2' });
    
            const publishedNews = await page.evaluate(() => {
                let newsList = [];
                const newsDOM = document.querySelectorAll("main ul li");
    
                newsDOM.forEach(linkElement => {
                    const aTag = linkElement.querySelector("a");
    
                    const text = aTag.textContent;
                    const innerText = aTag.innerText;
                    const url = aTag.getAttribute('href');
    
                    let title = aTag.querySelector("h3");
                    // some <a> elements have no <h3>; read innerText
                    // only when the lookup found one, otherwise title
                    // stays null instead of throwing a TypeError.
                    if (title) title = title.innerText;
    
                    console.log(title);
    
                    // changed the object structure to add a key for each attr
                    newsList.push({
                        title: title,
                        text: text,
                        url: url
                    });
                });
    
                return newsList;
            });
    
            // changed the object structure to add a key for the array
            dataObj = {
                amount: publishedNews.length,
                list: publishedNews
            };
    
        } catch (e) {
            console.log(e);
        }
    
        console.log({receivedData: dataObj});
        await browser.close();
        return dataObj;
    };
    
    webScraping(pageURL).catch(console.error);
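
The same null check can also be pulled out into a small standalone function. The sketch below is only one possible arrangement: `extractNews` is a hypothetical helper name, not something from the question, and it assumes that list items without any `a` tag should simply be skipped. It uses no outer variables, so it should be usable as the page function for Puppeteer's `page.$$eval`, which passes the array of matched elements to it.

```javascript
// Hypothetical helper: turns a list of <li> elements into plain objects.
// Items without an <a> are skipped (an assumption); a missing <h3>
// yields title: null instead of a TypeError.
const extractNews = items =>
    Array.from(items)
        .map(li => {
            const aTag = li.querySelector("a");
            if (!aTag) return null; // no link at all: nothing to report

            const h3 = li.querySelector("h3");
            return {
                title: h3 ? h3.innerText : null,
                text: aTag.textContent,
                url: aTag.getAttribute("href")
            };
        })
        .filter(item => item !== null);

// Possible use with Puppeteer (not executed here):
//   const publishedNews = await page.$$eval("main ul li", extractNews);
```

Because the helper only relies on `querySelector`, `textContent`, `innerText` and `getAttribute`, it can be checked against simple stand-in objects without launching a browser.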
    
    

    Let me know if this solves your problem!

    【Comments】:
