How to crawl a website using Node.js: answers

【Question title】: How to crawl using Node.js
【Posted】: 2021-09-11 04:25:42
【Question】:

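I can't believe I'm asking something so obvious, but I still get the wrong output in the console log.

The console prints "[]" as the crawl result, even though I have checked for typos at least 10 times. Anyway, here is the JavaScript code.

I want to crawl the website.

This is the kangnam.js file: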

const axios = require('axios');
const cheerio = require('cheerio');
const log = console.log;

const getHTML = async () => {
    try {
        return await axios.get('https://web.kangnam.ac.kr', {
            headers: {
                Accept: 'text/html'
            }
        });
    } catch (error) {
        console.log(error);
    }
};

getHTML()
    .then(html => {
    let ulList = [];
    const $ = cheerio.load(html.data);
    const $allNotices = $("ul.tab_listl div.list_txt");
    
    $allNotices.each(function(idx, element) {
        ulList[idx] = {
            title : $(this).find("list_txt title").text(),
            url : $(this).find("list_txt a").attr('href')
        };
    });
    
    const data = ulList.filter(n => n.title);
    return data;
}). then(res => log(res));

I have checked and revised it at least 10 times, yet the script still prints this result:

root@goorm:/workspace/web_platform_test/myapp/kangnamCrawling(master)# node kangnam.js
[]
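
An empty array here does not by itself show that no elements were matched, because the final `filter` call drops every entry whose `title` is falsy. The sketch below uses hand-written entries (an assumption about what the loop produces when the inner lookups find nothing) to show that effect in isolation:

```javascript
// Entries shaped like the ones the loop builds when the inner
// lookups match nothing: .text() gives '' and .attr() gives undefined.
const ulList = [
    { title: '', url: undefined },
    { title: '', url: undefined }
];

// '' is falsy, so every entry is filtered out.
const data = ulList.filter(n => n.title);

// Comparing the two counts tells "nothing matched" apart from
// "matched, but every title was empty".
console.log({ matched: ulList.length, kept: data.length });
```

Logging both counts before and after the filter is a quick way to narrow down which selector is at fault.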

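【Comments】:

What do you see if you remove this line: ulList.filter?
Then it shows an error.
I ran the same code in the browser console and got the same empty result, because the title is empty.
Is the content you are after present in the raw source of the page at the URL you are retrieving? Or is it drawn by JavaScript after the page itself starts loading?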
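
The last comment can be checked mechanically: if a class name you expect is missing from the raw response body, the content is probably rendered by client-side JavaScript, and a static parser will not see it. A minimal sketch, where `appearsInRawHtml` and both sample strings are hypothetical and stand in for the real response:

```javascript
// Hypothetical helper: does the raw HTML text contain the marker at all?
function appearsInRawHtml(html, marker) {
    return html.includes(marker);
}

// Stand-in for a page whose list is present in the served HTML.
const staticPage = '<ul class="tab_list"><li><div class="list_txt">notice</div></li></ul>';
// Stand-in for a page that is filled in later by a script.
const scriptedPage = '<div id="app"></div>';

console.log(appearsInRawHtml(staticPage, 'class="list_txt"'));
console.log(appearsInRawHtml(scriptedPage, 'class="list_txt"'));
```

In the script from the question, the string to test would be `html.data`, the body returned by `axios.get`.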

Tags: javascript node.js web-crawler

【Solution 1】:

Man, I think the problem is that you are parsing the result incorrectly.

$allNotices.each(function(idx, element) {
    ulList[idx] = {
        title : $(this).find("list_txt title").text(),
        url : $(this).find("list_txt a").attr('href')
    };
});

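The data you are trying to parse lives in the node wrapped by $(this). A Cheerio object is an array-like wrapper, and here it holds a single DOM node at index 0; everything you are looking for is contained in that node. Note that .find() on this wrapper is Cheerio's method, not Array.prototype.find: it takes a CSS selector and searches the descendants of the node. The selector "list_txt title" has no leading dot, so it is read as tag names (a <title> element inside a <list_txt> element) and matches nothing. Since $(this) is already the div.list_txt element, searching for list_txt beneath it would not match either. On an empty match, .text() returns an empty string and .attr('href') returns undefined, so the later filter removes every entry.

For contrast, the array method of the same name is documented here: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Array/find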

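Instead, take the first child of the matched node and read its attributes directly (this assumes the link is the first child of each div.list_txt). You also don't need $(this): the element parameter is the same node that this refers to, so using element directly is slightly more efficient because it avoids wrapping the node again.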

  $allNotices.each(function(idx, element) {
      ulList[idx] = {
          title : element.children[0].attribs.title,
          url : element.children[0].attribs.href
      };
  });
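
Reading `children[0]` only works when the link really is the first child. If the markup has whitespace or a line break before the `<a>`, the first child is a text node with no `attribs`, and the property access throws. A minimal sketch of a more defensive lookup, assuming the parsed nodes expose `type`, `name`, `attribs`, and `children` as Cheerio's nodes do; `firstAnchor` and `mockElement` are hypothetical names used only for illustration:

```javascript
// Hypothetical helper: first <a> child of a parsed node, skipping text nodes.
function firstAnchor(element) {
    return (element.children || []).find(
        child => child.type === 'tag' && child.name === 'a'
    );
}

// Hand-built stand-in for a parsed <div class="list_txt"> whose
// first child is whitespace rather than the link.
const mockElement = {
    type: 'tag',
    name: 'div',
    attribs: { class: 'list_txt' },
    children: [
        { type: 'text', data: '\n  ' },
        { type: 'tag', name: 'a', attribs: { href: '/notice/1', title: 'First notice' }, children: [] }
    ]
};

const link = firstAnchor(mockElement);
// Fall back to null instead of throwing when no link is present.
const notice = link ? { title: link.attribs.title, url: link.attribs.href } : null;

console.log(notice);
```

Here `mockElement.children[0].attribs` is undefined, which is the case the positional version cannot handle.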

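That should now populate your data array correctly. Always inspect the structure of the data you are parsing, because that is the only way to parse it correctly. Anyway, I hope this solves your problem!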
