【Posted】: 2021-02-18 17:17:02
【Question】:
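I have been working on a scraper project.
I have implemented most of it, but I am stuck on one point.
First, the workflow: the scraping-service module calls the scrapers and waits for the promises returned by those functions to resolve. Each scraper fetches its data and passes it to the data_functions object, where the data is merged, validated, and inserted into the database.
Here is the code: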
scraping-service
const olxScraper = require('./scrapers/olx-scraper');
const santScraper = require('./scrapers/sant-scraper');
// Call the scrapers for the sites we want apartment data from.
const data_functions = require('./data-functions/dataF');
let count = 1;
Promise.all([
  olxScraper.olxScraper(count),
  santScraper.santScraper(count),
]).then(() => data_functions.validateData(data_functions.mergedApartments));
Here I wait for the promises of both functions and then pass the merged data to the validateData method in data_functions.
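To make that flow concrete, here is a minimal sketch of how Promise.all hands back the resolved values of its inputs, in input order. The names fakeOlxScraper, fakeSantScraper, and scrapeAll are hypothetical stand-ins and not part of the project above; the sketch assumes each scraper resolves with its own array, which is not how the code in this post currently works.

```javascript
// Hypothetical stand-ins for the real scrapers: each resolves with its own array.
const fakeOlxScraper = async (count) => [{ id: 'olx-1', page: count }];
const fakeSantScraper = async (count) => [{ id: 'sant-1', page: count }];

// Promise.all resolves with an array of results in the same order as its input,
// so the results can be combined without any shared module-level state.
const scrapeAll = (count) =>
  Promise.all([fakeOlxScraper(count), fakeSantScraper(count)]).then(
    ([olxData, santData]) => [...olxData, ...santData]
  );

scrapeAll(1).then((merged) => console.log(merged));
```

With this shape, the .then callback only runs after both scrapers have resolved, and it receives their data directly.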
Here is the scraper:
const axios = require('axios'); // npm package - promise-based HTTP client
const cheerio = require('cheerio'); // npm package - used for web scraping in server-side implementations
const data_functions = require('../data-functions/dataF');
// olxScraper takes the page count as a parameter, which is sent from the scraping-service file.
exports.olxScraper = async (count) => {
  // URL where the data is located.
  const url = `https://www.olx.ba/pretraga?vrsta=samoprodaja&kategorija=23&sort_order=desc&kanton=9&sacijenom=sacijenom&stranica=${count}`;
  const olxScrapedData = [];
  try {
    await load_url(url, olxScrapedData); // passing the URL and an empty array
  } catch (error) {
    console.log(error);
  }
};
// Loads the URL and starts the process of fetching the raw data.
const load_url = async (url, olxScrapedData) => {
  await axios.get(url).then((response) => {
    const $ = cheerio.load(response.data);
    fetch_raw_html($).each((index, element) => {
      process_single_article($, index, element, olxScrapedData);
    });
    process_fetching_squaremeters(olxScrapedData);
    // If I call data_functions.mergeData(olxScrapedData) here instead, it works.
  });
};
// Selects the raw HTML, restricted to the divs we want.
const fetch_raw_html = ($) => {
  return $('div[id="rezultatipretrage"] > div')
    .not('div[class="listitem artikal obicniArtikal i index"]')
    .not('div[class="obicniArtikal"]');
};
// All the logic for extracting the data we want from the raw HTML.
const process_single_article = ($, index, element, olxScrapedData) => {
  $('span[class="prekrizenacijena"]').remove();
  const getLink = $(element).find('div[class="naslov"] > a').attr('href');
  const getDescription = $(element).find('div[class="naslov"] > a > p').text();
  const getPrice = $(element)
    .find('div[class="datum"] > span')
    .text()
    .replace(/\.| ?KM$/g, '')
    .replace(' ', '');
  const getPicture = $(element).find('div[class="slika"] > img').attr('src');
  // Build the array of objects with the scraped data.
  olxScrapedData[index] = {
    id: getLink.substring(27, 35),
    link: getLink,
    description: getDescription,
    price: parseFloat(getPrice),
    picture: getPicture,
  };
};
// Square meters have to be fetched for every single article.
// This function loads every link in the olxScrapedData array and updates each apartment object with its square-meter value.
const process_fetching_squaremeters = (olxScrapedData) => {
  const fetchSquaremeters = Promise.all(
    olxScrapedData.map((item) => {
      return axios.get(item.link).then((response) => {
        const $ = cheerio.load(response.data);
        const getSquaremeters = $('div[class="df2 "]')
          .first()
          .text()
          .replace('m2', '')
          .replace(',', '.')
          .split('-')[0];
        item.squaremeters = Math.round(getSquaremeters);
        item.pricepersquaremeter = Math.round(
          parseFloat(item.price) / parseFloat(getSquaremeters)
        );
      });
    })
  );
  fetchSquaremeters.then(() => {
    data_functions.mergeData(olxScrapedData); // Sending the final array to the mergeData function.
    return olxScrapedData;
  });
};
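The square-meter parsing inside process_fetching_squaremeters can be tried in isolation. The sketch below pulls the same string steps into two small helpers; parseSquaremeters and pricePerSquaremeter are hypothetical names, and the sample inputs are assumptions about what the detail page text might look like, not values taken from the site.

```javascript
// Hypothetical helper: turns text such as "54,5 m2" or "50-60 m2" into a number.
// It mirrors the replace/split chain used in the scraper and adds a NaN guard.
const parseSquaremeters = (text) => {
  const cleaned = text.replace('m2', '').replace(',', '.').split('-')[0].trim();
  const value = parseFloat(cleaned);
  return Number.isNaN(value) ? null : value;
};

// Hypothetical helper: rounded price per square meter, or null when the area is unknown.
const pricePerSquaremeter = (price, squaremeters) =>
  squaremeters ? Math.round(price / squaremeters) : null;

console.log(parseSquaremeters('54,5 m2'));
console.log(pricePerSquaremeter(109000, parseSquaremeters('54,5 m2')));
```

Returning null for unparseable text is a design choice made for this sketch; it avoids storing NaN on the apartment objects.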
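Now, if I put console.log(olxScrapedData) inside fetchSquaremeters.then, it prints the scraped apartments, but it will not call data_functions.mergeData(olxScrapedData). If I put that call in load_url instead, the function is triggered and the data is merged, but without the square-meter values, and I really need that data.
So my question is: how can I make this work? Do I need to call the function somewhere else?
All I want is to send the final olxScrapedData to mergeData so that the arrays from the different scrapers are merged into one.
Thanks!
Edit: this is what the other scraper looks like: https://jsfiddle.net/oh03mp8t/. Note that there are no promises in that scraper.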
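One possible reading of the symptom is an ordering issue: process_fetching_squaremeters does not return its promise, so load_url, and therefore Promise.all in scraping-service, may resolve before the detail requests finish. The sketch below illustrates the general pattern of returning the inner promise so the caller waits for it. It is an assumption-laden illustration, not a confirmed fix: delay stands in for axios.get, and fetchDetails, loadPage, and run are hypothetical names.

```javascript
// Simulated network delay; stands in for axios.get in the real scraper.
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Returns its promise so callers can wait until every item has been updated.
const fetchDetails = (items, log) =>
  Promise.all(
    items.map(async (item) => {
      await delay(10);
      item.squaremeters = 50; // made-up value for illustration
    })
  ).then(() => {
    log.push('merge'); // the place where mergeData would be called
    return items;
  });

// Returning the inner promise keeps the outer promise pending until the details are in.
const loadPage = async (log) => {
  const items = [{ id: 'a' }, { id: 'b' }];
  return fetchDetails(items, log);
};

const run = async () => {
  const log = [];
  const items = await loadPage(log);
  log.push('validate'); // the place where validateData would be called
  return { items, log };
};

run().then(({ log }) => console.log(log.join(' -> ')));
```

If the return in loadPage is dropped, loadPage resolves immediately and the 'validate' step runs before 'merge', which would match the behavior described in the question.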
【Comments】:
Tags: javascript node.js promise