【Posted】: 2020-07-16 15:55:12
【Question】:
I am a complete beginner with JavaScript and web scraping using puppeteer, and I am trying to get the scores of a simple Euroleague round from:
https://www.euroleague.net/main/results?gamenumber=28&phasetypecode=RS&seasoncode=E2019
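By inspecting the score list on the page above, I found that it is a div element that contains other divs, which display the statistics.
Here is the HTML for a single game between two teams (the games listed below this one add more divs of the same shape):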
//score list
<div class="wp-module wp-module-asidegames wp-module-5lfarqnjesnirthi">
//the data-code increases to "euro_245" ...
<div class="">
<div class="game played" data-code="euro_244" data-date="1583427600000" data-played="1">
<a href="/main/results/showgame?gamecode=244&seasoncode=E2019" class="game-link">
<div class="club">
<span class="name">Zenit St Petersburg</span>
<span class="score homepts winner">76</span>
</div>
<div class="club">
<span class="name">Zalgiris Kaunas</span>
<span class="score awaypts ">75</span>
</div>
<div class="info">
<span class="date">March 5 18:00 CET</span>
<span class="live">
LIVE <span class="minute"></span>
</span>
<span class="final">
FINAL
</span>
</div>
</a>
</div>
//more teams
</div>
</div>
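What I want is to iterate over the outer div element, get the two teams and the score of each game, and store them in a JSON file. However, since I am a complete beginner, I don't understand how to iterate over the HTML above. This is my scraping code that fetches the element: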
const puppeteer = require('puppeteer');
const sleep = (delay) => new Promise((resolve) => setTimeout(resolve,delay));
async function getTeams(url){
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
await sleep(3000);
const games = await page.$x('//*[@id="main-one"]/div/div/div/div[1]/div[1]/div[3]');
//this is where I will execute the iteration part to get the matches with their scores
await sleep(2000);
await browser.close();
}
getTeams('https://www.euroleague.net/main/results?gamenumber=28&phasetypecode=RS&seasoncode=E2019');
I would appreciate it if you could guide me through the iteration part. Thanks in advance.
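One possible shape for the iteration step, as a minimal sketch rather than a tested solution. It assumes the class names in the HTML excerpt (`game`, `club`, `name`, `score`, `date`) are stable and that every game holds exactly two `.club` blocks, home team first. The function names `extractGames` and `getGames` and the `outFile` parameter are hypothetical, introduced only for illustration. `page.$$eval` runs the callback inside the page on every element matching the selector and returns the serialisable result to Node, which avoids the long XPath.

```javascript
const fs = require('fs');

// Runs inside the page (it is passed to page.$$eval), so it may only use DOM APIs.
// The selectors are assumptions taken from the HTML excerpt in the question.
function extractGames(gameDivs) {
  return gameDivs.map((game) => {
    const clubs = Array.from(game.querySelectorAll('.club')).map((club) => {
      const scoreText = club.querySelector('.score').textContent.trim();
      return {
        name: club.querySelector('.name').textContent.trim(),
        // A game that has not been played may have an empty score.
        score: scoreText === '' ? null : Number(scoreText),
      };
    });
    return {
      code: game.getAttribute('data-code'),
      date: game.querySelector('.date').textContent.trim(),
      home: clubs[0], // assumption: the first .club block is the home team
      away: clubs[1],
    };
  });
}

// Hypothetical wrapper: opens the page, collects every game, writes a JSON file.
async function getGames(url, outFile) {
  // Loaded lazily so extractGames can be used without puppeteer installed.
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(url);
    // Wait for the game list to exist instead of sleeping for a fixed time.
    await page.waitForSelector('div.game');
    const games = await page.$$eval('div.game', extractGames);
    fs.writeFileSync(outFile, JSON.stringify(games, null, 2));
    return games;
  } finally {
    await browser.close();
  }
}
```

Calling `getGames(url, 'games.json')` with the results URL from the question would then write one object per game, each with `code`, `date`, `home`, and `away` fields. The file name is only an example, and the sketch has not been run against the live site.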
【Comments】:
Tags: javascript html web-scraping puppeteer