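【Posted】:2022-01-06 14:50:32
【Question】:
I am trying to scrape a web page using Node and Cheerio. Everything is returned as I expect except the hrefs.
I successfully get the "header" values with .find('h3').text() and the "description" values with .find('a').text(), but for the "link", .find('a').attr('href'); returns only the first value. This confuses me because the "description" text lives inside the same anchors.
I found that if I remove .attr('href'); and return .find('a'), the link text (href) is displayed as expected. I could massage the returned value to make that work if necessary, but I would rather do this properly.
Script: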
const cheerio = require("cheerio");
const axios = require("axios");

axios.get("http://localhost:8000/sample_page_2.html").then(urlResponse => {
  const $ = cheerio.load(urlResponse.data);

  $('div.tos-post-type').each((i, element) => {
    const header = $(element)
      .find('h3')
      .text()
      .trim();
    console.log('------------------------------------------------------------------------------------');
    console.log('HEADER: ' + header);

    const link = $(element)
      .find('a')
      .attr('href');
    console.log('\nLINK(s): \n' + link);

    const description = $(element)
      .find('a')
      .text();
    console.log('\nDESCRIPTION(s): \n' + description + '\n');
    console.log('------------------------------------------------------------------------------------');
  });
});
Here is a snippet of the page I am scraping:
<div class="container tos-archive">
<div class="row justify-content-center">
<div class="col-lg-10">
<div class="row">
<div class="col-lg-6">
<div class="tos-post-type" style="background-image: url(https://www.example_domain.com/wp-content/hero-pattern.png)">
<div class="icon" style="background-image: url( https://www.example_domain.com/wp-content/legal.svg )"></div>
<h3>
Legal </h3>
<a href="https://www.example_domain.com/legal/terms-conditions/">
Terms & Conditions </a>
<a href="https://www.example_domain.com/legal/service-providers/">
Service Providers </a>
</div>
</div>
<div class="col-lg-6">
<div class="tos-post-type" style="background-image: url(https://www.example_domain.com/wp-content/hero-pattern.png)">
<div class="icon" style="background-image: url( https://www.example_domain.com/wp-content/policy.svg )"></div>
<h3>
Policies </h3>
<a target="" href="https://www.example_domain.com/privacy-policy/">
Privacy Policy </a>
<a target="" href="https://store.example_domain.com/EXHM/store?Action=DisplayEXCookiesPolicyPage">
Cookie Policy </a>
</div>
</div>
<div class="col-lg-6">
<div class="tos-post-type" style="background-image: url(https://www.example_domain.com/wp-content/hero-pattern.png)">
<div class="icon" style="background-image: url( https://www.example_domain.com/wp-content/clip-dark.svg )"></div>
<h3>
<a href="https://www.example_domain.com/compliance/">
Compliance </a>
</h3>
<a href="https://www.example_domain.com/compliance/ccpa/">
California Consumer Privacy Act (CCPA) </a>
<a href="https://www.example_domain.com/compliance/disaster-recovery/">
Disaster Recovery </a>
<a href="https://www.example_domain.com/compliance/gdpr/">
GDPR </a>
<a href="https://www.example_domain.com/compliance/pci-dss/">
PCI DSS </a>
<a href="https://www.example_domain.com/compliance/privacymark/">
PrivacyMark </a>
<a class="tos-view-all" href="https://www.example_domain.com/compliance/">
View All </a>
</div>
</div>
<div class="col-lg-6">
<div class="tos-post-type" style="background-image: url(https://www.example_domain.com/wp-content/hero-pattern.png)">
<div class="icon" style="background-image: url( https://www.example_domain.com/wp-content/mouse.svg )"></div>
<h3>
Other </h3>
<a href="https://www.example_domain.com/legal-other/eu-standard-solutions/">
EU Standard Solutions </a>
<a href="https://www.example_domain.com/legal-other/eu-standard-service-providers/">
EU Standard Service Providers </a>
<a href="https://www.example_domain.com/legal-other/data-exhibit/">
Data Exhibit </a>
<a href="https://www.example_domain.com/legal-other/data-standards/">
Data Standards </a>
<a href="https://www.example_domain.com/legal-other/payment-addenda/">
Payment Addenda </a>
</div>
</div>
</div>
</div>
</div>
</div>
Here is a snippet of the actual results:
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
HEADER: Policies
LINK(s):
https://www.example_domain.com/privacy-policy/
DESCRIPTION(s):
Privacy Policy
Cookie Policy
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
HEADER: Compliance
LINK(s):
https://www.example_domain.com/compliance/
DESCRIPTION(s):
Compliance
California Consumer Privacy Act (CCPA)
Disaster Recovery
GDPR
PCI DSS
PrivacyMark
View All
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
This is what I expected (multiple links):
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
HEADER: Policies
LINK(s):
https://www.example_domain.com/privacy-policy/
https://store.example_domain.com/EXHM/store?Action=DisplayEXCookiesPolicyPage
DESCRIPTION(s):
Privacy Policy
Cookie Policy
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
HEADER: Compliance
LINK(s):
https://www.example_domain.com/compliance/
https://www.example_domain.com/compliance/ccpa/
https://www.example_domain.com/compliance/disaster-recovery/
https://www.example_domain.com/compliance/gdpr/
https://www.example_domain.com/compliance/pci-dss/
https://www.example_domain.com/compliance/privacymark/
https://www.example_domain.com/compliance/
DESCRIPTION(s):
Compliance
California Consumer Privacy Act (CCPA)
Disaster Recovery
GDPR
PCI DSS
PrivacyMark
View All
------------------------------------------------------------------------------------
------------------------------------------------------------------------------------
Any idea what I am doing wrong?
Thanks!
【Comments】:
-
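This has nothing to do with Cheerio; it is simply how you access data with jQuery.
The attr() function returns a single value. If you need multiple values, you need to iterate over $(element).find("a") and extract the href from each result. Take a look at map().
-
Updated the title from Cheerio to jQuery...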
-
Also, why are you doing this?
-
@NathanielFlick We are populating an internal knowledge base with descriptions of, and links to, our public legal and compliance pages.
Tags: javascript jquery node.js href cheerio