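【Posted】: 2022-02-18 19:33:33
【Question】:
How can I efficiently convert HTML to text using NodeJS, i.e. outside the browser? I also want to convert entities such as &auml; to ä, not only remove the tags from the HTML.
Here is a Jest unit test for a function convertHtmlToText that performs this conversion: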
it('when extract from partial html should extract text', () => {
const html = `<p> &auml;&uuml;
\t<img alt="" src="http://www.test.org:80/imageupload/userfiles/2/images/world med new - 2022.jpg" style="width: 2000px; height: 1047px; max-width: 100%; height: auto;" /></p>
<p>
\tAn evening of music, silence and guiding thoughts to help us experience inner peace, connect with the Divine and share loving vibrations with the world. Join millions of people throughout the world to contribute in creating a wave of peace.</p>
<div>
\t </div>
<div>
\t<strong>Please join ....</strong></div>
<div>
\t </div>
<div>
\t<strong>Watch live: <a href="https://test.org/watchlive" target="_blank">test.org/watchlive</a></strong></div>`
const text = convertHtmlToText(html)
console.log(text)
expect(text).toContain("ä");
expect(text).toContain("ü");
expect(text).not.toContain("<");
expect(text).not.toContain(">");
});
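For reference, the behaviour the test expects could be sketched without any dependencies roughly as follows. This is only an illustration under stated assumptions: `decodeEntities` and `NAMED_ENTITIES` are hypothetical helpers, the entity table covers just a handful of names, and the regex-based tag stripping assumes no `>` appears inside attribute values. For real-world HTML, a maintained package such as html-to-text or a proper HTML parser is likely the more robust choice.

```javascript
// Hypothetical, deliberately tiny entity table; a complete one has 2000+ names.
const NAMED_ENTITIES = {
  amp: '&', lt: '<', gt: '>', quot: '"', apos: "'", nbsp: '\u00a0',
  auml: 'ä', ouml: 'ö', uuml: 'ü', Auml: 'Ä', Ouml: 'Ö', Uuml: 'Ü', szlig: 'ß',
};

// Decode numeric (&#228; / &#xE4;) and the known named entities.
// Unknown or out-of-range entities are left untouched.
function decodeEntities(text) {
  return text.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

function convertHtmlToText(html) {
  const withoutTags = html
    // Drop script/style elements together with their contents.
    .replace(/<(script|style)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, ' ')
    // Drop comments.
    .replace(/<!--[\s\S]*?-->/g, ' ')
    // Drop all remaining tags (assumes no ">" inside attribute values).
    .replace(/<[^>]*>/g, ' ');
  // Decode after stripping so that a decoded "<" is never mistaken for a tag,
  // then collapse runs of whitespace (including non-breaking spaces).
  return decodeEntities(withoutTags).replace(/\s+/g, ' ').trim();
}

console.log(convertHtmlToText('<p>&nbsp;&auml;&uuml;</p>')); // → "äü"
```

To use this from the Jest test above, the function would still need to be exported from its module and imported in the test file.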
【Discussion】:
Tags: javascript html node.js