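【Posted】:2020-02-28 17:24:33
【Question】:
Background:
I need to "textwrap" an HTML string so that <br> elements are applied only to the text within the HTML string.
I can apply styling to a text string (if only one type of styling is needed).
However, attaching styling to this string further mixes the actual text with the styling markup (obviously).
Example: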
s = 'Here is a string'
styled_str = styling_func1(s)
print(styled_str)
# >>> "<font color='black'>Here</font> is a string"
styled_str = styling_func2(styled_str)
print(styled_str)
# >>> "<font <br>color='black'>Here</font> is a string"
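As you can see, if styling_func2 operates on the string, the <br> gets stuck inside a tag.
The actual functionality I need is to add <br> elements every ~N characters or words without causing these conflicts.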
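The collision is easy to reproduce with the standard library alone. This is a minimal sketch, assuming the second styling step is nothing more than `textwrap.wrap` joined with `<br>`; the `styling_func1` and `styling_func2` helpers are not shown in the question, so this is only a stand-in for them:

```python
import textwrap

styled = "<font color='black'>Here</font> is a string"

# textwrap knows nothing about markup: the space inside the <font ...> tag
# is just another place where a line may be broken.
naive = "<br>".join(textwrap.wrap(styled, width=10, break_long_words=False))
print(naive)
```

With these settings the first break lands between `<font` and `color=`, that is, inside the opening tag.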
Attempted solution:
import textwrap
from bs4 import BeautifulSoup
# single-quoted so the double quotes inside the markup do not end the string
s = 'Author Correction: Hybrid organic-inorganic polariton laser<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span>not<span style="color:red"> been fixed in the paper.</span> ========= Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span><span style="color:red"> been fixed in the paper.</span>'
# no parser is named here, so bs4 picks the best one installed (and warns)
soup = BeautifulSoup(s)
# How to keep the previous tags while inserting these breaks?
# get_text() drops all markup, so the <span> styling is lost here
"<br>".join(textwrap.wrap(soup.get_text(), 50))
Sample test data:
String input:
<html><body><p>Author Correction: Hybrid organic-inorganic polariton laser<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span>not<span style="color:red"> been fixed in the paper.</span> ========= Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span><span style="color:red"> been fixed in the paper.</span></p></body></html>
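i.e.
Author Correction: Hybrid organic-inorganic polariton laser.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has not been fixed in the paper. ========= Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has been fixed in the paper.
(in the original post, the red text was shown in bold)
Desired output: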
<html><body><p>Author Correction: Hybrid organic-inorganic <br>polariton laser<span style="color:red">.. A correction to this article has <br>been published and is linked from the HTML and PDF <br>versions of this paper. The error has </span>not<span style="color:red"> been <br>fixed in the paper.</span> ========= Publisher <br>Correction: Predictors of chronic kidney disease <br>in type 1 diabetes: a longitudinal study from the <br>AMD Annals initiative<span style="color:red">.. A correction to this <br>article has been published and is linked from the <br>HTML and PDF versions of this paper. The error has </span><span style="color:red"> <br>been fixed in the paper.</span></p></body></html>
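i.e.
Author Correction: Hybrid organic-inorganic
polariton laser.. A correction to this article has
been published and is linked from the HTML and PDF
versions of this paper. The error has not been
fixed in the paper. ========= Publisher
Correction: Predictors of chronic kidney disease
in type 1 diabetes: a longitudinal study from the
AMD Annals initiative.. A correction to this
article has been published and is linked from the
HTML and PDF versions of this paper. The error has
been fixed in the paper.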
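A sketch of one way to produce breaks of this kind with only the standard library: split the string into tags and text, copy the tags through unchanged, and count only visible characters when deciding where to break. The names `wrap_html_text`, `TAG_RE` and `SPACE_RE` are made up for this illustration, and the regex assumes well-formed tags with no `>` inside attribute values, so a real parser such as BeautifulSoup would likely be more robust. Unlike `textwrap`, this version does not break at hyphens and never splits a long word.

```python
import re

TAG_RE = re.compile(r"(<[^>]+>)")   # assumption: no '>' inside attribute values
SPACE_RE = re.compile(r"(\s+)")


def wrap_html_text(html, width=50, br="<br>"):
    """Insert `br` between words of the visible text, roughly every `width` characters."""
    out = []
    col = 0              # visible characters emitted on the current line
    after_space = False  # True when the last visible piece was whitespace
    for token in TAG_RE.split(html):
        if TAG_RE.fullmatch(token):
            out.append(token)  # tags pass through untouched and count as zero width
            continue
        for piece in SPACE_RE.split(token):
            if not piece:
                continue
            is_space = piece.isspace()
            # break only before a word that follows whitespace and would overflow
            if not is_space and after_space and col > 0 and col + len(piece) > width:
                out.append(br)
                col = 0
            out.append(piece)
            col += len(piece)
            after_space = is_space
    return "".join(out)


sample = "<p>one two <b>three</b> four five</p>"
wrapped = wrap_html_text(sample, width=8)
print(wrapped)
```

Because a break is only ever emitted between text pieces, it cannot land inside a tag, and removing every `<br>` from the result gives back the input string.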
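【Comments】:
- There are many things to consider when wrapping text: font size, non-proportional characters, and gaps between characters, all of which can change throughout the text via different markup tags. The one that knows all of this is the browser rendering the text to the screen, and it is controlled by CSS. Why not add styling instead of adding <br> manually?
- This is for an application *without* any CSS capabilities, specifically for displaying small text snippets in a graphing package. Only a small subset of HTML tags works with it, but <span> and <br> tags do. (*I know this may not be technically correct, since inline HTML styles are used, but I don't know much about the internals of the package that processes the input.)
- I had these strings styled nicely using <div>s, until I realized they were not handled by the package I'm using, so I had to revert to these <span> and <br> tags for now.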
Tags: python html string beautifulsoup