【Question title】: Add a break every N characters in an HTML string
【Posted】: 2020-02-28 17:24:33
【Question description】:

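Background:

I need to "textwrap" an HTML string so that <br> elements are applied only to the text inside the HTML string, never to the markup.

I can apply styling to a plain text string (as long as only one type of styling is needed).

However, once styling has been added to the string, the actual text is mixed with styling markup, so any further processing has to tell the two apart.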

Example:

s = 'Here is a string'
styled_str = styling_func1(s)
print(styled_str)

#        >>> "<font color='black'>Here</font> is a string"

styled_str = styling_func2(styled_str)
print(styled_str)

#        >>> "<font <br>color='black'>Here</font> is a string"

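As you can see, if styling_func2 operates on the already styled string, the <br> ends up inside a tag.

What I actually need is to add a <br> element every ~N characters or words without causing these conflicts.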
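
To make the conflict concrete, here is a minimal sketch. `naive_break` is a hypothetical stand-in for styling_func2 (whose real implementation is not shown in this post); it is assumed to insert the tag at fixed offsets of the raw string and to ignore markup entirely.

```python
def naive_break(s, n, tag="<br>"):
    # Hypothetical helper: cut the raw string into n-character chunks
    # and join them with the tag, without looking at the markup.
    return tag.join(s[i:i + n] for i in range(0, len(s), n))


styled = "<font color='black'>Here</font> is a string"
broken = naive_break(styled, 6)

# The first break already lands inside the opening <font ...> tag.
print(broken)
```

Because the offsets are computed on the raw string, nothing stops a break from falling between `<` and `>`.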

Attempted solution:

import textwrap

from bs4 import BeautifulSoup

s = 'Author Correction: Hybrid organic-inorganic polariton laser<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span>not<span style="color:red"> been fixed in the paper.</span>  =========  Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span><span style="color:red"> been fixed in the paper.</span>'

soup = BeautifulSoup(s, "html.parser")

# How to keep the previous tags while inserting these breaks?
"<br>".join(textwrap.wrap(soup.get_text(), 50))

Sample test data:

String input:

<html><body><p>Author Correction: Hybrid organic-inorganic polariton laser<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span>not<span style="color:red"> been fixed in the paper.</span>  =========  Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span><span style="color:red"> been fixed in the paper.</span></p></body></html>

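i.e.

Author Correction: Hybrid organic-inorganic polariton laser.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has not been fixed in the paper. ========= Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has been fixed in the paper.

(the text inside the <span> elements renders in red)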

Expected output:

<html><body><p>Author Correction: Hybrid organic-inorganic <br>polariton laser<span style="color:red">.. A correction to this article has <br>been published and is linked from the HTML and PDF <br>versions of this paper. The error has </span>not<span style="color:red"> been <br>fixed in the paper.</span>  =========  Publisher <br>Correction: Predictors of chronic kidney disease <br>in type 1 diabetes: a longitudinal study from the <br>AMD Annals initiative<span style="color:red">.. A correction to this <br>article has been published and is linked from the <br>HTML and PDF versions of this paper. The error has </span><span style="color:red"> <br>been fixed in the paper.</span></p></body></html>

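i.e.

Author Correction: Hybrid organic-inorganic

polariton laser.. A correction to this article has

been published and is linked from the HTML and PDF

versions of this paper. The error has not been

fixed in the paper. ========= Publisher

Correction: Predictors of chronic kidney disease

in type 1 diabetes: a longitudinal study from the

AMD Annals initiative.. A correction to this

article has been published and is linked from the

HTML and PDF versions of this paper. The error has

been fixed in the paper.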

【Question comments】:

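There is a lot to consider when wrapping text: font size, non-proportional characters, and the gaps between characters, all of which can change throughout the text depending on the markup tags. The component that knows all of this is the browser rendering the text to the screen, and it is controlled by CSS. Why not add styling instead of inserting <br> manually?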
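This is for an application without* any CSS functionality, specifically for displaying small snippets of text in a graphing package. Only a small subset of HTML tags works with it, but the <br> and <span> tags do. *I know this may not be technically correct, since inline HTML styles are used, but I don't know much about the internals of the package that processes the input.
I had these strings styled quite nicely with <div>s before I realized that the package I'm using does not process them, so I have had to fall back to these <br> and <span> tags for now.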

Tags: python html string beautifulsoup

【Solution 1】:

A rough brute-force approach, assuming there is no tag encapsulation (let me know if that is not the case), could be:

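def put_tags_every_N(_input, _tag, _n):
    # Insert _tag after every _n characters that lie outside <...> tags.
    _k = 0            # characters counted outside tags so far
    _in_tag = False   # True while scanning between '<' and '>'
    _len = len(_input)
    _i = 0
    while _i < _len:
        _c = _input[_i]
        if _c == '<':
            _in_tag = True
        elif _c == '>':
            _in_tag = False
        # The closing '>' still belongs to the tag, so it is not counted.
        if not _in_tag and _c != '>':
            _k += 1
            if _k % _n == 0:
                _input = _input[:_i+1]+_tag+_input[_i+1:]
                _len += len(_tag)  # the string grew, so extend the loop bound
        _i += 1
    return _input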

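def put_tags_every_N_nowordcut(_input, _tag, _n):
    # Like put_tags_every_N, but once _n characters have been counted the
    # break is delayed until the first position that does not cut a word.
    _k = 0
    _in_tag = False
    _position_past = False  # True once _n characters were counted and a break is pending
    _len = len(_input)
    _i = 0
    while _i < _len:
        _c = _input[_i]
        if _c == '<':
            _in_tag = True
        elif _c == '>':
            _in_tag = False
        if not _in_tag and _c != '>':
            _k += 1
            if _k % _n == 0:
                _position_past = True
            # Break after a space, or just before a tag or a period.
            # The bounds check avoids an IndexError on the last character.
            if _position_past and (_c == ' ' or (_i + 1 < _len and _input[_i+1] in ('<', '.'))):
                _input = _input[:_i+1]+_tag+_input[_i+1:]
                _len += len(_tag)
                _position_past = False
                _k = 0
        _i += 1
    return _input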

_tmp_input = '<html><body><p>Author Correction: Hybrid organic-inorganic polariton laser<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span>not<span style="color:red"> been fixed in the paper.</span>  =========  Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span><span style="color:red"> been fixed in the paper.</span></p></body></html>'
# print(put_tags_every_N(_input=_tmp_input, _tag='<br>', _n=50))
# print(put_tags_every_N(_input=_tmp_input, _tag='<br>', _n=5))
print(put_tags_every_N_nowordcut(_input=_tmp_input, _tag='<br>', _n=50))
# print(put_tags_every_N_nowordcut(_input=_tmp_input, _tag='<br>', _n=5))

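(put_tags_every_N simply places _tag every _n characters, while put_tags_every_N_nowordcut places _tag at the first opportunity once _n characters have been counted, so that words are kept intact and the next line does not start with a space.)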
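
The same idea can also be sketched by first splitting the string into tags and text, which avoids rebuilding the string on every insertion. This is a minimal sketch and not part of the answer above: `break_text_only` and `TAG_RE` are hypothetical names, the sketch breaks only after a space (it does not break before `<` or `.` as put_tags_every_N_nowordcut does), and it assumes that no literal `<` or `>` appears inside attribute values or in the text itself.

```python
import re

# Anything between '<' and the next '>' is treated as a tag (an assumption).
TAG_RE = re.compile(r"(<[^>]*>)")


def break_text_only(html, n, tag="<br>"):
    """Insert `tag` after the first space once n visible characters have passed."""
    out = []
    count = 0  # visible characters since the last break
    # re.split with a capturing group returns the text pieces and the tags.
    for token in TAG_RE.split(html):
        if TAG_RE.fullmatch(token):
            out.append(token)  # markup is copied through untouched
            continue
        for ch in token:
            out.append(ch)
            count += 1
            if count >= n and ch == " ":
                out.append(tag)
                count = 0
    return "".join(out)


sample = "<font color='black'>Here</font> is a string"
wrapped = break_text_only(sample, 3)
print(wrapped)
```

Since tags are copied through as whole tokens, a break can never land inside one, and removing the inserted breaks gives back the original string.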

【Comments】: