【Question title】: Add a break every N characters in an HTML string
【Posted】: 2020-02-28 17:24:33
【Question】:

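Background:

I need to "textwrap" an HTML string, so that <br> elements are applied to the text inside the HTML string.

I can apply styling to a text string (if only one type of styling is needed).

However, attaching styling to the string further mixes the actual text with the styling markup (obviously).

Example: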

s = 'Here is a string'
styled_str = styling_func1(s)
print(styled_str)

#        >>> "<font color='black'>Here</font> is a string"

styled_str = styling_func2(styled_str)
print(styled_str)

#        >>> "<font <br>color='black'>Here</font> is a string"

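As you can see, if styling_func2 operates on the string, the <br> gets stuck inside a tag.

What I actually need is to add <br> elements every ~N characters or words without causing these conflicts.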
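To make the conflict concrete, here is a minimal sketch. `naive_break` is a hypothetical stand-in for a function like styling_func2 (the real one is not shown in the question); it assumes the break is inserted by counting raw characters, markup included.

```python
def naive_break(s, n, tag="<br>"):
    """Insert tag after every n characters of the raw string, markup included."""
    return tag.join(s[i:i + n] for i in range(0, len(s), n))


styled = "<font color='black'>Here</font> is a string"
broken = naive_break(styled, 6)
print(broken)
# The first break lands inside the opening <font ...> tag, because the
# function cannot tell markup characters from text characters.
```

Any approach that counts characters in the raw string has this problem; the counter has to skip everything between `<` and `>`.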

Attempted solution:

import textwrap

from bs4 import BeautifulSoup

# Single-quoted, so the double quotes in the style attributes do not end the string.
s = 'Author Correction: Hybrid organic-inorganic polariton laser<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span>not<span style="color:red"> been fixed in the paper.</span>  =========  Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span><span style="color:red"> been fixed in the paper.</span>'

# Name the parser explicitly so bs4 does not have to guess one.
soup = BeautifulSoup(s, "html.parser")

# How to keep the previous tags while inserting these breaks?
# get_text() drops all markup, so this output has breaks but no <span> tags.
print("<br>".join(textwrap.wrap(soup.get_text(), 50)))

Sample test data:

String input:

<html><body><p>Author Correction: Hybrid organic-inorganic polariton laser<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span>not<span style="color:red"> been fixed in the paper.</span>  =========  Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span><span style="color:red"> been fixed in the paper.</span></p></body></html>

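Author Correction: Hybrid organic-inorganic polariton laser.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has not been fixed in the paper. ========= Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has been fixed in the paper.

(In the original post, bold marked the red text.)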

Desired output:

<html><body><p>Author Correction: Hybrid organic-inorganic <br>polariton laser<span style="color:red">.. A correction to this article has <br>been published and is linked from the HTML and PDF <br>versions of this paper. The error has </span>not<span style="color:red"> been <br>fixed in the paper.</span>  =========  Publisher <br>Correction: Predictors of chronic kidney disease <br>in type 1 diabetes: a longitudinal study from the <br>AMD Annals initiative<span style="color:red">.. A correction to this <br>article has been published and is linked from the <br>HTML and PDF versions of this paper. The error has </span><span style="color:red"> <br>been fixed in the paper.</span></p></body></html>

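Author Correction: Hybrid organic-inorganic

polariton laser.. A correction to this article has

been published and is linked from the HTML and PDF

versions of this paper. The error has not been

fixed in the paper. ========= Publisher

Correction: Predictors of chronic kidney disease

in type 1 diabetes: a longitudinal study from the

AMD Annals initiative.. A correction to this

article has been published and is linked from the

HTML and PDF versions of this paper. The error has

been fixed in the paper.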
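A small helper can recover rendered lines like these from a wrapped HTML string, which is handy for checking any candidate solution. This is a rough sketch: `visible_lines` is a name made up here, and it assumes tags never contain a literal `>` inside an attribute value.

```python
import re


def visible_lines(html, tag="<br>"):
    """Split at the break tag, then strip every remaining tag from each piece."""
    return [re.sub(r"<[^>]*>", "", part) for part in html.split(tag)]


sample = '<p>one two <br>three<span style="color:red"> four <br>five</span></p>'
for line in visible_lines(sample):
    print(repr(line), len(line))
```

Printing the length next to each line makes it easy to see whether the breaks fall roughly every N visible characters.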

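【Comments】:

  • There are many things to consider when wrapping text: font size, non-proportional characters, and the gaps between characters, all of which can change throughout the text via different markup tags. The thing that knows all of this is the browser rendering the text to the screen, and that is controlled by CSS. Why not add styles instead of manually adding <br>?
  • This is for an application that does not* have any CSS capabilities, specifically for displaying small text snippets in a graphics package. Only a small subset of HTML tags works with it, but the <span> and <br> tags do. *I know this may not be technically correct since inline HTML styles are used, but I don't know much about the internals of the package that processes the input.
  • I had these strings styled nicely using <div>s, before I realized they are not processed by the package I am using, so I have had to fall back to these <span> and <br> tags for now.

Tags: python html string beautifulsoup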


【Solution 1】:

A rough brute-force approach, assuming the tags are not nested (otherwise, please say so), could be:

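def put_tags_every_N(_input, _tag, _n):
    """Insert _tag after every _n characters that lie outside <...> tags."""
    _k = 0           # text characters counted so far
    _in_tag = False  # True while scanning between '<' and '>'
    _len = len(_input)
    _i = 0
    while _i < _len:
        _c = _input[_i]
        if _c == '<':
            _in_tag = True
        elif _c == '>':
            _in_tag = False
        # Count only text characters; the closing '>' still belongs to the tag.
        if not _in_tag and _c != '>':
            _k += 1
            if _k % _n == 0:
                # Splice _tag in after position _i. The loop then walks over it
                # like any other tag, so it is not counted as text (this relies
                # on _tag itself having the form '<...>').
                _input = _input[:_i+1] + _tag + _input[_i+1:]
                _len += len(_tag)
        _i += 1
    return _input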

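def put_tags_every_N_nowordcut(_input, _tag, _n):
    """Like put_tags_every_N, but wait for a word boundary before inserting."""
    _k = 0                  # text characters counted since the last break
    _in_tag = False         # True while scanning between '<' and '>'
    _position_past = False  # True once _n characters have been counted
    _len = len(_input)
    _i = 0
    while _i < _len:
        _c = _input[_i]
        if _c == '<':
            _in_tag = True
        elif _c == '>':
            _in_tag = False
        if not _in_tag and _c != '>':
            _k += 1
            if _k % _n == 0:
                _position_past = True
            # Look one character ahead; the bounds check avoids an IndexError
            # when the text runs to the very end of the input.
            _next = _input[_i+1] if _i + 1 < _len else ''
            # Break at the first opportunity: right after a space, or just
            # before a tag or a period.
            if _position_past and (_next in ('<', '.') or _c == ' '):
                _input = _input[:_i+1] + _tag + _input[_i+1:]
                _len += len(_tag)
                _position_past = False
                _k = 0
        _i += 1
    return _input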

_tmp_input = '<html><body><p>Author Correction: Hybrid organic-inorganic polariton laser<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span>not<span style="color:red"> been fixed in the paper.</span>  =========  Publisher Correction: Predictors of chronic kidney disease in type 1 diabetes: a longitudinal study from the AMD Annals initiative<span style="color:red">.. A correction to this article has been published and is linked from the HTML and PDF versions of this paper. The error has </span><span style="color:red"> been fixed in the paper.</span></p></body></html>'
# print(put_tags_every_N(_input=_tmp_input, _tag='<br>', _n=50))
# print(put_tags_every_N(_input=_tmp_input, _tag='<br>', _n=5))
print(put_tags_every_N_nowordcut(_input=_tmp_input, _tag='<br>', _n=50))
# print(put_tags_every_N_nowordcut(_input=_tmp_input, _tag='<br>', _n=5))

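(put_tags_every_N simply places _tag every _n characters, whereas put_tags_every_N_nowordcut places _tag at the first opportunity once _n characters have been counted, so that words are kept whole instead of being broken mid-word.)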
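For comparison, the same idea can be sketched by first splitting the string into tags and text with a regular expression, then counting only the text. This is an alternative sketch, not part of the answer above: `wrap_html_text` and its parameters are names made up here, and it assumes no `>` appears inside attribute values, no comments or scripts are present, and that an entity such as `&amp;` may be counted at its source length. Unlike the functions above, it breaks before the word that would overflow the line rather than after the limit has been passed.

```python
import re

TAG_RE = re.compile(r"(<[^>]*>)")  # capture group keeps the tags in the split


def wrap_html_text(html, width, tag="<br>"):
    """Insert tag at whitespace so each visible line holds about width chars."""
    out = []
    count = 0            # visible characters on the current line
    after_space = False  # True if the last visible character was whitespace
    for index, token in enumerate(TAG_RE.split(html)):
        if index % 2 == 1:  # odd positions are the captured tags
            out.append(token)
            continue
        for piece in re.split(r"(\s+)", token):
            if not piece:
                continue
            if piece.isspace():
                after_space = True
            else:
                # Break only at a word boundary, before an overflowing word.
                if after_space and count + len(piece) > width:
                    out.append(tag)
                    count = 0
                after_space = False
            out.append(piece)
            count += len(piece)
    return "".join(out)


html = '<p>one two <span style="color:red">three four</span> five six</p>'
wrapped = wrap_html_text(html, 9)
print(wrapped)
```

Because tags are copied through untouched, removing the inserted `<br>` tags from the result gives back the input exactly, which makes a quick sanity check. A word longer than `width` is left unbroken.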
