[Question title]: beautifulsoup: get inner content inside html tags
[Posted]: 2019-07-17 11:00:10
[Question description]:

I am building a translator that translates the text inside HTML tags, and I am using BeautifulSoup because it is one of the best HTML parsers in Python.

Here is the text, loaded into a soup:

In [95]: chalet.html                                                                                                                                                                       
Out[95]: '<h4><strong>&ldquo;Create a space I would be truly excited to stay in&rdquo;.</strong></h4>\r\n\r\n<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane&rsquo;s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Ch&eacute;ry.</strong></h4>\r\n\r\n<p>Belle Ch&eacute;ry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>\r\n\r\n<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children&rsquo;s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>\r\n\r\n<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children&rsquo;s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>'

In [96]: html = soup(chalet.html)                                                                                                                                                          

In [97]: print(chalet.html)                                                                                                                                                                
<h4><strong>&ldquo;Create a space I would be truly excited to stay in&rdquo;.</strong></h4>

<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane&rsquo;s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Ch&eacute;ry.</strong></h4>

<p>Belle Ch&eacute;ry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>

<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children&rsquo;s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>

<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children&rsquo;s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>

Next I break it into its contents so that I can parse them:

In [105]: html.contents                                                                                                                                                                    
Out[105]: 
[<h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>,
'\n',
<h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>,
'\n',
<p>Belle Chéry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>,
'\n',
<p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children’s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>,
'\n',
<p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children’s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>]

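Between all of these are newlines, which I can skip with a try/except block, but getting the string also seems to work for only some of the elements, not all of them: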

In [107]: contents[0]                                                                                                                                                                      
Out[107]: <h4><strong>“Create a space I would be truly excited to stay in”.</strong></h4>

In [108]: contents[0].string                                                                                                                                                               
Out[108]: '“Create a space I would be truly excited to stay in”.'

In [109]: contents[1]                                                                                                                                                                      
Out[109]: '\n'

In [110]: contents[1].string                                                                                                                                                               
Out[110]: '\n'

In [111]: contents[2]                                                                                                                                                                      
Out[111]: <h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Chéry.</strong></h4>

In [112]: contents[2].string    

If you know a way to extract these parts without stripping the tags, then replace will work on the main string.
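For reference, the difference between the two `<h4>` elements above comes from how `.string` is defined: it returns a value only when a tag has a single child (or a single child tag that itself has a `.string`), and is `None` otherwise. The sketch below uses short made-up markup rather than the chalet text, and assumes the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Made-up markup that mirrors the two <h4> shapes in the question.
html = (
    "<h4><strong>One child only.</strong></h4>"
    "<h4><strong>First part</strong> <strong>second part.</strong></h4>"
)
soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("h4")

single = first.string          # the lone <strong> child passes its string up
multiple = second.string       # None: <strong>, " ", <strong> are three children
pieces = list(second.strings)  # every text node under the tag, in document order
joined = second.get_text()     # the same text nodes concatenated
```

So `.strings` or `get_text()` is the safer way to read text from a tag that may contain several children.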

[Question discussion]:

    Tags: python beautifulsoup


    [Solution 1]:

    Use the .stripped_strings attribute to get clean, stripped text from the HTML.

    https://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings

    from bs4 import BeautifulSoup
    from pprint import pprint
    
    html = '''
    <h4><strong>&ldquo;Create a space I would be truly excited to stay in&rdquo;.</strong></h4>
    <h4><strong>That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane&rsquo;s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet</strong> <strong>Belle Ch&eacute;ry.</strong></h4>
    <p>Belle Ch&eacute;ry is a chalet built without constraint. A destination, to be experienced. The building itself nestles into the mountain and you enter down a 15m underground tunnel that takes you from the garage and boot room into the very heart of the chalet.</p>
    <p>The chalet itself is spread over 680m<sup>2</sup>, with one side of the chalet almost entirely glazed, offering mountain views from all the living spaces and entertainment areas. The chalet can sleep up to 14 guests across 5 luxurious bedrooms and a family/children&rsquo;s bunk room, all opening out onto secluded terraces and enjoying free standing baths and hanging seats.</p>
    <p>The specification list for the chalet is, of course, almost endless and includes a 23-meter indoor-outdoor swimming pool, a Bamford Spa including treatment room and sauna, a private gym, a cinema room, art gallery and children&rsquo;s playroom. The living space is vast and includes a luxurious lounge area with open fireplace and delicious sofas, a library, a floating mezzanine dining area and a bar mezzanine with balcony overlooking the mountains.</p>
    '''
    soup = BeautifulSoup(html, 'html.parser')
    texts = [*soup.stripped_strings]
    pprint(texts)
    

    Output:

    ['“Create a space I would be truly excited to stay in”.',
     'That was the brief given to renowned architect, Herve Marullaz, after Chalet '
     'Joux Plane’s owner secured a large plot of mountain land that backed onto a '
     'stream and an alpine woodland. The result was Chalet',
     'Belle Chéry.',
     'Belle Chéry is a chalet built without constraint. A destination, to be '
    ...
    

    To get a single long string:

    long_string = ' '.join(texts)
    

    Output:

    “Create a space I would be truly excited to stay in”. That was the brief given to renowned architect, Herve Marullaz, after Chalet Joux Plane’s owner secured a large plot of mountain land that backed onto a stream and an alpine woodland. The result was Chalet Belle C ...
    

    [Discussion]:

    • How can I get the output HTML string from the soup? This output handles the tags nicely.
    • str(soup) will give you the HTML.
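Since the goal is to translate the text while keeping the tags, one possible approach is to replace each text node in place and then serialise the tree with `str(soup)`. This is only a sketch: `translate` is a hypothetical stand-in (it merely upper-cases the text) for whatever translation call is actually used, and the sample markup is shortened.

```python
from bs4 import BeautifulSoup

def translate(text):
    # Hypothetical stand-in for a real translation call; it only upper-cases.
    return text.upper()

html = "<p>The chalet is spread over 680m<sup>2</sup>, with <strong>mountain views</strong>.</p>"
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list, so the tree can be modified safely while looping.
for node in soup.find_all(string=True):
    if node.strip():  # skip whitespace-only nodes such as '\n'
        node.replace_with(translate(node))

result = str(soup)  # the markup with the tags intact and the text replaced
```

Because each text node is handled separately, a sentence split across inline tags such as `<strong>` or `<sup>` is translated in fragments, which may matter for translation quality.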
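    [Solution 2]:

    You can use a list comprehension and str.join to join the contents list without the newlines and get the desired output (the elements are Tag objects, so each one is converted with str() first):

    contents = ''.join([str(data) for data in html.contents if data != '\n'])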
    

    Now you can make the soup:

    soup = BeautifulSoup(contents, 'lxml')
    

    Replace lxml with your preferred parser.
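A self-contained sketch of the same idea, under a few assumptions: the markup is shortened, the built-in `html.parser` is used instead of lxml so that no extra install is needed, and whitespace-only nodes are detected with `strip()` rather than by comparing against `'\n'`, which also covers `'\r\n'` separators.

```python
from bs4 import BeautifulSoup

# Shortened sample with the same '\r\n\r\n' separators as the question.
raw = "<h4><strong>Title</strong></h4>\r\n\r\n<p>Body text.</p>"
html = BeautifulSoup(raw, "html.parser")

# Keep every top-level node that is not pure whitespace, serialised back to markup.
contents = "".join(str(node) for node in html.contents if str(node).strip())

# Re-parse the cleaned markup.
soup = BeautifulSoup(contents, "html.parser")
```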

    [Discussion]:
