【Question title】: Extract text from a <p> element across <br> elements
【Posted】: 2018-08-09 22:19:55
【Question】:

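I am writing a script that uses BeautifulSoup to extract text from <p> elements. It works well until I hit a <p> element that contains <br> tags, in which case it only captures the text before the first <br> tag. How can I edit my code to capture all of the text?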

My code:

coms = soup.select('li > div[class=comments]')[0].select('p')
inp = [i.find(text=True).lstrip().rstrip() for i in coms]

The problem HTML (note the <br> tags):

<p>             
                    Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.<br>
<br>
ITR info:<br>
<br>
Rachel Hoffman, CD<br>
Chris Kory, acc.<br>
<br>
Monitor is Iftiaz Haroon.                </p>

What my code currently outputs:

>> 'Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.'

What my code should output (note the extra text):

>> 'Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen. ITR info: Rachel Hoffman, CD Chris Kory, acc. Monitor is Iftiaz Haroon.'

(Note: please excuse my sometimes shaky terminology; I am mostly self-taught.)

【Comments】:

Tags: python html web-scraping beautifulsoup


【Solution 1】:

I'm afraid the question may be mistaken. I copied the HTML into a file and ran the following code:

>>> import bs4
>>> soup = bs4.BeautifulSoup(open('matthew.htm').read(), 'lxml')
>>> soup.find('p').text
'             \n                    Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.\n\nITR info:\n\nRachel Hoffman, CD\nChris Kory, acc.\n\nMonitor is Iftiaz Haroon.                '

Clearly, recovering the desired text from this string is a simple matter.
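One possible way to do that recovery is sketched below; it is an assumption about what the answer has in mind, not the answerer's own code. It collapses every run of whitespace with str.split() and " ".join(). To stay self-contained it inlines a shortened copy of the HTML instead of reading matthew.htm, and it swaps in the built-in html.parser for lxml. It also shows find(string=True), which returns only the first text node and so reproduces the truncated result described in the question.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's HTML (assumption: inlined rather than read from a file).
html = """<p>
        Alts called now through 53.<br>
<br>
ITR info:<br>
Rachel Hoffman, CD<br>
Monitor is Iftiaz Haroon.        </p>"""

# html.parser ships with Python; the answer itself uses lxml.
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# find(string=True) returns only the first text node under <p>.
first_only = p.find(string=True).strip()

# .text concatenates every text node; split()/join() collapses the whitespace runs.
clean = " ".join(p.text.split())

print(first_only)
print(clean)
```

Splitting on whitespace and rejoining with a single space removes the leading indentation, the trailing spaces, and the newlines left between the <br> tags in one step.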

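【Comments】:

    【Solution 2】:

    You can use get_text(strip=True).

    From the documentation:

    If you only want the text part of a document or tag, you can use the get_text() method. It returns all the text in a document or beneath a tag, as a single Unicode string.

    You can tell Beautiful Soup to strip whitespace from the beginning and end of each bit of text by using strip=True.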

    from bs4 import BeautifulSoup

    html = '''<p>             
                        Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.<br>
    <br>
    ITR info:<br>
    <br>
    Rachel Hoffman, CD<br>
    Chris Kory, acc.<br>
    <br>
    Monitor is Iftiaz Haroon.                </p>'''
    
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find('p').get_text(strip=True))
    

    Output:

    Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.ITR info:Rachel Hoffman, CDChris Kory, acc.Monitor is Iftiaz Haroon.
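In this output the pieces run together ("seen.ITR info:", "CDChris Kory") because strip=True removes the whitespace that separated them. get_text() also accepts a separator argument, so a minimal sketch of a variant that matches the output requested in the question could look like the following. The built-in html.parser and the shortened HTML are assumptions made to keep the snippet self-contained.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's HTML, keeping the same <br> layout.
html = """<p>
        Alts called now through 53.<br>
<br>
ITR info:<br>
<br>
Rachel Hoffman, CD<br>
Chris Kory, acc.<br>
<br>
Monitor is Iftiaz Haroon.        </p>"""

# html.parser ships with Python; the answer itself uses lxml.
soup = BeautifulSoup(html, "html.parser")

# strip=True trims each text node and skips empty ones;
# separator=" " then joins the remaining pieces with a single space.
joined = soup.find("p").get_text(separator=" ", strip=True)
print(joined)
```

Each text node is stripped individually, so whitespace inside a single node is left untouched; only the gaps between nodes are replaced by the separator.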
    

    【Comments】:
