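【Posted】: 2018-08-09 22:19:55
【Question】:
I'm writing a script that uses BeautifulSoup to extract text from <p> elements. It works well until it hits a <p> element that contains <br> tags, in which case it only captures the text before the first <br> tag. How can I change my code to capture all of the text?
My code: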
coms = soup.select('li > div[class=comments]')[0].select('p')
inp = [i.find(text=True).lstrip().rstrip() for i in coms]
The problem HTML (note the <br> tags):
<p>
Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.<br>
<br>
ITR info:<br>
<br>
Rachel Hoffman, CD<br>
Chris Kory, acc.<br>
<br>
Monitor is Iftiaz Haroon. </p>
What my code currently outputs:
>> 'Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.'
What my code should output (note the additional text):
>> 'Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen. ITR info: Rachel Hoffman, CD Chris Kory, acc. Monitor is Iftiaz Haroon.'
(Note: please excuse my sometimes shaky terminology; I'm mostly self-taught.)
【Discussion】:
Tags: python html web-scraping beautifulsoup
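One possible approach, offered as a minimal sketch rather than a definitive fix: `find(text=True)` returns only the first text node inside the `<p>`, and each `<br>` splits the paragraph into separate text nodes, which would explain the truncated output. `get_text()` gathers every text node instead, and a `split()`/`join()` pass collapses the leftover newlines into single spaces. The `<li><div class="comments">` wrapper and the `html.parser` choice below are assumptions made so the example runs on its own; the question only shows the `<p>` element.

```python
from bs4 import BeautifulSoup

# The <p> is copied from the question; the <li>/<div> wrapper is assumed
# from the selector in the question and may differ from the real page.
html = """
<li><div class="comments">
<p>
Alts called now through 53. No more will be called til the 12:50 group. EMCs are still on the table to be seen.<br>
<br>
ITR info:<br>
<br>
Rachel Hoffman, CD<br>
Chris Kory, acc.<br>
<br>
Monitor is Iftiaz Haroon. </p>
</div></li>
"""

# "html.parser" is an assumption; the question does not say which parser is used.
soup = BeautifulSoup(html, "html.parser")
coms = soup.select("li > div[class=comments]")[0].select("p")

# get_text(" ") joins all text nodes under each <p> with a space;
# split() + join() then collapses runs of whitespace and trims the ends.
inp = [" ".join(p.get_text(" ").split()) for p in coms]

print(inp[0])
```

If the line breaks should be preserved instead of flattened, `p.get_text("\n")` or iterating over `p.stripped_strings` would keep each piece separate.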