BeautifulSoup parser not properly splitting by tags

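【Question title】: BeautifulSoup parser not properly splitting by tags
【Posted】: 2016-11-27 11:57:56
【Question】:

I am scraping a website and then trying to split the text into paragraphs. Looking at the scraped text, I can clearly see that some paragraph boundaries are not being split correctly. See the code below to reproduce the problem!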

from bs4 import BeautifulSoup
import requests

link = "http://www.presidency.ucsb.edu/ws/index.php?pid=111395"
response = requests.get(link)
soup = BeautifulSoup(response.content, 'html.parser')
paras = soup.find_all('p')
# Note that the paragraph printed below still contains a lot of "<p>" tags :(
print(paras[614])

I have tried other parsers, with similar results.

【Comments】:

Tags: python python-2.7 parsing web-scraping beautifulsoup

【Answer 1】:

Have you tried the lxml parser? I had a similar problem, and lxml solved it for me.

import lxml
...
soup = BeautifulSoup(response.text, "lxml")

You can also try response.text instead of response.content to get a unicode object.
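
To illustrate the difference between the two inputs, here is a small offline sketch. No request is made: `raw` and `text` are illustrative stand-ins for `response.content` and `response.text`, and `from_encoding` is passed explicitly so the example does not rely on encoding detection.

```python
from bs4 import BeautifulSoup

# Stand-in for response.content: the raw bytes of the page.
raw = u"<p>caf\u00e9</p>".encode("utf-8")
# Stand-in for response.text: the same markup already decoded to unicode.
text = raw.decode("utf-8")

from_bytes = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")
from_text = BeautifulSoup(text, "html.parser")

# Both inputs lead to the same parsed paragraph text.
same = from_bytes.p.get_text() == from_text.p.get_text()
print(same)
```

When the encoding is handled correctly, both inputs give the same tree, so this switch affects decoding rather than how tags are nested.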

【Comments】:

Unfortunately, neither works (lxml or using response.text). Thanks for the suggestion!

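【Answer 2】:

This is by design. It happens because the page contains nested paragraphs: a new <p> opens before the previous one has been closed, for example: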

<p>Neurosurgeon Ben Carson. [<i>applause</i>] <p>New Jersey
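
A minimal offline sketch of that nesting, assuming the fragment above is representative of the page's markup. The `sample` string and the variable names are illustrative and do not come from the question.

```python
from bs4 import BeautifulSoup

# Illustrative fragment modeled on the snippet above: the second <p> opens
# before the first one has been closed.
sample = "<p>Neurosurgeon Ben Carson. [<i>applause</i>] <p>New Jersey"

soup = BeautifulSoup(sample, "html.parser")
paras = soup.find_all("p")

# html.parser does not close the first <p> implicitly, so the second <p>
# ends up as its child and appears in the first paragraph's markup.
outer = paras[0]
inner = outer.find("p")
nested_count = str(outer).count("<p>")

print(len(paras), inner is not None, nested_count)
```

Here `find_all('p')` still returns both paragraphs, but the first one contains the second, which is why printing a single element shows extra "<p>" tags.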

I would use this small trick to work around it:

html = response.content.replace(b'<p>', b'</p><p>')  # close any open <p> first, so the soup has no nested <p> tags

# then parse `html` instead of response.content, as in your code
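
The trick can be tried without hitting the site. The sketch below wraps it in a hypothetical helper, `close_open_paragraphs` (not part of BeautifulSoup), and runs it on an illustrative fragment. It assumes the page uses plain lowercase `<p>` with no attributes, since a literal string replace will not match anything else.

```python
from bs4 import BeautifulSoup


def close_open_paragraphs(html):
    """Insert </p> before every literal <p> (hypothetical helper).

    Accepts bytes (like response.content) or unicode text (like response.text).
    """
    if isinstance(html, bytes):
        return html.replace(b"<p>", b"</p><p>")
    return html.replace("<p>", "</p><p>")


# Illustrative fragment with an unclosed first paragraph.
sample = "<p>Neurosurgeon Ben Carson. [<i>applause</i>] <p>New Jersey"

fixed = BeautifulSoup(close_open_paragraphs(sample), "html.parser")
paras = fixed.find_all("p")

# After the replace, each paragraph is a sibling instead of a child.
texts = [p.get_text().strip() for p in paras]
still_nested = [p for p in paras if p.find("p") is not None]

print(texts, len(still_nested))
```

The extra `</p>` placed before the very first `<p>` has no matching open tag, and html.parser drops it, so the replace only changes paragraphs that were left open.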

【Comments】: