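Parsing the HTML of the child element [BeautifulSoup]: answers

【Question title】: Parsing the html of the child element [BeautifulSoup]
【Posted】: 2020-06-22 04:20:48
【Question】:

I have only been learning Python for two weeks.

I'm scraping an XML file, and one of the elements I loop over [item->description] contains HTML. How can I get the text inside the p tags?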

url="https://www.milenio.com/rss"
source=requests.get(url)
soup=BeautifulSoup(source.content, features="xml")

items=soup.findAll('item')

for item in items:
  html_text=item.description
  # This returns HTML code: <p>Paragraph 1</p> <p>Paragraph 2</p>

The next line works, but the result includes some internal and external links and images that I don't need.

desc=item.description.get_text()

So I tried writing a loop to get all the p elements, but it doesn't work.

for p in html_text.find_all('p'):
  print(p)

AttributeError: 'NoneType' object has no attribute 'find_all'

Thank you very much!

【Comments】:

Use this SO link: stackoverflow.com/questions/2032172/…

Tags: python beautifulsoup

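【Solution 1】:

The problem is how bs4 handles CData (it is well documented, but not well solved).

You need to import CData from bs4, which lets you pull the CData section out as a string, and parse the feed with html.parser. Then build a new bs4 object from that string; the new object has findAll, so you can iterate over its contents.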

from bs4 import BeautifulSoup, CData
import requests

url="https://www.milenio.com/rss"
source=requests.get(url)
soup = BeautifulSoup(source.content, 'html.parser')

items=soup.findAll('item')

for item in items:
  html_text = item.description
  findCdata = html_text.find(text=lambda tag: isinstance(tag, CData))
  newSoup = BeautifulSoup(findCdata, 'html.parser')
  paragraphs = newSoup.findAll('p')
  for p in paragraphs:
    print(p.get_text())
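
The second half of that approach, re-parsing the extracted string and keeping only the paragraphs, can be tried offline. This is a minimal sketch, not the answerer's code: `description_html` is a made-up stand-in for the string pulled out of the CData section, and `paragraph_texts` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical payload standing in for the string extracted from CData.
# The URLs are placeholders, not taken from the real feed.
description_html = (
    '<p>Paragraph 1</p>'
    '<a href="https://example.com/more">Read more</a>'
    '<img src="https://example.com/pic.jpg"/>'
    '<p>Paragraph 2</p>'
)

def paragraph_texts(html):
    """Parse an HTML fragment and return the text of each <p>, in order."""
    fragment = BeautifulSoup(html, "html.parser")
    # find_all("p") skips the <a> and <img> siblings entirely.
    return [p.get_text(strip=True) for p in fragment.find_all("p")]

paragraphs = paragraph_texts(description_html)
print(paragraphs)
```

Because only the p elements are selected, the link text and the image never reach the output, which is what `item.description.get_text()` alone could not do.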

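Edit: The OP also needed the link text and found that it can only be read with link = item.link.nextSibling inside the items loop, because the link content ends up outside its tag, as in </link>http://www... In an XML tree view, this particular document shows a dropdown for the link element, which may be the cause.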
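
A likely explanation, not confirmed in this thread, is that link is a void element in HTML, so html.parser closes it as soon as it opens and the URL becomes a sibling text node. The sketch below reproduces the behaviour on a made-up one-item feed; the URL is a placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical one-item feed; the URL is a placeholder, not from the real RSS.
rss = "<item><link>https://example.com/story</link></item>"

soup = BeautifulSoup(rss, "html.parser")
item = soup.find("item")

# Text inside <link>: empty, because the tag is closed immediately.
inside_tag = item.link.get_text()

# The URL is the text node that follows the <link> tag.
after_tag = str(item.link.next_sibling).strip()

print(repr(inside_tag))
print(after_tag)
```

`next_sibling` is the current spelling of the `nextSibling` attribute used above.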

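To get the content of other tags in the document, those that show no dropdown in the XML tree view and contain no nested CData, write the tag name in lowercase and read the text as usual: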

item.pubdate.get_text() # Gets contents of the tag <pubDate>
item.author.get_text() # Gets contents of the tag <author>
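
The lowercase names are needed because html.parser lowercases tag names while parsing. A small sketch on a made-up item (the date and author values are placeholders) shows the stored names and what happens when the original camelCase name is looked up:

```python
from bs4 import BeautifulSoup

# Hypothetical item; the values are placeholders.
rss = (
    "<item>"
    "<pubDate>Mon, 22 Jun 2020 04:20:48 GMT</pubDate>"
    "<author>Example Author</author>"
    "</item>"
)

item = BeautifulSoup(rss, "html.parser").find("item")

# Names as stored in the parsed tree.
tag_names = [child.name for child in item.find_all(True)]

# The lowercase lookup finds the tag; the camelCase lookup does not.
pub_date = item.pubdate.get_text()
camel_case_lookup = item.find("pubDate")

print(tag_names)
print(pub_date)
print(camel_case_lookup)
```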

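【Comments】:

Thank you very much, it works great, but now there is a problem. With "html.parser", the link text ends up outside its tag... milenio.com/estados/… || I tried item.text but it doesn't work. Is there a way to get that link? Thanks a lot!
This code works; it gets the link that follows the tag: link=item.link.nextSibling
Nice. Oddly, the link is the only place where this happens in your particular case. For example, the <pubDate> tag can be read with item.pubdate.get_text(), and its content stays inside the tag. It may be related to your link element getting a dropdown in the XML tree view. I'll edit the answer to include more information for future readers.

【Solution 2】:

It should look like this: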

for item in items:
    html_text=item.description #??

    #!! don't use html_text.find_all !!
    for p in item.find_all('p'):
        print(p)

【Comments】: