Python FeedParser format Reddit Nicely (answers)

【Question title】: Python FeedParser format Reddit Nicely
【Posted】: 2015-08-30 16:08:02
【Question description】:

I'm trying to write a program that prints the top 5 jokes from /r/Jokes, but I'm having trouble formatting the output so it looks nice. I want it laid out like this:

Post Title: Post Content

For example, here is one of the jokes straight from the RSS feed:

<item>

    <title>What do you call a stack of pancakes?</title>

    <link>https://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/</link>

    <guid isPermaLink="true">https://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/</guid>

    <pubDate>Sun, 30 Aug 2015 03:18:00 +0000</pubDate>

    <description><!-- SC_OFF --><div class="md"><p>A balanced breakfast</p> </div><!-- SC_ON --> submitted by <a href="http://www.reddit.com/user/TheRealCreamytoast"> TheRealCreamytoast </a> <br/> <a href="http://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/">[link]</a> <a href="https://www.reddit.com/r/Jokes/comments/3ix348/what_do_you_call_a_stack_of_pancakes/">[2 comments]</a></description>

</item>

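Currently I print the title, followed by a colon and a space, then the description. However, that prints all of the text, including the links, the author, and all the HTML tags. How can I get only the text inside the paragraph tags?

Thanks,

Edit: here is my code: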

import time

import feedparser

d = feedparser.parse('https://www.reddit.com/r/cleanjokes/.rss')
print("")
print("Pulling latest jokes from Reddit. https://www.reddit.com/r/cleanjokes")
print("")
time.sleep(0.8)
print("Displaying First 5 Jokes:")
print("")
# Print the first five entries as "title: description" (the description is still raw HTML).
for entry in d['entries'][:5]:
    print(entry['title'] + ": " + entry['description'])

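This just grabs the first 5 entries. What I need is to format the description string after the colon so that it contains only the text inside the paragraph tags.

【Comments】:

How do you get the title? (I'd like to see some code; it would help me (maybe).)
Updated the OP with the code.

Tags: python xml rss reddit feedparser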

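【Solution 1】:

Oren is right about using BeautifulSoup, but I'll try to give a more complete answer.

d['entries'][0]['description'] returns HTML, which you need to parse. bs (BeautifulSoup) is a great library for that.

You can install it with: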

pip install beautifulsoup4

from bs4 import BeautifulSoup 
soup = BeautifulSoup(d['entries'][0]['description'], 'html.parser') 
print(soup.div.get_text())

This gets the text from the div part of the entry.
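To tie this back to the "Post Title: Post Content" layout from the question, here is a minimal sketch. The helper name `format_entry` and the shortened `sample` string are made up for illustration, and the fallback for descriptions without a `<div>` is an assumption, not something the question's feed is known to need.

```python
from bs4 import BeautifulSoup


def format_entry(title, description_html):
    """Return 'Title: text', using only the text inside the first <div>."""
    soup = BeautifulSoup(description_html, "html.parser")
    div = soup.div
    if div is not None:
        text = div.get_text(" ", strip=True)
    else:
        # Assumed fallback: no <div> found, so use all text in the description.
        text = soup.get_text(" ", strip=True)
    return f"{title}: {text}"


# Shortened copy of the description from the question's sample <item>.
sample = (
    '<!-- SC_OFF --><div class="md"><p>A balanced breakfast</p> </div><!-- SC_ON -->'
    ' submitted by <a href="http://www.reddit.com/user/TheRealCreamytoast">'
    ' TheRealCreamytoast </a>'
)
print(format_entry("What do you call a stack of pancakes?", sample))
```

With feedparser, the same helper could be called as `format_entry(entry['title'], entry['description'])` for each `entry` in `d['entries'][:5]`.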

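【Comments】:

I need the text between the paragraph tags, so I changed div.get to p.get, but I get this error: AttributeError: 'NoneType' object has no attribute 'get_text'
@FeaturedEpic soup.p.get_text() works for me. But the original should also give you the text you want. <p> is a subset of <div> (in this case!).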
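On the AttributeError above: `soup.p` is None whenever the parsed HTML contains no `<p>` tag, so calling `.get_text()` on it fails. A rough sketch of one way around that is to use `find_all`, which returns an empty list instead of None. The helper name `paragraph_text` is hypothetical, and the second input string is invented just to show the no-`<p>` case.

```python
from bs4 import BeautifulSoup


def paragraph_text(description_html):
    """Return the text of every <p> in the HTML, or '' if there is none."""
    soup = BeautifulSoup(description_html, "html.parser")
    paragraphs = soup.find_all("p")  # empty list (not None) when no <p> exists
    return "\n".join(p.get_text().strip() for p in paragraphs)


print(paragraph_text('<div class="md"><p>A balanced breakfast</p> </div>'))
# Invented input with no <p> tag: returns an empty string instead of raising.
print(repr(paragraph_text('submitted by <a href="#">someone</a>')))
```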

【Solution 2】:

You can do this with the BeautifulSoup package.

Link to documentation

from bs4 import BeautifulSoup
# html_doc is the HTML string to parse, e.g. d['entries'][0]['description']
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.get_text())
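One thing worth checking with this approach: `soup.get_text()` on a whole description returns the text of every tag, which for the sample item in the question would include the "submitted by" line and the link labels. The sketch below compares it with restricting to the first `<div>`; the `html_doc` string is a shortened, partly invented stand-in for a real description.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one entry's description HTML (the "#" href is invented).
html_doc = (
    '<div class="md"><p>A balanced breakfast</p> </div>'
    ' submitted by <a href="http://www.reddit.com/user/TheRealCreamytoast">'
    ' TheRealCreamytoast </a>'
    ' <br/> <a href="#">[link]</a>'
)

soup = BeautifulSoup(html_doc, "html.parser")
whole_text = soup.get_text()     # text of every tag in the description
joke_text = soup.div.get_text()  # text of the first <div> only

print(repr(whole_text))
print(repr(joke_text))
```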

【Comments】: