Extracting span tag content from an HTML page (answers)

【Question title】: Extracting span tag content from html page
【Posted】: 2020-05-17 09:54:19
【Question】:

I am trying to extract the date and time at which each article was published from this link - https://www.moneycontrol.com/news/tags/coronavirus.html/page-2/

for link in soup.find_all('span'):
    print(link)

This gives me every span on the page.

<li class="clearfix" id="newslist-2">
    <a href="https://www.moneycontrol.com/news/world/europe-should-temporarily-ban-chinese-takeovers-germanys-weber-5277251.html" title="Europe should temporarily ban Chinese takeovers - Germany's Weber"><img data="https://images.moneycontrol.com/static-mcnews/2020/05/Manfred-Weber-613x435.jpg" class="" src="https://images.moneycontrol.com/static-mcnews/2020/05/Manfred-Weber-613x435.jpg" data-src="https://images.moneycontrol.com/static-mcnews/2020/05/Manfred-Weber-613x435.jpg" alt="Europe should temporarily ban Chinese takeovers - Germany's Weber" title="Europe should temporarily ban Chinese takeovers - Germany's Weber"></a>
    <span>May 17, 2020 08:46 AM IST</span>
</li>

I guess the clearfix class items might help, but I don't know how to use them.

Could you help me get the date and time for each article?

【Question comments】:

First select all the li elements with that class, then use a for loop to search within each element separately for the link.
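The approach in this comment could look roughly like the sketch below. It is only an illustration: the inline `html` string and the `extract_dates` helper are made up for the example (the real page would be fetched and passed to BeautifulSoup instead), and it assumes each article's date sits in the first span inside its `li.clearfix` element, as in the snippet from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page, modelled on the <li> shown in the question.
html = """
<ul>
  <li class="clearfix" id="newslist-0"><a href="https://example.com/a">A</a> <span>May 17, 2020 08:46 AM IST</span></li>
  <li class="clearfix" id="newslist-1"><a href="https://example.com/b">B</a> <span>May 17, 2020 08:30 AM IST</span></li>
  <li class="clearfix">an item without a span</li>
</ul>
"""

def extract_dates(markup):
    """Return the text of the first <span> inside every <li class="clearfix">."""
    soup = BeautifulSoup(markup, "html.parser")
    dates = []
    for item in soup.find_all("li", class_="clearfix"):
        span = item.find("span")  # search inside this <li> only
        if span is not None:      # skip items that have no span
            dates.append(span.get_text(strip=True))
    return dates

print(extract_dates(html))
```

Searching inside each `li` rather than over the whole document is what keeps unrelated spans elsewhere on the page out of the result.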

Tags: python web-scraping beautifulsoup

【Solution 1】:

You can use the [id^=newslist] CSS selector to get all the news articles, then get the link and publication date of each article:

for article in soup.select('[id^=newslist]'):
    link = article.select_one('a')['href']
    published_date = article.select_one('span').text
    print(published_date, link)
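If the extracted text is needed as an actual date and time rather than a string, something like the following might work. This is a sketch that goes beyond the answer itself: `parse_published` is a hypothetical helper, and the format string is an assumption based on the single sample in the question ("May 17, 2020 08:46 AM IST"); other articles may use a different month style. The trailing "IST" is simply dropped, so the result is a naive datetime with no time zone attached.

```python
from datetime import datetime

def parse_published(text):
    """Parse a string like 'May 17, 2020 08:46 AM IST' into a naive datetime."""
    cleaned = text.strip()
    # Assumption: the time zone suffix is always the literal " IST"; strip it if present.
    if cleaned.endswith(" IST"):
        cleaned = cleaned[:-len(" IST")]
    # %b = abbreviated month name, %I = 12-hour clock, %p = AM/PM
    return datetime.strptime(cleaned, "%b %d, %Y %I:%M %p")

parsed = parse_published("May 17, 2020 08:46 AM IST")
print(parsed.isoformat())  # 2020-05-17T08:46:00
```

In the loop above, `parse_published(published_date)` would be called on each article's span text; a string that does not match the assumed format raises `ValueError`.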

【Comments】: