【Question title】: Get tag 'a' from Beautiful Soup
【Posted】: 2020-11-01 06:30:45
【Question】:

I have an HTML page parsed into a soup object named `a`. On that page I want to find the hrefs of the links whose text contains "AFT" (case-insensitive). Doing this:

>>> rows = a.findAll('span', attrs={'class': 'views-field views-field-title'})

The output is:

[<span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201030-next-issuance-btfs" hreflang="en">30 October 2020: AFT’s next issuance of BTFs: Monday 02 November 2020 </a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201030-next-issuance-oats" hreflang="en">30 October 2020: BFT’s next issuance of long-term OATs: Thursday 05 November 2020</a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201026-issuance-btfs" hreflang="en">26 October 2020: AFT's issuance: 5.289 billion euros of BTFs</a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201023-next-issuance-btfs" hreflang="en">23 October 2020: AFT’s next issuance of BTFs: Monday 26 October 2020 </a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201019-issuance-btfs" hreflang="en">19 October 2020: AFT's issuance: 5.489 billion euros of BTFs</a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201016-next-issuance-btfs" hreflang="en">16 October 2020: AFT’s next issuance of BTFs: Monday 19 October 2020 </a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201015-next-issuance-inflation-indexed-oats" hreflang="en">15 October 2020: AFT’s issuance: 1.000 billion euros of inflation-indexed OATs</a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201015-issuance-oats" hreflang="en">15 October 2020: AFT’s issuance: 7.240 billion euros of medium-term OATs</a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201012-issuance-btfs" hreflang="en">12 October 2020: AFT's issuance: 5.288 billion euros of BTFs</a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201009-next-issuance-indexed-oats" hreflang="en">09 October 2020: AFT’s next issuance of inflation-indexed OATs: Thursday 15 October 2020</a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201009-next-issuance-btfs" hreflang="en">09 October 2020: AFT’s next issuance of BTFs: Monday 12 October 2020 </a>
</span></span>, <span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201009-next-issuance-oats" hreflang="en">09 October 2020: AFT’s next issuance of medium-term OATs: Thursday 15 October 2020</a>
</span></span>]

From the above I want every href except the one in this item (the second element of the list), because its text does not contain 'AFT':

<span class="views-field views-field-title"><span class="field-content">
<a href="/index.php/en/publications/communiques-presse/20201030-next-issuance-oats" hreflang="en">30 October 2020: BFT’s next issuance of long-term OATs: Thursday 05 November 2020</a>
</span></span>

Can someone help me extract the hrefs from rows as a list? Thanks.

【Comments】:

  • Not exactly. I am interested in a specific kind of href.
  • What kind of href? In your example, every href has the text AFT.
  • Please see the edited question.

Tags: python html beautifulsoup


【Solution 1】:
href = [row.find('a').get('href') for row in rows if 'AFT' in row.text]
print(href)

Output:

['/index.php/en/publications/communiques-presse/20201030-next-issuance-btfs',
 '/index.php/en/publications/communiques-presse/20201026-issuance-btfs',
 '/index.php/en/publications/communiques-presse/20201023-next-issuance-btfs',
 '/index.php/en/publications/communiques-presse/20201019-issuance-btfs',
 '/index.php/en/publications/communiques-presse/20201016-next-issuance-btfs',
 '/index.php/en/publications/communiques-presse/20201015-next-issuance-inflation-indexed-oats',
 '/index.php/en/publications/communiques-presse/20201015-issuance-oats',
 '/index.php/en/publications/communiques-presse/20201012-issuance-btfs',
 '/index.php/en/publications/communiques-presse/20201009-next-issuance-indexed-oats',
 '/index.php/en/publications/communiques-presse/20201009-next-issuance-btfs',
 '/index.php/en/publications/communiques-presse/20201009-next-issuance-oats']
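
The question asks for a case-insensitive match, while `'AFT' in row.text` is case-sensitive. A minimal sketch of a case-insensitive variant follows; the two-row `html_snippet` and its shortened hrefs are hypothetical sample data modelled on the question's markup, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row sample modelled on the question's markup.
html_snippet = """
<span class="views-field views-field-title"><span class="field-content">
<a href="/en/20201030-next-issuance-btfs" hreflang="en">30 October 2020: AFT's next issuance of BTFs</a>
</span></span>
<span class="views-field views-field-title"><span class="field-content">
<a href="/en/20201030-next-issuance-oats" hreflang="en">30 October 2020: BFT's next issuance of long-term OATs</a>
</span></span>
"""

a = BeautifulSoup(html_snippet, "html.parser")
rows = a.find_all("span", attrs={"class": "views-field views-field-title"})

hrefs = []
for row in rows:
    # Skip rows that have no <a> with an href instead of raising AttributeError.
    anchor = row.find("a", href=True)
    # casefold() lowercases the text so "AFT", "Aft" and "aft" all match.
    if anchor is not None and "aft" in anchor.get_text().casefold():
        hrefs.append(anchor["href"])

print(hrefs)
```

Note that this is a plain substring test, so on other pages it would also match words such as "after" or "draft".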

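【Comments】:

  • But it does not filter on the "AFT" condition for me. Please see the edited question.
  • That was missing from the question when I answered. I have edited my answer.
  • You did edit your question and the sample input after I answered.
【Solution 2】: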

You can write a custom finder function that matches whatever you need.

def aft_tag(tag):
    return tag.get('href') and 'AFT' in tag.text

for tag in soup.find_all(aft_tag):
    print(tag.get('href'))

Another way to write it:

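for row in a.findAll('span', attrs={'class': 'views-field views-field-title'}):
    anchor = row.find('a')
    # Test the anchor's text: `'AFT' in anchor` would compare against its child nodes instead
    if anchor is not None and 'AFT' in anchor.text:
        print(anchor.get('href'))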

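【Comments】:

  • But this also picks up other hrefs. I am only interested in the 'a' tags inside the 'span' condition mentioned in my question.
  • You can add more conditions to the function: tag.name == 'a' and 'foo' in tag.get('class', []) ... There are many tag methods and attributes to explore :-)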
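
As a rough illustration of adding conditions to the finder function, the sketch below only accepts `a` tags that carry an href, contain 'AFT', and sit inside a span with the `views-field-title` class. The function name `aft_link_in_title_span` and the sample `html_snippet` (including the stray "About AFT" link) are hypothetical, made up for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one link outside the title spans, two inside.
html_snippet = """
<a href="/en/about-aft">About AFT</a>
<span class="views-field views-field-title"><span class="field-content">
<a href="/en/20201026-issuance-btfs" hreflang="en">26 October 2020: AFT's issuance</a>
</span></span>
<span class="views-field views-field-title"><span class="field-content">
<a href="/en/20201030-next-issuance-oats" hreflang="en">30 October 2020: BFT's next issuance</a>
</span></span>
"""

soup = BeautifulSoup(html_snippet, "html.parser")


def aft_link_in_title_span(tag):
    """Return True for <a href> tags containing 'AFT' inside a title span."""
    if tag.name != "a" or not tag.has_attr("href"):
        return False
    if "AFT" not in tag.get_text():
        return False
    # find_parent returns None when no enclosing span has the class.
    return tag.find_parent("span", class_="views-field-title") is not None


hrefs = [tag["href"] for tag in soup.find_all(aft_link_in_title_span)]
print(hrefs)
```

The "About AFT" link is skipped because it has no enclosing title span, which is the restriction the asker wanted.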
【Solution 3】:

To find the hrefs of links whose text contains AFT, you can use the CSS selector pseudo-class :contains(<my text>):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_snippet, "html.parser")

# Select the class `views-field views-field-title` and `a` which contains the text `AFT`
for tag in soup.select(".views-field.views-field-title a:contains(AFT)"):
    print(tag['href'])
  

Output:

/index.php/en/publications/communiques-presse/20201030-next-issuance-btfs
/index.php/en/publications/communiques-presse/20201026-issuance-btfs
/index.php/en/publications/communiques-presse/20201023-next-issuance-btfs
/index.php/en/publications/communiques-presse/20201019-issuance-btfs
/index.php/en/publications/communiques-presse/20201016-next-issuance-btfs
/index.php/en/publications/communiques-presse/20201015-next-issuance-inflation-indexed-oats
/index.php/en/publications/communiques-presse/20201015-issuance-oats
/index.php/en/publications/communiques-presse/20201012-issuance-btfs
/index.php/en/publications/communiques-presse/20201009-next-issuance-indexed-oats
/index.php/en/publications/communiques-presse/20201009-next-issuance-btfs
/index.php/en/publications/communiques-presse/20201009-next-issuance-oats
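
Two caveats on the selector above, as far as I know: the text match is case-sensitive, and newer Soup Sieve releases spell the pseudo-class `:-soup-contains()` and warn that the bare `:contains()` form is deprecated. One way to sidestep both is to let the selector handle only the structure and do the text test in Python. The sketch below assumes that approach; the sample `html_snippet` and the `aft_word` pattern are hypothetical.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical sample with a lowercase "aft" and a look-alike word ("Draft").
html_snippet = """
<span class="views-field views-field-title"><span class="field-content">
<a href="/en/20201019-issuance-btfs" hreflang="en">19 October 2020: aft's issuance</a>
</span></span>
<span class="views-field views-field-title"><span class="field-content">
<a href="/en/20201030-next-issuance-oats" hreflang="en">30 October 2020: BFT's next issuance</a>
</span></span>
<span class="views-field views-field-title"><span class="field-content">
<a href="/en/draft-schedule" hreflang="en">Draft schedule</a>
</span></span>
"""

soup = BeautifulSoup(html_snippet, "html.parser")

# Whole-word, case-insensitive match, so "Draft" does not count as "AFT".
aft_word = re.compile(r"\bAFT\b", re.IGNORECASE)

hrefs = [
    tag["href"]
    # The selector narrows to links with an href inside the title spans.
    for tag in soup.select("span.views-field.views-field-title a[href]")
    if aft_word.search(tag.get_text())
]
print(hrefs)
```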
