【Question title】: Unable to get the anchor tag using BeautifulSoup
【Posted】: 2020-09-04 23:18:52
【Question description】:
I want to get the names and links from the list of anchor tags inside a section, but I am unable to retrieve them.
URL: https://www.snopes.com/collections/new-coronavirus-collection/
category = []
url = []
for ul in soup.findAll('a', {"class": "collected-list"}):
    if ul is not None:
        category.append(ul.get_text())
    else:
        category.append("")
    links = ul.findAll('a')
    if links is not None:
        for a in links:
            url.append(a['href'])
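Earlier I was able to get the list and the URLs, but the site structure has since changed and my code no longer works. The expected output is like this: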
【Discussion】:
Tags:
html
python-3.x
beautifulsoup
【Solution 1】:
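It looks like the a tags of interest now have the class collected-item rather than collected-list (which is now the class of the enclosing section). You can search for all a tags with the class collected-item, then find the h5 tag with the class title under each anchor to get its title text. After stripping the common prefix, that text appears to match the categories described in your expected output.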
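from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.snopes.com/collections/new-coronavirus-collection/').text
soup = BeautifulSoup(source, 'lxml')  # the 'lxml' parser requires the lxml package

category = []
url = []
# Each collection entry is an <a class="collected-item"> element
for item in soup.find_all('a', {"class": "collected-item"}):
    heading = item.find('h5', {"class": "title"})
    if heading is None:
        continue  # skip anchors that have no title heading
    title = heading.get_text()
    # Strip the common prefix so only the category name remains
    title_short = title.replace("The Coronavirus Collection: ", "")
    category.append(title_short)
    url.append(item['href'])

for c, u in zip(category, url):
    print(c, u)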
Origins and Spread https://www.snopes.com/collections/coronavirus-origins-treatments/?collection-id=238235
Prevention and Treatments https://www.snopes.com/collections/coronavirus-collection-prevention-treatments/?collection-id=238235
Prevention and Treatments II https://www.snopes.com/collections/coronavirus-collection-prevention-treatments-2/?collection-id=238235
International Response https://www.snopes.com/collections/coronavirus-international-rumors/?collection-id=238235
US Government Response https://www.snopes.com/collections/coronavirus-government-role/?collection-id=238235
Trump and the Pandemic https://www.snopes.com/collections/coronavirus-collection-trump/?collection-id=238235
Trump and the Pandemic II https://www.snopes.com/collections/coronavirus-collection-trump-2/?collection-id=238235
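Since the live page can change again, it may help to test the extraction logic against a fixed snippet before pointing it at the site. The following is a minimal sketch of the same approach using CSS selectors. The SAMPLE_HTML markup, the example.com links, and the extract_items helper are assumptions made for illustration, modelled on the structure described above (a section with class collected-list containing a tags with class collected-item); the real page may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the structure described in the answer;
# the hrefs are placeholders, not real Snopes URLs.
SAMPLE_HTML = """
<section class="collected-list">
  <a class="collected-item" href="https://example.com/origins/">
    <h5 class="title">The Coronavirus Collection: Origins and Spread</h5>
  </a>
  <a class="collected-item" href="https://example.com/prevention/">
    <h5 class="title">The Coronavirus Collection: Prevention and Treatments</h5>
  </a>
  <a class="collected-item" href="https://example.com/no-title/"></a>
</section>
"""

PREFIX = "The Coronavirus Collection: "


def extract_items(html):
    """Return (category, url) pairs for each titled collected-item anchor."""
    # html.parser ships with Python, so no extra parser package is needed
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for anchor in soup.select("section.collected-list a.collected-item"):
        heading = anchor.select_one("h5.title")
        if heading is None:
            continue  # anchor without a title heading: skip it
        title = heading.get_text(strip=True)
        if title.startswith(PREFIX):
            title = title[len(PREFIX):]
        items.append((title, anchor.get("href", "")))
    return items


for title, href in extract_items(SAMPLE_HTML):
    print(title, href)
```

To run it against the live site, the text returned by requests.get(...) can be passed to extract_items in place of SAMPLE_HTML. If it returns an empty list, that is a hint that the class names have changed again.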