[Posted]: 2021-06-17 12:25:47
[Question]:
I am trying to extract all the article bodies from the "Latest updates" section of https://www.bbc.com/news/coronavirus.
I have successfully extracted the bodies from the first page (1 of 50).
I would like to move on to the next page and repeat the process.
Here is the code I wrote:
from bs4 import BeautifulSoup as soup
import requests

links = []
header = []
body_text = []

r = requests.get('https://www.bbc.com/news/coronavirus')
b = soup(r.content, 'lxml')

# Select the "Latest updates" section
latest = b.find(class_="gel-layout__item gel-3/5@l")

# Get the titles
for news in latest.findAll('h3'):
    header.append(news.text)
    #print(news.text)

# Get the sub-links
for news in latest.findAll('h3', {'class': 'lx-stream-post__header-title gel-great-primer-bold qa-post-title gs-u-mt0 gs-u-mb-'}):
    links.append('https://www.bbc.com' + news.a['href'])

# Visit each sub-link and extract its body text
for link in links:
    page = requests.get(link)
    bsobj = soup(page.content, 'lxml')
    for news in bsobj.findAll('div', {'class': 'ssrcss-18snukc-RichTextContainer e5tfeyi1'}):
        body_text.append(news.text.strip())
        #print(news.text.strip())
How can I move on to the next page and scrape it as well?
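My guess is that the page paginates via a query parameter such as `page` (that parameter name is just an assumption on my part, not something I have confirmed from the site); a minimal sketch of what I have in mind:

```python
BASE = 'https://www.bbc.com/news/coronavirus'

def page_urls(base_url, num_pages):
    # Build one URL per page; the `page` query parameter is an
    # assumed pagination scheme and may need adjusting to the site.
    return [f"{base_url}?page={i}" for i in range(1, num_pages + 1)]

# Each URL could then be fed through the same scraping loop as above:
# for url in page_urls(BASE, 5):
#     r = requests.get(url)
#     ...
```

If the site instead loads further updates via JavaScript, `requests` alone would not see them and a tool like Selenium might be needed.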
[Discussion]:
Tags: web-scraping beautifulsoup python-requests web-crawler