[Posted]: 2021-08-14 15:41:48
[Problem description]:
from bs4 import BeautifulSoup
from docx import Document
from docx.shared import Pt
import requests

user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"
url = "https://www.atlanticcouncil.org/events/?ac-timing=past"
data = requests.get(url, headers={"User-Agent": user_agent})
soup = BeautifulSoup(data.text, "lxml")

document = Document()
events = soup.find_all("div", class_="gta-embed--content gta-event-embed--content")
for event in events:
    event_name = event.find("h3", class_="gta-event-embed--title gta-embed--title")
    link = event.find("a")
    try:
        print(event_name.text)
        document.add_paragraph(event_name.text, style='List Bullet')
        print(link['href'])
        document.add_paragraph(link['href'])
    except (AttributeError, TypeError, KeyError):
        # Skip events that are missing a title or link
        continue
document.save('demo.docx')
URL 1: https://www.atlanticcouncil.org/events/?ac-timing=past&ac-page=1
URL 2: https://www.atlanticcouncil.org/events/?ac-timing=past&ac-page=2
I tried the following, but it didn't work. Is something wrong with it?
page = 1
while page != 6:
    url = f"https://www.atlanticcouncil.org/events/?ac-timing=past&ac-page={page}"
    print(url)
    page = page + 1
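A loop like the one above only builds and prints the URLs; each page still has to be fetched and parsed. Below is a minimal sketch combining the pagination loop with the scraping code from the question. The URL pattern and CSS classes are copied from the question; whether the site serves the same static HTML for every page number is an assumption:

```python
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://www.atlanticcouncil.org/events/?ac-timing=past&ac-page={page}"
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; WOW64)"  # shortened for brevity

def page_url(page):
    """Build the paginated events URL for a given page number."""
    return BASE_URL.format(page=page)

def scrape_page(page):
    """Fetch one results page and return a list of (title, href) pairs.

    Requires network access; the selectors are taken from the question
    and may break if the site's markup changes.
    """
    resp = requests.get(page_url(page), headers={"User-Agent": USER_AGENT})
    soup = BeautifulSoup(resp.text, "lxml")
    results = []
    for event in soup.find_all("div", class_="gta-embed--content gta-event-embed--content"):
        name = event.find("h3", class_="gta-event-embed--title gta-embed--title")
        link = event.find("a")
        if name is not None and link is not None and link.get("href"):
            results.append((name.get_text(strip=True), link["href"]))
    return results

# Usage (requires network):
# for page in range(1, 6):  # pages 1 through 5, as in the while loop
#     for title, href in scrape_page(page):
#         print(title, href)
```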
[Discussion]:
- In your second code block, instead of a while loop, use for page in range(1, 6):
- Isn't this what you're looking for? Those links work for me.
- @MattDMo I'm actually trying to scrape those URLs. I can scrape the first page, but I want the second page as well. Right now it only prints the URLs themselves.
- Well, that's what the code says. After you've built the URL, you still need to fetch and scrape it. I'll write you an answer.
Tags: python python-3.x web-scraping beautifulsoup