【Question Title】: Beautiful Soup multiple URLs
【Posted】: 2021-08-14 15:41:48
【Question】:
from bs4 import BeautifulSoup
from docx import Document
from docx.shared import Pt
import requests

user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"
url = "https://www.atlanticcouncil.org/events/?ac-timing=past"
data = requests.get(url, headers={"User-Agent": user_agent})
soup = BeautifulSoup(data.text, "lxml")

document = Document()

events = soup.find_all("div", class_ = "gta-embed--content gta-event-embed--content")
for event in events:
    event_name = event.find("h3", class_ = "gta-event-embed--title gta-embed--title")
    link = event.find("a")
    try:
        print(event_name.text)
        document.add_paragraph(event_name.text, style='List Bullet')
        print(link['href'])
        document.add_paragraph(link['href'])
    except:
        continue

document.save('demo.docx')

URL1 = https://www.atlanticcouncil.org/events/?ac-timing=past&ac-page=1
URL2 = https://www.atlanticcouncil.org/events/?ac-timing=past&ac-page=2
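Those page URLs differ only in the `ac-page` query parameter, so they can also be assembled with the standard library's `urlencode` instead of by hand (a minimal sketch; the base URL is taken from the question):

```python
from urllib.parse import urlencode

base = "https://www.atlanticcouncil.org/events/"

# urlencode turns a dict into a query string, percent-escaping as needed
query = urlencode({"ac-timing": "past", "ac-page": 2})
url = f"{base}?{query}"
print(url)  # https://www.atlanticcouncil.org/events/?ac-timing=past&ac-page=2
```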

I tried the following, but it did not work. Is it wrong?

page = 1
while page != 6:
    url = f"https://www.atlanticcouncil.org/events/?ac-timing=past&ac-page={page}"
    print(url)
    page = page + 1
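The while loop above does build the URLs (note that `page != 6` stops after page 5, so it only covers pages 1-5). The same idea can be sketched as a list comprehension over an explicit range:

```python
base = "https://www.atlanticcouncil.org/events/?ac-timing=past&ac-page={}"

# range(1, 6) yields 1..5, matching the while loop above;
# widen the range to cover more pages
urls = [base.format(page) for page in range(1, 6)]
for url in urls:
    print(url)
```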

【Discussion】:

Tags: python python-3.x web-scraping beautifulsoup


【Solution 1】:

Try this code:

from bs4 import BeautifulSoup
from docx import Document
from docx.shared import Pt
import requests

user_agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.37"

def scrape_and_write_to_docx(page):
    # Build the URL for the given results page and fetch it
    url = f"https://www.atlanticcouncil.org/events/?ac-timing=past&ac-page={page}"
    data = requests.get(url, headers={"User-Agent": user_agent})
    soup = BeautifulSoup(data.text, "lxml")
    # Each event card sits in a div with these two classes
    events = soup.find_all("div", class_="gta-embed--content gta-event-embed--content")
    for event in events:
        event_name = event.find("h3", class_="gta-event-embed--title gta-embed--title")
        link = event.find("a")
        try:
            print(event_name.text)
            document.add_paragraph(event_name.text, style='List Bullet')
            print(link['href'])
            document.add_paragraph(link['href'])
        except (AttributeError, TypeError):
            # Skip cards that are missing a title or a link
            continue

document = Document()

for page in range(1,7):
    scrape_and_write_to_docx(str(page))

document.save('demo.docx')

Mostly I just rearranged things. I moved the actual scraping and processing code into a function, then wrote a quick for loop to run pages 1-6, calling the function with each value of page. I open the new Document() at the start and close/save it at the end.

You will need to do some tweaking of the strings you write, because the resulting document does not look great, but it contains all the information you are looking for.

【Discussion】:
