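【Question title】: BeautifulSoup to scrape multiple links
【Posted】: 2021-04-10 23:31:09
【Question description】:

I want to scrape this website with BeautifulSoup: first extract every link, then open them one by one. After opening each one, I want to scrape the company name, ticker symbol, and stock exchange, and extract the PDF links (there may be several) when they are available. The results should then be written to a CSV file.

To do this, I first tried the following: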

import requests
from bs4 import BeautifulSoup
import re
import time

source_code = requests.get('https://www.responsibilityreports.co.uk/Companies?a=#')
soup = BeautifulSoup(source_code.content, 'lxml')
data = []
links = []
base = 'https://www.responsibilityreports.co.uk'

for link in soup.find_all('a', href=True):
    data.append(str(link.get('href')))
    print(link)
    try:
        for link in links:
            url = base + link
            req = requests.get(url)
            soup = BeautifulSoup(req.content, 'html.parser')
            for j in soup.find_all('a', href=True):
                print(j)
    except:
        pass

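As far as I can tell, this website does not prohibit crawlers. However, although the code does give me every link, I cannot open them, which stops my crawler from moving on to the remaining tasks.

Thanks in advance!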

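【Comments on the question】:

  • You are changing the soup object while you are iterating over it. You need to extract all the links into a list first, and only then start fetching them.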
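
The "collect first, fetch later" idea from this comment can be sketched roughly as below. Note also that in the question's code the hrefs are appended to `data` while the inner loop reads `links`, which stays empty, so no company page is ever requested. The helper name `collect_company_links` and the sample HTML are made up for illustration, and the `/Company/` prefix is an assumption based on the URLs shown in the answer's output.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE = "https://www.responsibilityreports.co.uk"


def collect_company_links(html, base=BASE):
    """Return absolute company-page URLs found in the index page HTML."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        href = a["href"]
        # Assumption: company pages live under /Company/ on this site.
        if href.startswith("/Company/"):
            url = urljoin(base, href)
            if url not in links:  # keep page order, skip duplicates
                links.append(url)
    return links


# Phase 1 finishes before any company page is requested, so the fetch
# loop (phase 2) works on a plain list and can reuse names like `soup`.
sample = '<a href="/Company/3i-group-plc">3i</a><a href="/Companies?a=B">B</a>'
print(collect_company_links(sample))
```

In a real run, `html` would be `requests.get(...).content` for the index page, and a second, separate loop would then request each URL in the returned list.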

Tags: python web-scraping beautifulsoup


【Solution 1】:

You can use this example to iterate over all the company links:

import requests
from bs4 import BeautifulSoup


url = "https://www.responsibilityreports.co.uk/Companies?a=#"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

links = [
    "https://www.responsibilityreports.co.uk" + a["href"]
    for a in soup.select('a[href^="/Company"]')
]

for link in links:
    soup = BeautifulSoup(requests.get(link).content, "html.parser")

    name = soup.select_one("h1").get_text(strip=True)
    ticker = soup.select_one(".ticker_name")
    if ticker:
        ticker = ticker.get_text(strip=True)
    else:
        ticker = "N/A"

    # extract other info...

    print(name)
    print(ticker)
    print(link)
    print("-" * 80)

Prints:

3i Group plc
III
https://www.responsibilityreports.co.uk/Company/3i-group-plc
--------------------------------------------------------------------------------
3M Corporation
MMM
https://www.responsibilityreports.co.uk/Company/3m-corporation
--------------------------------------------------------------------------------
AAON Inc.
AAON
https://www.responsibilityreports.co.uk/Company/aaon-inc
--------------------------------------------------------------------------------
ABB Ltd
ABB
https://www.responsibilityreports.co.uk/Company/abb-ltd
--------------------------------------------------------------------------------
Abbott Laboratories
ABT
https://www.responsibilityreports.co.uk/Company/abbott-laboratories
--------------------------------------------------------------------------------
Abbvie Inc
ABBV
https://www.responsibilityreports.co.uk/Company/abbvie-inc
--------------------------------------------------------------------------------
Abercrombie & Fitch
ANF
https://www.responsibilityreports.co.uk/Company/abercrombie-fitch
--------------------------------------------------------------------------------
ABM Industries, Inc.
ABM
https://www.responsibilityreports.co.uk/Company/abm-industries-inc
--------------------------------------------------------------------------------
Acadia Realty Trust
AKR
https://www.responsibilityreports.co.uk/Company/acadia-realty-trust
--------------------------------------------------------------------------------
Acciona
N/A
https://www.responsibilityreports.co.uk/Company/acciona
--------------------------------------------------------------------------------
ACCO Brands
ACCO
https://www.responsibilityreports.co.uk/Company/acco-brands
--------------------------------------------------------------------------------

...and so on.
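
The question also asks for PDF links and a CSV file, which the example above leaves open. The following is only a sketch of how that last step might look, using the standard library alone. The helpers `pdf_urls` and `write_companies`, the column names, and the sample hrefs are hypothetical, and the rule "a report link ends in .pdf" is an assumption that may not match the real pages. The stock exchange column is left out because the source does not show where it appears in the page markup.

```python
import csv
import io
from urllib.parse import urljoin

BASE = "https://www.responsibilityreports.co.uk"


def pdf_urls(hrefs, base=BASE):
    """Keep only hrefs ending in .pdf and make them absolute."""
    # Assumption: report links end in ".pdf"; the real pages may differ.
    return [urljoin(base, h) for h in hrefs if h.lower().endswith(".pdf")]


def write_companies(rows, fileobj):
    """Write one CSV row per company; several PDF links share one cell."""
    writer = csv.writer(fileobj)
    writer.writerow(["name", "ticker", "link", "pdf_links"])
    for name, ticker, link, pdfs in rows:
        writer.writerow([name, ticker, link, " | ".join(pdfs)])


# Inside the loop over company pages, the hrefs could be gathered with:
#   hrefs = [a["href"] for a in soup.find_all("a", href=True)]
hrefs = ["/files/report-2020.pdf", "/Company/acciona"]  # made-up sample data
rows = [("Acciona", "N/A", BASE + "/Company/acciona", pdf_urls(hrefs))]

# For a real file, use open("companies.csv", "w", newline="", encoding="utf-8").
buffer = io.StringIO()
write_companies(rows, buffer)
print(buffer.getvalue())
```

Joining the PDF links into a single cell keeps one row per company; writing one row per PDF instead would be an equally reasonable layout if the file is meant for filtering by report.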
