【Question Title】: How to extract all URLs in a website using BeautifulSoup
【Posted】: 2026-01-20 02:10:02
【Question Description】:

I am working on a project that needs to extract all of the links from a website. With this code I get all of the links from a single URL:

import requests
from bs4 import BeautifulSoup, SoupStrainer

source_code = requests.get('https://*.com/')
soup = BeautifulSoup(source_code.content, 'lxml')
links = []

for link in soup.find_all('a'):
    links.append(str(link))

The problem is that if I want to extract all of the URLs, I have to write another for loop, and then yet another one, and so on. I want to extract every URL that exists on the site and on its subdomains. Is there a way to do this without writing nested loops? Even with nested for loops, I don't know how many of them I would need to get every URL.

【Question Discussion】:

  • No, it isn't. The answers to that question are no longer valid either, because BeautifulSoup has changed since then.
  • @Mona Fine, so you need to use *'s API.
  • This is the second time you have deleted your answer :(((
  • I need an algorithm that works for every website.

Tags: python url web-scraping beautifulsoup web-crawler


【Solution 1】:

Wow, it took about 30 minutes to find a solution. I found a simple and effective way to do this. As @αԋɱҽԃ-αмєяιcαη mentioned, sometimes if your website links to a big site like Google, the crawl will not stop until your memory is full of data, so you should keep a few steps in mind:

  1. Create a while loop that walks through your website to extract all of the URLs
  2. Use exception handling to prevent crashes
  3. Remove duplicates and keep only distinct URLs
  4. Set a limit on the number of URLs, for example stop once 1000 URLs have been found
  5. Stop the while loop so that your PC's memory does not fill up

Here is a sample code; it should work fine. I actually tested it, and it was fun for me:

import requests
from bs4 import BeautifulSoup
import re

source_code = requests.get('https://*.com/')
soup = BeautifulSoup(source_code.content, 'lxml')
data = []
links = []


def remove_duplicates(l):
    # pull the URL out of each string and drop anything that is not a URL
    # (note: despite the name, this helper does not actually deduplicate)
    for item in l:
        match = re.search(r"(?P<url>https?://\S+)", item)
        if match is not None:
            links.append(match.group("url"))


# collect the href of every anchor on the start page
for link in soup.find_all('a', href=True):
    data.append(str(link.get('href')))
flag = True
remove_duplicates(data)

while flag:
    try:
        for link in links:
            for j in soup.find_all('a', href=True):
                temp = []
                source_code = requests.get(link)
                soup = BeautifulSoup(source_code.content, 'lxml')
                temp.append(str(j.get('href')))
                remove_duplicates(temp)

                if len(links) > 162:  # set a limit on the number of URLs
                    break
            if len(links) > 162:
                break
        if len(links) > 162:
            break  # the while loop only ever exits through these breaks
    except Exception as e:
        print(e)
        if len(links) > 162:
            break

for url in links:
    print(url)

The output will be:

https://*.com
https://www.*business.com/talent
https://www.*business.com/advertising
https://*.com/users/login?ssrc=head&returnurl=https%3a%2f%2f*.com%2f
https://*.com/users/signup?ssrc=head&returnurl=%2fusers%2fstory%2fcurrent
https://*.com
https://*.com
https://*.com/help
https://chat.*.com
https://meta.*.com
https://*.com/users/signup?ssrc=site_switcher&returnurl=%2fusers%2fstory%2fcurrent
https://*.com/users/login?ssrc=site_switcher&returnurl=https%3a%2f%2f*.com%2f
https://stackexchange.com/sites
https://*.blog
https://*.com/legal/cookie-policy
https://*.com/legal/privacy-policy
https://*.com/legal/terms-of-service/public
https://*.com/teams
https://*.com/teams
https://www.*business.com/talent
https://www.*business.com/advertising
https://www.g2.com/products/stack-overflow-for-teams/
https://www.g2.com/products/stack-overflow-for-teams/
https://www.fastcompany.com/most-innovative-companies/2019/sectors/enterprise
https://www.*business.com/talent
https://www.*business.com/advertising
https://*.com/questions/55884514/what-is-the-incentive-for-curl-to-release-the-library-for-free/55885729#55885729
https://insights.*.com/
https://*.com
https://*.com
https://*.com/jobs
https://*.com/jobs/directory/developer-jobs
https://*.com/jobs/salary
https://www.*business.com
https://*.com/teams
https://www.*business.com/talent
https://www.*business.com/advertising
https://*.com/enterprise
https://*.com/company/about
https://*.com/company/about
https://*.com/company/press
https://*.com/company/work-here
https://*.com/legal
https://*.com/legal/privacy-policy
https://*.com/company/contact
https://stackexchange.com
https://*.com
https://serverfault.com
https://superuser.com
https://webapps.stackexchange.com
https://askubuntu.com
https://webmasters.stackexchange.com
https://gamedev.stackexchange.com
https://tex.stackexchange.com
https://softwareengineering.stackexchange.com
https://unix.stackexchange.com
https://apple.stackexchange.com
https://wordpress.stackexchange.com
https://gis.stackexchange.com
https://electronics.stackexchange.com
https://android.stackexchange.com
https://security.stackexchange.com
https://dba.stackexchange.com
https://drupal.stackexchange.com
https://sharepoint.stackexchange.com
https://ux.stackexchange.com
https://mathematica.stackexchange.com
https://salesforce.stackexchange.com
https://expressionengine.stackexchange.com
https://pt.*.com
https://blender.stackexchange.com
https://networkengineering.stackexchange.com
https://crypto.stackexchange.com
https://codereview.stackexchange.com
https://magento.stackexchange.com
https://softwarerecs.stackexchange.com
https://dsp.stackexchange.com
https://emacs.stackexchange.com
https://raspberrypi.stackexchange.com
https://ru.*.com
https://codegolf.stackexchange.com
https://es.*.com
https://ethereum.stackexchange.com
https://datascience.stackexchange.com
https://arduino.stackexchange.com
https://bitcoin.stackexchange.com
https://sqa.stackexchange.com
https://sound.stackexchange.com
https://windowsphone.stackexchange.com
https://stackexchange.com/sites#technology
https://photo.stackexchange.com
https://scifi.stackexchange.com
https://graphicdesign.stackexchange.com
https://movies.stackexchange.com
https://music.stackexchange.com
https://worldbuilding.stackexchange.com
https://video.stackexchange.com
https://cooking.stackexchange.com
https://diy.stackexchange.com
https://money.stackexchange.com
https://academia.stackexchange.com
https://law.stackexchange.com
https://fitness.stackexchange.com
https://gardening.stackexchange.com
https://parenting.stackexchange.com
https://stackexchange.com/sites#lifearts
https://english.stackexchange.com
https://skeptics.stackexchange.com
https://judaism.stackexchange.com
https://travel.stackexchange.com
https://christianity.stackexchange.com
https://ell.stackexchange.com
https://japanese.stackexchange.com
https://chinese.stackexchange.com
https://french.stackexchange.com
https://german.stackexchange.com
https://hermeneutics.stackexchange.com
https://history.stackexchange.com
https://spanish.stackexchange.com
https://islam.stackexchange.com
https://rus.stackexchange.com
https://russian.stackexchange.com
https://gaming.stackexchange.com
https://bicycles.stackexchange.com
https://rpg.stackexchange.com
https://anime.stackexchange.com
https://puzzling.stackexchange.com
https://mechanics.stackexchange.com
https://boardgames.stackexchange.com
https://bricks.stackexchange.com
https://homebrew.stackexchange.com
https://martialarts.stackexchange.com
https://outdoors.stackexchange.com
https://poker.stackexchange.com
https://chess.stackexchange.com
https://sports.stackexchange.com
https://stackexchange.com/sites#culturerecreation
https://mathoverflow.net
https://math.stackexchange.com
https://stats.stackexchange.com
https://cstheory.stackexchange.com
https://physics.stackexchange.com
https://chemistry.stackexchange.com
https://biology.stackexchange.com
https://cs.stackexchange.com
https://philosophy.stackexchange.com
https://linguistics.stackexchange.com
https://psychology.stackexchange.com
https://scicomp.stackexchange.com
https://stackexchange.com/sites#science
https://meta.stackexchange.com
https://stackapps.com
https://api.stackexchange.com
https://data.stackexchange.com
https://*.blog?blb=1
https://www.facebook.com/official*/
https://twitter.com/*
https://linkedin.com/company/stack-overflow
https://creativecommons.org/licenses/by-sa/4.0/
https://*.blog/2009/06/25/attribution-required/
https://*.com
https://www.*business.com/talent
https://www.*business.com/advertising

Process finished with exit code 0

I set the limit to 162; you can increase it as much as you want and as your memory allows.

【Discussion】:

  • Thank you so much, you saved my day :) The code is long and a bit dirty, but as you said, it works just fine.
  • You're welcome @mona. If you had googled your question before asking it in this community, you would probably have found a solution already; anyway, glad to help.
  • Wait, what is going on with remove_duplicates()?! Why not just extract the URLs and put them into a set? (A sketch of that approach follows this discussion.)
  • @alexander-cécile Yes, my code is ugly; I'm a bit busy, so I will edit it tomorrow. As for checking if len(links) > 162, I did it to test the condition after every step, although I know it isn't strictly necessary.
  • @alexander-cécile If you have some spare time, I would be glad if you edited my answer; otherwise I will edit it later.
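
Following the set suggestion above, here is a minimal, untested sketch of the same crawl built around a set and a queue instead of the remove_duplicates() helper; the crawl() name, the max_links parameter and the timeout value are illustrative choices, not part of the original answer:

import re
from collections import deque

import requests
from bs4 import BeautifulSoup


def crawl(start_url, max_links=162):
    # breadth-first crawl that stores every absolute URL it finds in a set
    seen = set()                    # a set removes duplicates automatically
    queue = deque([start_url])
    url_pattern = re.compile(r"https?://\S+")

    while queue and len(seen) < max_links:
        page = queue.popleft()
        try:
            response = requests.get(page, timeout=10)
        except requests.RequestException as error:
            print(error)  # exception handling keeps the crawl alive
            continue
        soup = BeautifulSoup(response.content, 'lxml')
        for a in soup.find_all('a', href=True):
            match = url_pattern.search(a['href'])
            if match is not None and match.group() not in seen:
                seen.add(match.group())
                queue.append(match.group())
    return seen


# for url in crawl('https://*.com/'):
#     print(url)
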
【Solution 2】:

How about this?

import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

source_code = requests.get('https://*.com/')
doc = SimplifiedDoc(source_code.content.decode('utf-8'))  # pass in the HTML string
lst = doc.listA(url='https://*.com/')  # get all the links
for a in lst:
    if a['url'].find('*.com') > 0:  # keep links to the sub-domains
        print(a['url'])

You can also use this scraping framework; it can help you with a lot of things:

from simplified_scrapy.spider import Spider, SimplifiedDoc
from simplified_scrapy.simplified_main import SimplifiedMain


class DemoSpider(Spider):
    name = 'demo-spider'
    start_urls = ['http://www.example.com/']
    allowed_domains = ['example.com/']

    def extract(self, url, html, models, modelNames):
        doc = SimplifiedDoc(html)
        lstA = doc.listA(url=url["url"])  # collect every link on the page
        return [{"Urls": lstA, "Data": None}]


SimplifiedMain.startThread(DemoSpider())

【Discussion】:

【Solution 3】:

Well, what you are asking for is actually possible, but it means an infinite loop that will keep running until your memory goes BoOoOoOm.

Anyway, the idea should be something like this:

  • Use for item in soup.findAll('a') and then item.get('href').

  • Add the hrefs to a set to eliminate duplicate URLs, and use an if condition with is not None to get rid of None objects.

  • Then keep looping until nothing new gets added to the set, checking its size with something like len(urls) (a rough sketch of the idea follows below).
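
A rough, untested sketch of that idea might look like the following; the collect_urls() name, the visited set, and the use of urljoin to resolve relative hrefs are illustrative additions, not something given in the answer:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin


def collect_urls(start_url):
    found = set()    # the set eliminates duplicate URLs
    visited = set()  # pages that have already been downloaded
    found.add(start_url)

    # keep looping while there is something new in the set left to visit
    while found - visited:
        url = (found - visited).pop()
        visited.add(url)
        try:
            html = requests.get(url, timeout=10).content
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, 'lxml')
        for item in soup.findAll('a'):
            href = item.get('href')
            if href is not None:               # get rid of None objects
                found.add(urljoin(url, href))  # resolve relative paths too
    return found

As the answer warns, on a large site this loop only stops once it runs out of new links (or memory), so in practice you would add a cap like the one in Solution 1.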

【Discussion】:

  • This is not an answer to my question. For example, * has more than 10,000,000 questions, and I need code that extracts every URL that exists, including the URLs of all * posts and so on.
  • Could you give me some code so I can see how to implement this idea?
  • @Mona That is something you need to implement yourself, because you will need try/except, timeout, threading... a lot of things! If the href holds only the path of the URL, e.g. /file1/file2/, then you will need an f-string like f"www.site.com/{url}" (or the standard urljoin, see the sketch after this discussion), and too many other things that one has to work out on one's own.
  • @αԋɱҽԃαмєяιcαη I googled the RAM usage for this problem: with 12 GB of RAM you can hold roughly 128,849,018 URLs (at 100 characters each, 12 × 1024³ bytes ÷ 100) in memory as variables, so I think it won't be a problem.
  • @Ali Fine, so you will just run your program forever and wait for it to finish, because you are talking about a plain for loop running without threading, with only a single variable to keep it running or stop it. With threading the situation would be different.
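
For the relative-path case mentioned in the comment above, the standard library's urllib.parse.urljoin can do the joining instead of a hand-built f-string; the base URL below is only a made-up example:

from urllib.parse import urljoin

base = "https://www.site.com/section/page.html"  # hypothetical page the href was found on
print(urljoin(base, "/file1/file2/"))  # https://www.site.com/file1/file2/
print(urljoin(base, "other.html"))     # https://www.site.com/section/other.html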