[Question Title]: How to open a URL and get its content using a web crawler [duplicate]
[Posted]: 2021-11-30 18:34:19
[Question Description]:

I am trying to use a web crawler to collect news content from the Sport, Home, World, Business and Tech sections. I have the code below, which fetches the page's headlines and URLs. How can I take each page's URL, open it, and extract the content of its body?

# python code
import re

import requests
from bs4 import BeautifulSoup

url = "https://www.aaa.com"
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
headlines = soup.find('body').find_all('h3')

for title in soup.find_all('a', href=True):
    # keep only hrefs that end in digits
    if re.search(r"\d+$", title['href']):
        print(title['href'])

[Question Discussion]:

    Tags: python web-crawler


    [Solution 1]:

    You have to join the base URL onto the href you extract, then make a new request.
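    A safer way to do that join than plain string concatenation is `urllib.parse.urljoin`, which also leaves already-absolute hrefs alone. A minimal sketch (the example paths are made up):

```python
from urllib.parse import urljoin

base = 'https://www.bbc.com'

# a relative href is resolved against the base URL
print(urljoin(base, '/news/world-59442149'))
# -> https://www.bbc.com/news/world-59442149

# an already-absolute href is returned unchanged
print(urljoin(base, 'https://www.bbc.co.uk/sport'))
# -> https://www.bbc.co.uk/sport
```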

    for title in soup.find_all('a', href=True): 
        if re.search(r"\d+$", title['href']):
            
            page = requests.get('https://www.bbc.com'+title['href'])
            soup = BeautifulSoup(page.content, 'html.parser')
            print(soup.h1.text)
    
    Notes
    • Your regex does not work quite as intended, so be careful

    • Try to scrape gently, e.g. use the time module to add some delay

    • Some URLs appear more than once
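    To illustrate the last two points, the digit-suffix filter and duplicate removal can be sketched on a few hypothetical hrefs (the paths below are invented for the example):

```python
import re

# hypothetical hrefs as they might appear on a news front page
hrefs = [
    '/news/world-59442149',
    '/news/world-59442149',    # duplicate link to the same story
    '/news/live/world-12345',  # also ends in digits, so the regex keeps it
    '/sport',                  # no trailing digits -> filtered out
]

# keep only hrefs ending in digits, dropping duplicates while preserving order
seen = set()
article_hrefs = []
for href in hrefs:
    if re.search(r'\d+$', href) and href not in seen:
        seen.add(href)
        article_hrefs.append(href)

print(article_hrefs)  # ['/news/world-59442149', '/news/live/world-12345']
```

    Note that `\d+$` matches anything ending in digits, including live-coverage pages you may want to treat differently.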

    Example (with some adjustments)

    This will print the first 150 characters of each article:

    import time

    import requests
    from bs4 import BeautifulSoup

    baseurl = 'https://www.bbc.com'

    def get_soup(url):
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        return soup

    def get_urls(url):
        urls = []
        # ':has(h3)' needs bs4 >= 4.7 (soupsieve); older versions raise NotImplementedError
        for link in get_soup(url).select('a:has(h3)'):
            if url.split('/')[-1] in link['href']:
                urls.append(baseurl + link['href'])
        urls = list(set(urls))  # drop duplicated URLs
        return urls

    def get_news(url):
        for url in get_urls(url):
            item = get_soup(url)
            print(item.article.text[:150] + '...')
            time.sleep(2)  # scrape gently

    get_news('https://www.bbc.com/news')
    

    Output

    New Omicron variant: Does southern Africa have enough vaccines?By Rachel Schraer & Jake HortonBBC Reality CheckPublished1 day agoSharecloseShare pageC...
    Ghislaine Maxwell: Epstein pilot testifies he flew Prince AndrewPublished9 minutes agoSharecloseShare pageCopy linkAbout sharingRelated TopicsJeffrey ...
    New mothers who died of herpes could have been infected by one surgeonBy James Melley & Michael BuchananBBC NewsPublished22 NovemberSharecloseShare pa...
    Parag Agrawal: India celebrates new Twitter CEOPublished9 hours agoSharecloseShare pageCopy linkAbout sharingImage source, TwitterImage caption, Parag...
    

    [Discussion]:

    • Thanks, I tried the example code but got NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type, in the for loop of the get_urls function
    • Is your bs4 version up to date? pip install beautifulsoup --upgrade
    • When I ran that I got ERROR: Could not find a version that satisfies the requirement beautifulsoup (from versions: 3.2.0, 3.2.1, 3.2.2) ERROR: No matching distribution found for beautifulsoup
    • But I tried reinstalling it! With python3 -m pip install beautifulsoup4 I got this output: Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.7/dist-packages (4.6.3)