[Question Title]: How to open a URL and get its content using a web crawler [duplicate]
[Posted]: 2021-11-30 18:34:19
[Question Description]:

I am trying to use a web crawler to collect news content from the Sport, Home, World, Business and Tech sections. I have the code below, which fetches the page's headlines and URLs. How can I take each page's URL, open it, and extract the content of its body?

# python code
import re

import requests
from bs4 import BeautifulSoup

url = "https://www.aaa.com"
page = requests.get(url)

soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
headlines = soup.find('body').find_all('h3')

for title in soup.find_all('a', href=True):
    # keep only hrefs that end in digits
    if re.search(r"\d+$", title['href']):
        print(title['href'])

[Question Discussion]:

    Tags: python web-crawler


    [Solution 1]:

    You have to join the base URL onto the href you extract, then make a new request.
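    A safer way to do that join than plain string concatenation is `urllib.parse.urljoin`, which also leaves already-absolute hrefs alone. A minimal sketch (the example paths are made up):

```python
from urllib.parse import urljoin

base = 'https://www.bbc.com'

# a relative href is resolved against the base URL
print(urljoin(base, '/news/world-59442149'))
# -> https://www.bbc.com/news/world-59442149

# an already-absolute href is returned unchanged
print(urljoin(base, 'https://www.bbc.co.uk/sport'))
# -> https://www.bbc.co.uk/sport
```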

    for title in soup.find_all('a', href=True): 
        if re.search(r"\d+$", title['href']):
            
            page = requests.get('https://www.bbc.com'+title['href'])
            soup = BeautifulSoup(page.content, 'html.parser')
            print(soup.h1.text)
    
    Notes
    • Your regex does not work quite as intended, so be careful

    • Try to scrape gently, e.g. use the time module to add some delay

    • Some URLs appear more than once
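    To illustrate the last two points, the digit-suffix filter and duplicate removal can be sketched on a few hypothetical hrefs (the paths below are invented for the example):

```python
import re

# hypothetical hrefs as they might appear on a news front page
hrefs = [
    '/news/world-59442149',
    '/news/world-59442149',    # duplicate link to the same story
    '/news/live/world-12345',  # also ends in digits, so the regex keeps it
    '/sport',                  # no trailing digits -> filtered out
]

# keep only hrefs ending in digits, dropping duplicates while preserving order
seen = set()
article_hrefs = []
for href in hrefs:
    if re.search(r'\d+$', href) and href not in seen:
        seen.add(href)
        article_hrefs.append(href)

print(article_hrefs)  # ['/news/world-59442149', '/news/live/world-12345']
```

    Note that `\d+$` matches anything ending in digits, including live-coverage pages you may want to treat differently.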

    Example (with some adjustments)

    This will print the first 150 characters of each article:

    import time

    import requests
    from bs4 import BeautifulSoup

    baseurl = 'https://www.bbc.com'

    def get_soup(url):
        page = requests.get(url)
        soup = BeautifulSoup(page.text, 'html.parser')
        return soup

    def get_urls(url):
        urls = []
        # ':has(h3)' needs bs4 >= 4.7 (soupsieve); older versions raise NotImplementedError
        for link in get_soup(url).select('a:has(h3)'):
            if url.split('/')[-1] in link['href']:
                urls.append(baseurl + link['href'])
        urls = list(set(urls))  # drop duplicated URLs
        return urls

    def get_news(url):
        for url in get_urls(url):
            item = get_soup(url)
            print(item.article.text[:150] + '...')
            time.sleep(2)  # scrape gently

    get_news('https://www.bbc.com/news')
    

    Output

    New Omicron variant: Does southern Africa have enough vaccines?By Rachel Schraer & Jake HortonBBC Reality CheckPublished1 day agoSharecloseShare pageC...
    Ghislaine Maxwell: Epstein pilot testifies he flew Prince AndrewPublished9 minutes agoSharecloseShare pageCopy linkAbout sharingRelated TopicsJeffrey ...
    New mothers who died of herpes could have been infected by one surgeonBy James Melley & Michael BuchananBBC NewsPublished22 NovemberSharecloseShare pa...
    Parag Agrawal: India celebrates new Twitter CEOPublished9 hours agoSharecloseShare pageCopy linkAbout sharingImage source, TwitterImage caption, Parag...
    

    [Discussion]:

    • Thanks, I tried the example code but got NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type, in the for loop of the get_urls function
    • Is your bs4 version up to date? pip install beautifulsoup --upgrade
    • When I ran that I got ERROR: Could not find a version that satisfies the requirement beautifulsoup (from versions: 3.2.0, 3.2.1, 3.2.2) ERROR: No matching distribution found for beautifulsoup
    • But I tried reinstalling it! With python3 -m pip install beautifulsoup4 I got this output: Requirement already satisfied: beautifulsoup4 in /usr/local/lib/python3.7/dist-packages (4.6.3)