【Question Title】: How do I extract data from linked pages in websites using Python
【Posted】: 2021-06-04 08:43:31
【Question Description】:

I have been trying to scrape data from a web page for a data analysis project, and I have successfully pulled data from a single page.

import requests
from bs4 import BeautifulSoup
import concurrent.futures
from urllib.parse import urlencode
from scraper_api import ScraperAPIClient

client = ScraperAPIClient('key')
results = client.get(url="https://www.essex.ac.uk/course-search?query=&f.Level%7CcourseLevel=Undergraduate").text

print(results)

Using the site "https://www.essex.ac.uk/course-search?query=&f.Level%7CcourseLevel=Undergraduate" as an example, I need to navigate into each course and get a piece of data called duration from that course's page.

【Question Discussion】:

    Tags: python python-3.x web web-scraping web-scraping-language


    【Solution 1】:

    Try the following:

    client = ScraperAPIClient('key')
    results = []
    for i in range(10):
        # start_rank takes the values 1, 11, 21, ... to step through pages
        results.append(client.get(url=f"https://www.essex.ac.uk/course-search?query=&f.Level%7CcourseLevel=Undergraduate&start_rank={i * 10 + 1}").text)

    print(results)
    

    This loops through 10 result pages and appends each text response to the results list.
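    Both code samples in this thread import concurrent.futures without using it. Since the page URLs are independent of each other, the fetches could also run concurrently. A minimal sketch, with the network call replaced by a stub so the example runs offline (the real code would call client.get(url).text instead):

    ```python
    import concurrent.futures

    def fetch(url):
        # Stub standing in for client.get(url).text from scraper_api,
        # so this sketch runs without a network call or an API key.
        return f"<html>page for {url}</html>"

    # Build the 10 page URLs up front, stepping start_rank by 10.
    urls = [
        "https://www.essex.ac.uk/course-search"
        f"?query=&f.Level%7CcourseLevel=Undergraduate&start_rank={i * 10}"
        for i in range(10)
    ]

    # map() preserves input order, so results[i] matches urls[i].
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
        results = list(pool.map(fetch, urls))

    print(len(results))  # 10
    ```

    Threads suit this workload because the time is spent waiting on HTTP responses, not on CPU work.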

    【Discussion】:

      【Solution 2】:
      import requests
      from bs4 import BeautifulSoup
      import concurrent.futures
      from urllib.parse import urlencode
      from scraper_api import ScraperAPIClient

      client = ScraperAPIClient('key')
      total_pages = 12
      for page_no in range(total_pages):
          # You control this page_no variable.
          # Open the website and watch how it moves to the next page:
          # it depends on the 'start_rank' parameter at the end of the URL.
          # For example, start_rank=10 then start_rank=20 fetch one page after another.
          rank = page_no * 10
          results = client.get(url="https://www.essex.ac.uk/course-search?query=&f.Level%7CcourseLevel=Undergraduate&start_rank={0}".format(rank)).text
          print(results)

      【Discussion】:
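      Both answers only paginate the search results; the question also asks how to follow each course link and read its duration. A sketch of that second step, using BeautifulSoup (already imported in the question): note that the "/courses/" URL filter and the "duration" class name are guesses about the Essex site's markup, not verified selectors, so inspect the real pages and adjust them.

      ```python
      from bs4 import BeautifulSoup

      def extract_course_links(html):
          # Collect course-page URLs from a search-results page.
          # Assumes course links contain '/courses/'; adjust to the real markup.
          soup = BeautifulSoup(html, "html.parser")
          return [a["href"] for a in soup.find_all("a", href=True)
                  if "/courses/" in a["href"]]

      def extract_duration(html):
          # Pull the text of a hypothetical element with class 'duration'
          # from a course page; returns None if no such element exists.
          soup = BeautifulSoup(html, "html.parser")
          tag = soup.find(class_="duration")
          return tag.get_text(strip=True) if tag else None

      # Demo on small inline snippets standing in for client.get(...).text:
      search_html = '<a href="/courses/ug123/bsc-data-science">BSc Data Science</a>'
      course_html = '<span class="duration">3 years</span>'

      print(extract_course_links(search_html))  # ['/courses/ug123/bsc-data-science']
      print(extract_duration(course_html))      # 3 years
      ```

      In the real flow you would call extract_course_links on each paginated results page, fetch every returned URL through the ScraperAPI client, and run extract_duration on each response.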
