【Question Title】: Web scraping multiple pages using Python
【Posted】: 2018-06-23 10:03:26
【Question Description】:

I am having trouble scraping the webpage.

The URL's start value begins at 1 and increases in steps of 30. The site has many pages listing secondary schools in Kenya, 30 schools per page. I want to scrape all of the data with the code below, but it only returns the content of one page, i.e. 30 schools. I have applied string formatting to the URL, but it still returns a single page's data. My code:

#IMPORTING RELEVANT PACKAGES FOR THE WORK
import requests
from bs4 import BeautifulSoup
import time

#DEFINING THE FIRST WEBPAGE
num = 1
#STRING FORMATTING THE URL TO CAPTURE DIFFERENT PAGES
url = 'https://www.kenyaplex.com/schools/?start={}&SchoolType=private-secondary-schools'.format(num)
#DEFINING THE BROWSER HEADERS SO THAT IT CAN WORK WITHOUT ERRORS
headers = {'User-Agent':'Mozilla'}
#GOING THROUGH ALL THE PAGES AND THE LINKS
while num < 452:
    url = 'https://www.kenyaplex.com/schools/?start={}&SchoolType=private-secondary-schools'.format(num)
    time.sleep(1)
    num += 30
    response = requests.get(url,headers)
    soup = BeautifulSoup(response.text,'html.parser')
    school_info = soup.find_all('div', attrs={'class':'c-detail'})
#EXTRACTING SPECIFIC RECORDS    
records = []
for name in school_info:
    Name_of_The_School = name.find('a').text
    Location_of_The_School = name.contents[2][2:]
    Contact_of_The_School = name.contents[4]
    Information_Link = name.find('a')['href']
#converting the records to a tuple
    records.append((Name_of_The_School,
                    Location_of_The_School,
                    Contact_of_The_School,
                    Information_Link))
#EXPORTING TO A PANDAS FILE    
import pandas as pd
df = pd.DataFrame(records, columns = ['Name of The School',
                                      'Location of The School',
                                      'Contact of The School',
                                      'Information_Link'])
df.to_csv('PRIVATE_SECONDARY.csv', index = False, encoding = 'utf-8')

【Discussion】:

    Tags: python pandas


    【Solution 1】:

    Move `records = []` above the `while` loop and pull the `for` loop inside it:

    records = []
    while num < 452:
        url = 'https://www.kenyaplex.com/schools/?start={}&SchoolType=private-secondary-schools'.format(num)
        time.sleep(1)
        num += 30
        response = requests.get(url,headers)
        soup = BeautifulSoup(response.text,'html.parser')
        school_info = soup.find_all('div', attrs={'class':'c-detail'})
        #EXTRACTING SPECIFIC RECORDS    
        for name in school_info:
            Name_of_The_School = name.find('a').text
            Location_of_The_School = name.contents[2][2:]
            Contact_of_The_School = name.contents[4]
            Information_Link = name.find('a')['href']
            #converting the records to a tuple
            records.append((Name_of_The_School,
                            Location_of_The_School,
                            Contact_of_The_School,
                            Information_Link))
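    As a quick offline check that this loop covers every page, the `start` values visited by `num` (beginning at 1, stepping by 30, while `num < 452`) can be enumerated without touching the network:

    ```python
    # Page offsets visited by the while loop: 1, 31, 61, ..., 451
    starts = list(range(1, 452, 30))

    base = 'https://www.kenyaplex.com/schools/?start={}&SchoolType=private-secondary-schools'
    urls = [base.format(n) for n in starts]

    # 16 pages in total, from start=1 to start=451
    print(starts[0], starts[-1], len(starts))  # 1 451 16
    ```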
    

    【Discussion】:

      【Solution 2】:

      A simple logic error: each iteration of the `while` loop overwrites the local variable `school_info`, so by the time the subsequent `for` loop runs, only the last page's results from the `while` loop remain.
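      The overwrite can be reproduced in miniature, independent of scraping (`pages` here is a stand-in for the per-page `find_all` results):

      ```python
      pages = [['school A'], ['school B'], ['school C']]

      # Rebinding the variable each iteration discards the previous page.
      school_info = []
      for page in pages:
          school_info = page   # overwritten, not accumulated

      print(school_info)  # ['school C'] -- only the last page survives

      # extend() accumulates across iterations instead of replacing:
      school_info = []
      for page in pages:
          school_info.extend(page)

      print(school_info)  # ['school A', 'school B', 'school C']
      ```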

      I took the liberty of restructuring it:

      import requests
      from bs4 import BeautifulSoup
      import time
      import pandas as pd
      
      headers = {'User-Agent':'Mozilla'}
      
      def get_url(batch):
          return 'https://www.kenyaplex.com/schools/?start={}&SchoolType=private-secondary-schools'.format(batch)
      
      school_data = []
      records = []
      
      for batch in range(1, 453, 30):  # the scraper collects the results per page
          response = requests.get(get_url(batch), headers=headers)
          soup = BeautifulSoup(response.text, 'html.parser')
          school_info = soup.find_all('div', attrs={'class': 'c-detail'})
          school_data.extend(school_info)
          time.sleep(1)  # throttle between requests, not between parsed records
      
      for name in school_data:  # further parsing and records collection
          Name_of_The_School = name.find('a').text
          Location_of_The_School = name.contents[2][2:]
          Contact_of_The_School = name.contents[4]
          Information_Link = name.find('a')['href']
          records.append((Name_of_The_School, Location_of_The_School, Contact_of_The_School, Information_Link))
      
      df = pd.DataFrame(records, columns=['Name of The School', 'Location of The School', 'Contact of The School', 'Information_Link'])
      df.to_csv('PRIVATE_SECONDARY.csv', index=False, encoding='utf-8')
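      A side note on the original call `requests.get(url, headers)`: the second positional parameter of `requests.get` is `params`, not `headers`, so the dict is appended to the query string rather than sent as request headers. The keyword form `headers=headers` is what was intended. This can be verified offline by preparing the request without sending it:

      ```python
      import requests

      headers = {'User-Agent': 'Mozilla'}
      url = 'https://www.kenyaplex.com/schools/'

      # Positional dict lands in `params` and ends up in the query string.
      wrong = requests.Request('GET', url, params=headers).prepare()
      print(wrong.url)                        # .../schools/?User-Agent=Mozilla
      print('User-Agent' in wrong.headers)    # False -- no header was set

      # Keyword form actually sets the header.
      right = requests.Request('GET', url, headers=headers).prepare()
      print(right.headers['User-Agent'])      # Mozilla
      ```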
      

      【Discussion】:
