【Question title】: Need help with pagination in web scraping
【Posted】: 2021-09-06 10:58:32
【Question】:

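I am very new to Python and web scraping. Could anyone help with the pagination part of this website?

Website - https://www.dailydac.com/chapter-11-bankruptcy-alert-system/?cpage=1

I am able to scrape the data, i.e. the company name and date, from the first page. Please help me scrape the data from multiple pages.

Here is my code: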

from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
from selenium.webdriver.support.select import Select
import time

driver=webdriver.Chrome(executable_path='C:\\Users\\chromedriver_win32\\chromedriver.exe')
driver.get('https://www.dailydac.com/chapter-11-bankruptcy-alert-system/?cpage=1')
driver.maximize_window()
time.sleep(1)

# append the data to list
CompanyName=driver.find_elements_by_xpath('/html/body/div[1]/div[3]/div[2]/section/section/table/tbody/tr/td[4]')
Date=driver.find_elements_by_xpath('/html/body/div[1]/div[3]/div[2]/section/section/table/tbody/tr/td[1]')


Name = []
for i in range(len(CompanyName)):
    Name.append(CompanyName[i].text)

data = pd.DataFrame(Name)

Date_ = []
for i in range(len(Date)):
    Date_.append(Date[i].text)

data['Date_'] = Date_  # use the list of date strings, not the WebElement list
data

【Comments】:

    Tags: python python-3.x pandas selenium web-scraping


    【Solution 1】:

    You can also use the "next" link to move to the next page and scrape the details from each page.

    from selenium import webdriver
    import time
    
    driver = webdriver.Chrome(executable_path="path to chromedriver.exe")
    driver.maximize_window()
    driver.implicitly_wait(10)
    driver.get("https://www.dailydac.com/chapter-11-bankruptcy-alert-system/?cpage=1")
    
    details = []
    for i in range(3): # for 1st 3 pages, increase the range to scrape more pages.
        tables = driver.find_elements_by_xpath("//table[@class='filings-table']/tbody/tr") # Find individual rows 
        print(len(tables))
        for table in tables: # Extract details from all rows.
            company = table.find_element_by_xpath(".//td[4]").text # Extract Company name from that row
            date = table.find_element_by_xpath(".//td[1]").text # Extract date from that row
            details.append([company,date])
        driver.find_element_by_xpath("//a[contains(@class,'next')]").click() # Find and click on next page.
        time.sleep(2)
    print(len(details))
    for i in range(len(details)):
        print(details[i])
    driver.quit()
    

    Output:

    50
    50
    50
    150
    ['3rd Rock Logistics, LLC', '09/03/2021']
    ['3rd Rock Holdings, Inc.', '09/03/2021']
    ['Philippine Airlines, Inc.', '09/03/2021']
    ['Bennett Rosa, LLC', '09/03/2021']
    ['James David Theros', '09/03/2021']
    ...
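
The loop above clicks "next" a fixed number of times. If you would rather stop when the site runs out of pages, one possible variant is sketched below. To keep it runnable without a browser, the page access is passed in as two callables; `scrape_all_pages`, `FakeSite`, `read_rows` and `go_to_next_page` are names made up for this sketch, and the two sample "pages" just reuse rows from the output above.

```python
def scrape_all_pages(read_rows, go_to_next_page, max_pages=100):
    """Collect rows page by page until there is no next page (or max_pages is hit)."""
    details = []
    for _ in range(max_pages):
        details.extend(read_rows())   # rows of the page currently shown
        if not go_to_next_page():     # False means there was no "next" link
            break
    return details


class FakeSite:
    """Stand-in for the browser so the sketch runs without Selenium."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def read_rows(self):
        return list(self.pages[self.index])

    def go_to_next_page(self):
        if self.index + 1 >= len(self.pages):
            return False              # last page: nothing to click
        self.index += 1
        return True


site = FakeSite([
    [["3rd Rock Logistics, LLC", "09/03/2021"], ["3rd Rock Holdings, Inc.", "09/03/2021"]],
    [["Philippine Airlines, Inc.", "09/03/2021"]],
])
details = scrape_all_pages(site.read_rows, site.go_to_next_page)
print(len(details))
```

With Selenium, `read_rows` would wrap the row-extraction loop from the answer, and `go_to_next_page` would click the "next" link and return False when Selenium raises `NoSuchElementException` because the link is missing. The `max_pages` cap is only a safety net against an endless loop.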
    

    【Comments】:

      【Solution 2】:

      If you look at the URL of the first page, it is

      https://www.dailydac.com/chapter-11-bankruptcy-alert-system/?cpage=1
      

      To view the contents of page 2, change the value of cpage to 2, and so on for later pages.
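
As a small illustration of that idea, the page URLs can be generated up front from the page numbers. This is only a sketch: `page_urls` and `BASE_URL` are names invented here, and it assumes the site keeps accepting the `cpage` query parameter as shown above.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.dailydac.com/chapter-11-bankruptcy-alert-system/"


def page_urls(first_page, last_page, base_url=BASE_URL):
    """Return one URL per page number, from first_page to last_page inclusive."""
    return [
        f"{base_url}?{urlencode({'cpage': page})}"
        for page in range(first_page, last_page + 1)
    ]


urls = page_urls(1, 3)
for url in urls:
    print(url)
```

Note the `last_page + 1`: `range` excludes its upper bound, so without it the last requested page would be skipped.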

      Below, I declare a variable total_number_of_pages_to_scrape and set it to 7. Change it to whatever number of pages you need.

      Code:

      import time

      import pandas as pd
      from selenium import webdriver

      driver_path = "path to chromedriver.exe"
      driver = webdriver.Chrome(driver_path)
      driver.maximize_window()
      driver.implicitly_wait(50)
      total_number_of_pages_to_scrape = 7
      first_page = 1
      Name = []
      Date_ = []
      # range() excludes its upper bound, so this visits exactly
      # total_number_of_pages_to_scrape pages, starting at first_page.
      for page in range(first_page, first_page + total_number_of_pages_to_scrape):
          driver.get(f'https://www.dailydac.com/chapter-11-bankruptcy-alert-system/?cpage={page}')
          time.sleep(1)

          # Find the company-name and date cells of every row on this page
          CompanyName = driver.find_elements_by_xpath('/html/body/div[1]/div[3]/div[2]/section/section/table/tbody/tr/td[4]')
          Date = driver.find_elements_by_xpath('/html/body/div[1]/div[3]/div[2]/section/section/table/tbody/tr/td[1]')

          # Append to the lists defined before the loop so earlier pages are kept
          for i in range(len(CompanyName)):
              print(CompanyName[i].text)
              Name.append(CompanyName[i].text)

          for i in range(len(Date)):
              print(Date[i].text)
              Date_.append(Date[i].text)

      driver.quit()

      # Build the DataFrame once, after all pages have been collected
      data = pd.DataFrame({'Name': Name, 'Date_': Date_})
      data
      

      Output:

      3rd Rock Logistics, LLC
      3rd Rock Holdings, Inc.
      Philippine Airlines, Inc.
      Bennett Rosa, LLC
      James David Theros
      Mich's Maccs, LLC
      RECON MEDICAL, LLC
      James David Theros
      Joseph Smart, Jr
      RECON MEDICAL, LLC
      U.S. Capital Investments LLC
      County Investment L.P.
      Massood Danesh Pajooh
      Long Valley Real Estate LLC
      Heilongjiang Barn, LLC
      Verdant Holdings, LLC
      Anthony Narancic
      Specialty Orthopedic Group Tennessee, PLLC
      Verdant Holdings, LLC
      Affordable Concrete, LLC
      San Diego Taco Company, Inc.
      Long Valley Real Estate LLC
      Sepideh Sally Cirino
      Heilongjiang Barn, LLC
      Terra Santa, Inc.
      BESTHOST INN LLC
      MOVIMIENTO PENTECOSTAL APOSTOLICO CRISTIANO, INCOR
      Terra Santa, Inc.
      Rickenbaker Gin, Inc.
      Results Fitness, LLC
      Roark & Associates LLC
      Bexar County Properties, A Cal Ltd PSHIP
      Jean Pierre Rwigema
      Alibaba's Terrace Inc.
      Scott Doren Spinner and Alicia Margarita Spinner
      ARK Innovations Limited Liability Company
      Thai Stk, Inc.
      Di-Chem and Quality Technology, LLC
      Alibaba's Terrace Inc.
      Scott Doren Spinner and Alicia Margarita Spinner
      SPL Partners LLC
      A Thumbs Up Inc.
      Di-Chem and Quality Technology, LLC
      Rockworx, Inc.
      Amado Amado Salon & Body Corp.
      Dean A. Ditmar and Kelly E. Ditmar
      SHURWEST, LLC
      ARK Innovations Limited Liability Company
      B N Empire, LLC
      GBT Promotions LLC
      09/03/2021
      09/03/2021
      09/03/2021
      09/03/2021
      09/03/2021
      09/03/2021
      09/03/2021
      09/03/2021
      09/03/2021
      09/03/2021
      09/03/2021
      09/03/2021
      09/03/2021
      09/03/2021
      09/03/2021
      09/02/2021
      09/02/2021
      09/02/2021
      09/02/2021
      09/02/2021
      09/02/2021
      09/02/2021
      09/02/2021
      09/02/2021
      09/02/2021
      09/01/2021
      09/01/2021
      09/01/2021
      09/01/2021
      09/01/2021
      09/01/2021
      09/01/2021
      09/01/2021
      09/01/2021
      09/01/2021
      09/01/2021
      09/01/2021
      09/01/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      SBG Universe Brands, LLC
      SBG-Gaiam Holdings, LLC
      Gaiam Americas, Inc.
      Gaiam Brand Holdco, LLC
      Joe's Holdings, LLC
      LNT Brands, LLC
      American Sporting Goods Corp.
      The Basketball Marketing Company, Inc.
      Galaxy Brands, LLC
      SBG FM, LLC
      Brand Matter, LLC
      Heeling Sports Limited
      William Rast Licensing, LLC
      Sequential Licensing, Inc.
      SQBG, Inc.
      Itzhak Meir Shtark and Ayala Shtark
      Heather Karen Therese Stanley-Christian
      SQBG, Inc.
      Sequential Brands Group, Inc.
      Newstream Hotels and Hospitality, LLC
      Newstream Hotel Partners-ABQ, LP
      Sumak Kawsay LLC
      Greg and Alice Logging, Inc
      B N Empire, LLC
      Itzhak Meir Shtark and Ayala Shtark
      Heather Karen Therese Stanley-Christian
      GGS Pizzeria, Inc.
      Herman Alex Molina
      GGS Pizzeria, Inc.
      BL Santa Fe (Mezz), LLC
      BL Santa Fe, LLC
      Herman Alex Molina
      St. Croix Custom Pools, L.L.C.
      Linda Marie Kingsbury
      Sid Boys Corp.
      Phoenix Roofing & Construction FL, Inc.
      Sid Boys Corp.
      Kay W Eubanks
      Tukhi Business Group, LLC
      Roberto C. Hernandez
      Roberto C. Hernandez
      Michael Zollicoffer, MD PA
      Kay W Eubanks
      World Service West/LA Inflight Service Company LLC
      John E Mayer
      NESV Tennis, LLC
      RHCSC Gainesville Health Holdings LLC
      RHCSC Gainesville Health Holdings LLC
      RHCSC Gainesville AL Holdings LLC
      RHCSC Gainesville AL Holdings LLC
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/31/2021
      08/30/2021
      08/30/2021
      08/30/2021
      08/30/2021
      08/30/2021
      08/30/2021
      08/30/2021
      08/30/2021
      08/30/2021
      08/30/2021
      08/30/2021
      08/30/2021
      08/30/2021
      08/30/2021
      08/29/2021
      08/28/2021
      08/28/2021
      08/28/2021
      08/28/2021
      08/27/2021
      08/27/2021
      08/27/2021
      08/27/2021
      08/27/2021
      08/27/2021
      08/27/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      Johnson + Associates Architects Inc.
      NESV Swim, LLC
      NESV Land East, LLC
      NESV Land, LLC
      NESV Hotel, LLC
      NESV Field, LLC
      NESV Ice, LLC
      RHCSC Savannah AL Holdings LLC
      RHCSC Montgomery II Health Holdings LLC
      RHCSC Montgomery II AL Holdings LLC
      RHCSC Montgomery I Health Holdings LLC
      RHCSC Social Circle Health Holdings LLC
      RHCSC Social Circle AL Holdings LLC
      RHCSC Savannah Health Holdings LLC
      RHCSC Social Circle Health Holdings LLC
      RHCSC Social Circle AL Holdings LLC
      RHCSC Savannah Health Holdings LLC
      RHCSC Savannah AL Holdings LLC
      RHCSC Montgomery II Health Holdings LLC
      RHCSC Montgomery II AL Holdings LLC
      RHCSC Montgomery I Health Holdings LLC
      RHCSC Montgomery I AL Holdings LLC
      RHCSC Gainesville Health Holdings LLC
      RHCSC Gainesville AL Holdings LLC
      RHCSC Douglas Health Holdings LLC
      RHCSC Douglas AL Holdings LLC
      RHCSC Columbus Health Holdings LLC
      RHCSC Columbus AL Holdings LLC
      Regional Housing & Community Services Corporation
      RHCSC Rome Health Holdings LLC
      RHCSC Rome AL Holdings LLC
      488 East 98 LLC
      488 East 98 LLC
      Brandy Houston Caldwell
      22 Anchor, LLC
      WSCE Corp.
      WSCE Corp.
      Brandy Houston Caldwell
      E.L. Services, Inc.
      Monterey Mountain Property Management, LLC
      PSG Mortgage Lending Corp., a Delaware Corporation
      22 Anchor, LLC
      22 Anchor LLC
      Jay Arthur Coakley, Sr.
      The Odyssey at Paterson, LLC
      TIX4TONIGHT, LLC
      TIX CORPORATION
      Bear Valley Ranch Market & Liquor Inc
      Bear Valley Ranch Market & Liquor Inc
      Thomas, Scott A.
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/26/2021
      08/25/2021
      08/25/2021
      08/25/2021
      08/25/2021
      08/25/2021
      08/25/2021
      08/25/2021
      08/25/2021
      08/25/2021
      08/25/2021
      08/25/2021
      08/25/2021
      08/24/2021
      08/24/2021
      08/24/2021
      Nabil I Haddad and Peggy Haddad
      HTP, Inc.
      Brain Energy Holdings LLC
      Scenic 30A Investments, LLC
      Brain Energy Holdings LLC
      Jay Arthur Coakley, Sr.
      The Odyssey at Paterson, LLC
      AA Varela Properties LLC.
      JINZHENG GROUP (USA) LLC
      Westmount Group, Inc.
      Urquhart, LLC
      Advanced Tissue, LLC
      Alisha, LLC
      Daniel R. Roubein MD PA
      Daniel R. Roubein
      AA Varela Properties LLC.
      Brown Industries, Inc.
      Broster JD LLC
      HLH Timber Company LLC
      W. E. McDonald & Son, LLC
      Kenneth Olakunle Shobola
      Bridgeport Health Care Realty Co.
      Live Well Medical Centers Orlando LLC
      Line Marie Martin
      Live Well Medical Centers Orlando LLC
      Silver Plaza, LLLP
      Concrete Pavers Inc.
      Line Marie Martin
      JANA, LLC
      Semoran Pines Phase II Condominium Association, In
      Benson Property Investment Corp
      Charles A. Izzo
      Lynn Marie Lotz and Charles Edward Lotz
      City Communications, Inc.
      Treasures and Gems, Ltd
      Dane Heating & Air Conditioning, Inc.
      Semoran Pines Phase II Condominium Association, In
      OLCAN III Properties LLC
      One New Alliance, LLC
      One New Alliance, LLC
      Khosro V Farahani
      The WOW Bar, LLC
      Khosro V. Farahani
      Palace Theater, LLC
      Basic ESA, Inc. Jointly Administered under 21-90002.
      Agua Libre Midstream LLC Jointly Administered under 21-90002.
      Agua Libre Asset Co LLC Jointly Administered under 21-90002.
      Agua Libre Holdco LLC Jointly Administered under 21-90002.
      SCH Disposal, L.L.C. Jointly Administered under 21-90002.
      Taylor Industries, LLC Jointly Administered under 21-90002.
      08/24/2021
      08/24/2021
      08/24/2021
      08/24/2021
      08/24/2021
      08/24/2021
      08/24/2021
      08/24/2021
      08/24/2021
      08/23/2021
      08/23/2021
      08/23/2021
      08/23/2021
      08/23/2021
      08/23/2021
      08/23/2021
      08/20/2021
      08/20/2021
      08/20/2021
      08/20/2021
      08/20/2021
      08/20/2021
      08/20/2021
      08/20/2021
      08/19/2021
      08/19/2021
      08/19/2021
      08/19/2021
      08/19/2021
      08/19/2021
      08/19/2021
      08/18/2021
      08/18/2021
      08/18/2021
      08/18/2021
      08/18/2021
      08/18/2021
      08/18/2021
      08/18/2021
      08/18/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      Basic Energy Services LP, LLC Jointly Administered under 21-90002.
      Basic Energy Services GP, LLC Jointly Administered under 21-90002.
      Indigo Injection #3, LLC Jointly Administered under 21-90002.
      KVS Transportation, Inc. Jointly Administered under 21-90002.
      C&J Well Services, Inc. Jointly Administered under 21-90002.
      Basic Energy Services, L.P. Jointly Administered under 21-90002.
      Joseph L Sanders
      Ronald Lee Moore
      MAIN STREET INVESTMENTS III, LLC.
      C&C Construction and Management LLC
      Basic ESA, Inc.
      Agua Libre Midstream LLC
      Agua Libre Asset Co LLC
      Agua Libre Holdco LLC
      SCH Disposal, L.L.C.
      Taylor Industries, LLC
      Basic Energy Services LP, LLC
      Basic Energy Services GP, LLC
      Indigo Injection #3, LLC
      KVS Transportation, Inc.
      C&J Well Services, Inc.
      Basic Energy Services, Inc.
      Basic Energy Services, L.P.
      Mitchell K Cohen
      MAIN STREET INVESTMENTS III, LLC.
      Rosie's, LLC
      Rosie's, LLC
      Palace Theater, LLC
      MAIN STREET INVESTMENTS III, LLC.
      Deyo Transportation Services, LLC
      Mitchell K Cohen
      ALARMAS COMPUTARIZADAS, INC
      Mitchell K Cohen
      Anwarul Islam Chunnu
      D&G Construction Dean Gonzalez, LLC
      D&G Construction Dean Gonzalez LLC
      D&G Construction Dean Gonzalez LLC
      Shilo Inn, Warrenton, LLC
      Shilo Inn, Bend, LLC
      Shilo Inn, Warrenton, LLC
      Shilo Inn, Bend, LLC
      Synrgy Corp., a Nevada corporation
      Acembly, Inc., a Delaware corporation
      Synrgy Corp., a Nevada corporation
      Acembly, Inc., a Delaware corporation
      Wirta Hotels LLC
      Damico's, LLC
      Just Relax Massage and Spa, LLC
      New Holland, LLC
      New Holland, LLC
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/17/2021
      08/16/2021
      08/16/2021
      08/16/2021
      08/16/2021
      08/16/2021
      08/16/2021
      08/16/2021
      08/16/2021
      08/16/2021
      08/16/2021
      08/14/2021
      08/14/2021
      08/14/2021
      08/14/2021
      08/13/2021
      08/13/2021
      08/13/2021
      08/13/2021
      08/13/2021
      08/13/2021
      08/13/2021
      08/13/2021
      08/13/2021
      Wirta Hotels, LLC
      Piz Family Deli, Inc.
      DREAM DUFFEL, LLC
      Regina Cargullo Ventura
      Riverrock Recycling & Crushing, LLC
      Eztopeliz, LLC
      HPE Transportation LLC
      CARE SHARE MANAGER CORP
      1106 Montello LLC
      Energy Enterprises USA Inc. dba Canopy Energy
      Wasatch Co.
      Wasatch Co.
      EZTOPELIZ, LLC
      HPE Transportation LLC
      Vikram Srinivasan
      Rickert Landscaping, Inc.
      Moon Wholesale, Inc.
      Moon Site Management, Inc.
      Moon Nurseries, Inc.
      Moon Landscaping, Inc.
      Moon Group, Inc.
      1106 Montello LLC
      Moon Group, Inc.
      Luis Arevalo
      Luis Arevalo
      Spectrum Link, Inc.
      S & N Property, L.L.C.
      SOCAL MRO LLC
      Doris E Melendez
      Bristol Properties LLC
      Offcamber, LLC
      Doris E Melendez
      S & N Property, L.L.C.
      Mulato Green Group, LLP
      Mr. Camper, L.L.C. d/b/a Yogi Bear's Jellysto
      Frank Bernarducci
      Frank Bernarducci
      May Contracting, Inc.
      Gagik Sargsyan
      Felisha S. Cottrell
      926 Ventura Avenue LLC
      Gagik Sargsyan
      Travis Ryan Lee
      EDUCATIONAL TECHNICAL COLLEGE INC
      Innerline Engineering, Inc.
      Secondwave Corporation
      Secondwave Corporation
      Piotr M Gawron
      Innerline Engineering, Inc.
      CHRISTINE LOUISE LAZNICKA and ANTHONY ROBERT LAZNICKA
      08/13/2021
      08/13/2021
      08/13/2021
      08/13/2021
      08/13/2021
      08/13/2021
      08/13/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/12/2021
      08/11/2021
      08/11/2021
      08/11/2021
      08/11/2021
      08/11/2021
      08/11/2021
      08/11/2021
      08/11/2021
      08/11/2021
      08/11/2021
      08/11/2021
      08/10/2021
      08/10/2021
      08/10/2021
      08/10/2021
      08/10/2021
      08/10/2021
      08/10/2021
      08/10/2021
      08/10/2021
      08/09/2021
      08/09/2021
      08/09/2021
      08/09/2021
      

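      【Comments】:

      • Hello sir, I tried this, but it only gives the results from page 3.
      • Do you mean it can scrape up to page 3, but not beyond that?
      • Each page has 50 results. I used total_number_of_pages_to_scrape = 4 and first_page = 1, but it gave me only the 50 results from page 3.
      • It should scrape the first page first and then continue up to the page specified in the variable mentioned above.
      • But it does not run that way. Can you help?
      【Solution 3】: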

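      The page does not depend on JavaScript, so Selenium is not needed. You can do this with just requests and a pandas DataFrame. I built the pagination into the list of start URLs; widen or narrow the page range as needed.

      My code: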

      import requests
      import pandas as pd
      
      headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36'}
      
      urls = ["https://www.dailydac.com/chapter-11-bankruptcy-alert-system/?cpage="+str(x)+"" for x in range(1,10)]
      for url in urls:
          req = requests.get(url,headers=headers)
      
          wiki_table = pd.read_html(req.text, attrs = {"class":"filings-table"} )
      
          df = wiki_table[0]
      
          print(df)
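
If you only need the date and company columns, or pandas.read_html is not an option, the same static table can also be read with the standard library alone. The sketch below swaps in `html.parser` for `read_html`; it is not what the code above does. `FilingsTableParser`, `extract_date_and_company` and `SAMPLE_HTML` are names invented here, and the sample markup (including the placeholder middle columns "x" and "y") is an assumption about the table layout, based only on the `filings-table` class and the td[1]/td[4] positions used in this thread.

```python
from html.parser import HTMLParser


class FilingsTableParser(HTMLParser):
    """Collect the <td> texts of every row inside <table class="filings-table">."""

    def __init__(self):
        super().__init__()
        self.in_table = False
        self.in_cell = False
        self.current_row = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "table" and "filings-table" in classes:
            self.in_table = True
        elif self.in_table and tag == "tr":
            self.current_row = []
        elif self.in_table and tag == "td":
            self.in_cell = True
            self.current_row.append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif self.in_table and tag == "td":
            self.in_cell = False
        elif self.in_table and tag == "tr" and self.current_row:
            # Header rows contain only <th>, so they stay empty and are skipped
            self.rows.append([cell.strip() for cell in self.current_row])
            self.current_row = []

    def handle_data(self, data):
        if self.in_cell:
            self.current_row[-1] += data


def extract_date_and_company(html):
    parser = FilingsTableParser()
    parser.feed(html)
    # XPath td[1] / td[4] are 1-based; Python indexes are 0-based
    return [(row[0], row[3]) for row in parser.rows if len(row) >= 4]


SAMPLE_HTML = """
<table class="filings-table">
  <thead><tr><th>Date</th><th>A</th><th>B</th><th>Company</th></tr></thead>
  <tbody>
    <tr><td>09/03/2021</td><td>x</td><td>y</td><td>3rd Rock Logistics, LLC</td></tr>
    <tr><td>09/03/2021</td><td>x</td><td>y</td><td>3rd Rock Holdings, Inc.</td></tr>
  </tbody>
</table>
"""
records = extract_date_and_company(SAMPLE_HTML)
print(records)
```

In real use, `html` would be `req.text` from the loop above, and the tuples from every page would be appended to one list defined before the loop.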
      

      【Comments】:
